Mailing List Archive



Re: [tlug] Unicode/ICU question about joining lines




On Thu, 12 Aug 2021 at 12:02, Raymond Wan <rwan.kyoto@example.com> wrote:
On Wed, Aug 11, 2021 at 4:28 PM Travis Cardwell wrote:
> My problem is straightforward.  Given a string containing a paragraph of
> text with "soft" line breaks, I want to output a string containing the
> text without line breaks.  The way that lines are joined depends on the
> language.  Many languages such as English require spaces, while many
> languages such as Japanese do not use spaces.

Does the language need to be detected by analysing the text?

Or can the language be an input parameter?

If it can be an input parameter, this is quite trivial.

For languages without whitespace separation between words:
Simply process the input to delete the soft line breaks.

For languages with whitespace separation between words:
Simply process the input to replace each soft line break with a single
space, UNLESS the soft break is already preceded or followed by whitespace,
in which case you can simply delete the soft break.
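
As a rough sketch of the two rules above in Python: this assumes that a soft
line break is a plain "\n" and that the language arrives as a short code such
as "ja" or "en". The name join_lines and the NO_SPACE_LANGS set are made up
for illustration and are nowhere near a complete list of languages.

```python
# Hypothetical, incomplete set of language codes written without spaces
# between words. Extend it to suit your data.
NO_SPACE_LANGS = {"ja", "zh"}


def join_lines(text: str, lang: str) -> str:
    """Join soft-wrapped lines according to the rules for the given language."""
    lines = text.split("\n")  # assumption: a soft break is a bare "\n"

    # Languages without whitespace between words: just delete the breaks.
    if lang in NO_SPACE_LANGS:
        return "".join(lines)

    # Languages with whitespace between words: replace each break with one
    # space, unless whitespace already sits on either side of the break.
    out = lines[0]
    for line in lines[1:]:
        if not out or not line or out[-1].isspace() or line[0].isspace():
            out += line  # nothing to separate, or whitespace already there
        else:
            out += " " + line
    return out


if __name__ == "__main__":
    print(join_lines("Many languages\nrequire spaces.", "en"))
    print(join_lines("これは\nテストです。", "ja"))
```

Text that mixes scripts within one paragraph would need a decision per break
rather than per paragraph, which this sketch does not attempt.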

If you need to detect the language, you might want to do this as a
preparatory step before doing the above.

