Mailing List Archive


Re: [tlug] Unicode/ICU question about joining lines



> I have a Unicode question, and I am posting to this mailing list because
> it appears that the Lingo list is not used these days.

The [unicode] and/or [icu] tags on StackOverflow might help. A quick
scan suggests there are more questions about going the other way, though.

> My problem is straightforward.  Given a string containing a paragraph of
> text with "soft" line breaks, I want to output a string containing the
> text without line breaks.  The way that lines are joined depends on the
> language.  Many languages such as English require spaces, while many
> languages such as Japanese do not use spaces.

Do you need to handle all possible languages? I'd probably start with a
rule that joins with "" (nothing) when a line ends in a character from
the CJK blocks or in "-", use " " as the default for all other
characters, and adapt that as people complain.
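
A rough sketch of that starting rule in Python, in case it helps. The
function names and the list of code point ranges are my own assumptions
(a handful of common Japanese/CJK blocks, nowhere near complete), so
treat it as something to adapt rather than a finished answer:

```python
# Assumed, incomplete set of ranges treated as "CJK" for this sketch.
CJK_RANGES = (
    (0x3000, 0x303F),  # CJK Symbols and Punctuation
    (0x3040, 0x309F),  # Hiragana
    (0x30A0, 0x30FF),  # Katakana
    (0x4E00, 0x9FFF),  # CJK Unified Ideographs
    (0xFF00, 0xFFEF),  # Halfwidth and Fullwidth Forms
)


def is_cjk(ch):
    """Return True if ch falls in one of the assumed CJK ranges."""
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in CJK_RANGES)


def join_soft_breaks(text):
    """Join soft-wrapped lines: "" after CJK or "-", " " otherwise."""
    # Strip each line and drop the ones that end up empty.
    lines = [line.strip() for line in text.splitlines()]
    lines = [line for line in lines if line]
    if not lines:
        return ""
    out = lines[0]
    for line in lines[1:]:
        last = out[-1]  # last character of the previous line
        sep = "" if (is_cjk(last) or last == "-") else " "
        out += sep + line
    return out
```

Note that this only looks at the end of the previous line, and that a
trailing "-" is kept as-is, so "well-" + "known" gives "well-known".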

But if your source text contains hyphens at the end of lines, people
will start complaining very quickly. Knowing the difference between a
non-hyphenated word that got split with a hyphen, and a hyphenated word
that got split at its hyphen, is a big jump in complexity.
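
To show where the complexity comes from, here is one possible heuristic,
sketched in Python. It assumes you have a word list to check against;
resolve_hyphen and known_words are hypothetical names of mine, not
something from ICU:

```python
def resolve_hyphen(left, right, known_words):
    """Decide how to rejoin a word that was split after a hyphen.

    left is the fragment before the line-ending hyphen and right is the
    fragment that starts the next line. If the two fragments form a word
    in known_words, the hyphen is assumed to be a line-break artifact
    and is dropped; otherwise it is kept.
    """
    joined = left + right
    if joined.lower() in known_words:
        return joined
    return left + "-" + right
```

With known_words = {"complexity"}, "com-" + "plexity" becomes
"complexity" and "well-" + "known" stays "well-known". But a pair like
"re-" + "cover" is ambiguous ("recover" and "re-cover" are both real),
and this heuristic will always pick "recover".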

Darren

