Mailing List Archive


Re: [tlug] Unicode/ICU question about joining lines



Hi Travis,


On Wed, Aug 11, 2021 at 4:28 PM Travis Cardwell
<travis.cardwell@example.com> wrote:
> My problem is straightforward.  Given a string containing a paragraph of
> text with "soft" line breaks, I want to output a string containing the
> text without line breaks.  The way that lines are joined depends on the
> language.  Many languages such as English require spaces, while many
> languages such as Japanese do not use spaces.


I'm not sure what you mean here.  Do you mean that, given a paragraph
of text with line breaks, you want to (1) join the lines into one long
line and then (2) break it up again in a language-dependent way that
"makes sense" to people?


> It seems like such functionality should exist.  Perhaps there is a
> different ICU property that can be used?  What am I missing?  Does
> anybody know of a better and/or easier way to solve this problem?

If so, then for English, I guess you wouldn't simply break sentences
at whitespace.  Instead, you might want to use punctuation marks as a
guide (e.g., break after a semicolon, comma, or full stop).  I guess
you want to preserve phrases like "in the park" and not break after
"the"?

While Japanese doesn't use spaces, I guess you also want to break a
Japanese sentence in the same way, for example by using punctuation
marks as a guide.
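
To make the join-then-break idea concrete, here is a rough Python
sketch.  It rests on assumptions of mine, not on anything ICU provides:
the function names join_lines and break_after_punctuation are made up
for illustration, the join step guesses "no space" whenever either
neighbouring character has East Asian Width W or F, and the set of
break punctuation is just the handful of marks mentioned above.  It
needs Python 3.7 or later, because it relies on re.split splitting on
empty matches.

```python
import re
from unicodedata import east_asian_width

# Hypothetical choice of break points: ASCII comma, semicolon and full
# stop, plus the ideographic comma and full stop used in Japanese.
BREAK_PATTERN = re.compile(r"(?<=[,;.、。])\s*")


def join_lines(text):
    """Join soft-wrapped lines into one line.

    Heuristic (an assumption, not a Unicode rule): insert a space at a
    join unless the character on either side is East Asian Wide or
    Fullwidth, in which case the lines are joined directly.
    """
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    if not lines:
        return ""
    out = lines[0]
    for ln in lines[1:]:
        wide = (east_asian_width(out[-1]) in ("W", "F")
                or east_asian_width(ln[0]) in ("W", "F"))
        out += ("" if wide else " ") + ln
    return out


def break_after_punctuation(line):
    """Split a line after each break punctuation mark.

    Whitespace following the mark is dropped; empty pieces are removed.
    """
    return [part for part in BREAK_PATTERN.split(line) if part]


if __name__ == "__main__":
    english = "We walked for a while,\nthen sat in the\npark. It was quiet."
    japanese = "今日は天気が\nいいので、公園に\n行きます。"
    for sample in (english, japanese):
        for piece in break_after_punctuation(join_lines(sample)):
            print(piece)
```

This is deliberately naive: it would also break inside "3.14" or after
an abbreviation like "e.g.", and mixed-script joins (say, a Latin word
followed by kana) are decided purely by the width heuristic.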

But if you want to do it well, I think for both languages, you should
try to put each sentence through a Natural Language Processing engine
so that you can isolate all the noun phrases, verb phrases, etc.  And
then break after each unit.

So...I don't think this is related to ICU but maybe I've misunderstood
your problem?

Ray

