Mailing List Archive



Re: [tlug] Unicode/ICU question about joining lines



Hi Ray!

On Thu, Aug 12, 2021 at 12:06 PM Raymond Wan wrote:
> I'm not sure what you mean here.  Do you mean, given a paragraph of
> text with line breaks, you want to (1) join them all up into one long
> line and then (2) break it up in a language-dependent way that "makes
> sense" to people?

In my attempt to keep my question concise, I was not very clear.  Sorry
about that!  Please see my reply to Jim, in which I provide some
examples.

I want to join multiple lines of a "wrapped" paragraph into a single
long line.  Since some languages separate words with spaces while others
do not, this process is language-dependent.
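
As a rough illustration of why the join is language-dependent, here is a
minimal sketch in Python.  It does not use ICU.  It assumes a simple
heuristic: insert a space at a line boundary unless the characters on
both sides are East Asian wide or fullwidth.  The names `is_wide` and
`join_wrapped` are made up for this example, and the heuristic is known
to be wrong for some text (Korean, for instance, uses wide characters
but separates words with spaces).

```python
import unicodedata


def is_wide(ch: str) -> bool:
    """Return True for East Asian Wide (W) or Fullwidth (F) characters."""
    return unicodedata.east_asian_width(ch) in ("W", "F")


def join_wrapped(lines):
    """Join the lines of one wrapped paragraph into a single line.

    Assumption (a heuristic, not an ICU rule): a space is inserted at a
    line boundary unless both neighbouring characters are wide.
    """
    parts = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # ignore empty lines inside the paragraph
        # parts[-1] is always the previous non-empty line at this point.
        if parts and not (is_wide(parts[-1][-1]) and is_wide(line[0])):
            parts.append(" ")
        parts.append(line)
    return "".join(parts)


english = ["I want to join multiple lines", "of a wrapped paragraph."]
japanese = ["これは日本語の", "文章です。"]

print(join_wrapped(english))
print(join_wrapped(japanese))
```

The English lines are joined with a space, while the Japanese lines are
concatenated directly.  A line boundary between a wide character and a
Latin letter still gets a space under this heuristic, which may or may
not be what one wants.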

> If so, then for English, I guess you wouldn't simply break sentences
> using whitespaces.  Instead, you might want to use punctuation marks
> (i.e., break after the semicolon, comma, and full stop) as a guide.  I
> guess you want to preserve phrases like "in the park" and not break
> after "the"?
>
> While Japanese doesn't have spaces, I guess you also want to break a
> Japanese sentence in the same way.  Using punctuation marks as a
> guide, for example.
>
> But if you want to do it well, I think for both languages, you should
> try to put each sentence through a Natural Language Processing engine
> so that you can isolate all the noun phrases, verb phrases, etc.  And
> then break after each unit.
>
> So...I don't think this is related to ICU but maybe I've misunderstood
> your problem?

Breaking text is not what I am trying to do; I want to join lines, not
split them.  One could indeed use NLP for better-quality results, but
note that ICU already supports breaking text (boundary analysis):

https://unicode-org.github.io/icu/userguide/boundaryanalysis/
http://www.unicode.org/reports/tr14/
http://www.unicode.org/reports/tr29/

Cheers,

Travis

