TLUG Mailing List

Re: [tlug] Unicode/ICU question about joining lines

Date: Thu, 12 Aug 2021 12:36:50 +0900

From: Travis Cardwell <travis.cardwell@example.com>

Subject: Re: [tlug] Unicode/ICU question about joining lines

References: <CACaJP_QGLoO=qFPSQYUFp3PvZy7O7PTEBFvjBaom4-vPuHZLmw@mail.gmail.com> <CAAhy3dufsFDgNaF0V5yq0-VxKSyC5kkF4kF7dLabnYKk8o67rQ@mail.gmail.com>
Hi Ray!

On Thu, Aug 12, 2021 at 12:06 PM Raymond Wan wrote:
> I'm not sure what you mean here.  Do you mean, given a paragraph of
> text with line breaks, you want to (1) join them all up into one long
> line and then (2) break it up in a language-dependent way that "makes
> sense" to people?

In my attempt to keep my question concise, I was not very clear.  Sorry
about that!  Please see my reply to Jim, in which I provide some
examples.

I want to join multiple lines of a "wrapped" paragraph into a single
long line.  Since some languages separate words with spaces while others
do not, this process is language-dependent.
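
To make the language dependence concrete, here is a minimal sketch in Python.
It is only a naive heuristic and not an ICU-based solution: it assumes that a
space belongs at a line break unless the characters on both sides of the break
are East Asian wide or fullwidth characters.  The names `is_wide` and
`join_lines` are hypothetical, chosen for illustration.

```python
import unicodedata


def is_wide(ch: str) -> bool:
    """Return True if Unicode classifies ch as East Asian Wide or Fullwidth."""
    return unicodedata.east_asian_width(ch) in ("W", "F")


def join_lines(lines):
    """Join the lines of a wrapped paragraph into a single line.

    Assumption (naive heuristic): insert a space at a line break unless
    both neighbouring characters are wide, as in Japanese or Chinese text.
    """
    result = ""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip empty lines
        if result and not (is_wide(result[-1]) and is_wide(line[0])):
            result += " "
        result += line
    return result


print(join_lines(["I want to join", "multiple lines."]))
print(join_lines(["これは日本語の", "文章です。"]))
```

A heuristic like this gets many cases wrong.  It knows nothing about scripts
such as Thai, which also do not separate words with spaces but are not
classified as wide, and it ignores hyphenation at the end of a line.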

> If so, then for English, I guess you wouldn't simply break sentences
> using whitespaces.  Instead, you might want to use punctuation marks
> (i.e., break after the semicolon, comma, and full stop) as a guide.  I
> guess you want to preserve phrases like "in the park" and not break
> after "the"?
>
> While Japanese doesn't have spaces, I guess you also want to break a
> Japanese sentence in the same way.  Using punctuation marks as a
> guide, for example.
>
> But if you want to do it well, I think for both languages, you should
> try to put each sentence through a Natural Language Processing engine
> so that you can isolate all the noun phrases, verb phrases, etc.  And
> then break after each unit.
>
> So...I don't think this is related to ICU but maybe I've misunderstood
> your problem?

Breaking text is not what I am trying to do.  One could indeed use NLP
for better quality results, but note that breaking text is supported by
ICU.

https://unicode-org.github.io/icu/userguide/boundaryanalysis/
http://www.unicode.org/reports/tr14/
http://www.unicode.org/reports/tr29/

Cheers,

Travis