Mailing List Archive
Re: [tlug] Unicode/ICU question about joining lines
- Date: Thu, 12 Aug 2021 12:36:50 +0900
- From: Travis Cardwell <travis.cardwell@example.com>
- Subject: Re: [tlug] Unicode/ICU question about joining lines
- References: <CACaJP_QGLoO=qFPSQYUFp3PvZy7O7PTEBFvjBaom4-vPuHZLmw@mail.gmail.com> <CAAhy3dufsFDgNaF0V5yq0-VxKSyC5kkF4kF7dLabnYKk8o67rQ@mail.gmail.com>
Hi Ray!

On Thu, Aug 12, 2021 at 12:06 PM Raymond Wan wrote:
> I'm not sure what you mean here. Do you mean, given a paragraph of
> text with line breaks, you want to (1) join them all up into one long
> line and then (2) break it up in a language-dependent way that "makes
> sense" to people?

In my attempt to keep my question concise, I was not very clear.
Sorry about that! Please see my reply to Jim, in which I provide
some examples.

I want to join multiple lines of a "wrapped" paragraph into a single
long line. Since some languages separate words with spaces while
others do not, this process is language-dependent.

> If so, then for English, I guess you wouldn't simply break sentences
> using whitespaces. Instead, you might want to use punctuation marks
> (i.e., break after the semicolon, comma, and full stop) as a guide. I
> guess you want to preserve phrases like "in the park" and not break
> after "the"?
>
> While Japanese doesn't have spaces, I guess you also want to break a
> Japanese sentence in the same way. Using punctuation marks as a
> guide, for example.
>
> But if you want to do it well, I think for both languages, you should
> try to put each sentence through a Natural Language Processing engine
> so that you can isolate all the noun phrases, verb phrases, etc. And
> then break after each unit.
>
> So...I don't think this is related to ICU but maybe I've misunderstood
> your problem?

Breaking text is not what I am trying to do. One could indeed use NLP
for better quality results, but note that breaking text is supported
by ICU.

https://unicode-org.github.io/icu/userguide/boundaryanalysis/
http://www.unicode.org/reports/tr14/
http://www.unicode.org/reports/tr29/

Cheers,

Travis
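
To illustrate the joining problem described in this message, here is a
minimal Python sketch. It is not the approach from the thread and it does
not use ICU. It assumes a simple heuristic based on the Unicode East Asian
Width property, read through the standard `unicodedata` module: when the
characters on both sides of a line break are wide or fullwidth, the lines
are joined directly, otherwise a single space is inserted. The names
`is_wide` and `join_wrapped` are hypothetical.

```python
import unicodedata
from typing import Iterable


def is_wide(ch: str) -> bool:
    """Return True for East Asian Wide (W) or Fullwidth (F) characters."""
    return unicodedata.east_asian_width(ch) in ("W", "F")


def join_wrapped(lines: Iterable[str]) -> str:
    """Join the lines of one wrapped paragraph into a single line.

    Assumption (heuristic, not an ICU rule): no separator is needed when
    the characters on both sides of the seam are wide; otherwise the
    line break is replaced with one space.
    """
    # Trim surrounding whitespace and drop empty lines.
    parts = [line.strip() for line in lines]
    parts = [part for part in parts if part]
    if not parts:
        return ""

    result = parts[0]
    for part in parts[1:]:
        if is_wide(result[-1]) and is_wide(part[0]):
            result += part        # e.g. Japanese text: join directly
        else:
            result += " " + part  # e.g. English text: restore the space
    return result


english = join_wrapped(["The quick brown", "fox jumps."])
japanese = join_wrapped(["日本語の文章は", "空白を使わない。"])
mixed = join_wrapped(["東京で", "Linux を使う"])
```

This heuristic is only a rough approximation. For example, Hangul
syllables are also classified as wide, yet Korean separates words with
spaces, so the sketch would join Korean lines incorrectly. That is one
reason the problem is language-dependent rather than purely
character-dependent.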