Mailing List Archive
Re: [tlug] Unicode/ICU question about joining lines
- Date: Thu, 12 Aug 2021 10:52:45 +0800
- From: Raymond Wan <rwan.kyoto@example.com>
- Subject: Re: [tlug] Unicode/ICU question about joining lines
- References: <CACaJP_QGLoO=qFPSQYUFp3PvZy7O7PTEBFvjBaom4-vPuHZLmw@mail.gmail.com>
Hi Travis,

On Wed, Aug 11, 2021 at 4:28 PM Travis Cardwell <travis.cardwell@example.com> wrote:
> My problem is straightforward. Given a string containing a paragraph of
> text with "soft" line breaks, I want to output a string containing the
> text without line breaks. The way that lines are joined depends on the
> language. Many languages such as English require spaces, while many
> languages such as Japanese do not use spaces.

I'm not sure what you mean here. Do you mean, given a paragraph of text with line breaks, you want to (1) join them all up into one long line and then (2) break it up in a language-dependent way that "makes sense" to people?

> It seems like such functionality should exist. Perhaps there is a
> different ICU property that can be used? What am I missing? Does
> anybody know of a better and/or easier way to solve this problem?

If so, then for English, I guess you wouldn't simply break sentences at whitespace. Instead, you might want to use punctuation marks (i.e., break after a semicolon, comma, or full stop) as a guide. I guess you want to preserve phrases like "in the park" and not break after "the"?

While Japanese doesn't have spaces, I guess you also want to break a Japanese sentence in the same way, using punctuation marks as a guide, for example.

But if you want to do it well, I think for both languages, you should try to put each sentence through a Natural Language Processing engine so that you can isolate all the noun phrases, verb phrases, etc., and then break after each unit.

So... I don't think this is related to ICU, but maybe I've misunderstood your problem?

Ray
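
To make the two steps discussed above concrete, here is a minimal sketch in Python. It does not use ICU. It relies only on the standard library's `unicodedata.east_asian_width`, and everything else is an assumption made for illustration: the helper names `is_wide`, `join_lines`, and `break_after_punctuation` are hypothetical, the rule "no space between two wide characters" is a rough heuristic rather than a complete language-aware rule, and the `BREAK_AFTER` punctuation set is not exhaustive.

```python
import unicodedata
from typing import List

# Assumed, non-exhaustive set of punctuation marks to break after
# (ASCII marks plus their common Japanese counterparts).
BREAK_AFTER = ",;.!?、。！？"


def is_wide(ch: str) -> bool:
    """True if Unicode classifies ch as East Asian Wide or Fullwidth."""
    return unicodedata.east_asian_width(ch) in ("W", "F")


def join_lines(text: str) -> str:
    """Join soft-wrapped lines into one line.

    Heuristic (an assumption, not an ICU rule): insert a space at each
    join unless the characters on both sides are East Asian wide.
    """
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    if not lines:
        return ""
    out = lines[0]
    for ln in lines[1:]:
        sep = "" if is_wide(out[-1]) and is_wide(ln[0]) else " "
        out += sep + ln
    return out


def break_after_punctuation(text: str) -> List[str]:
    """Split text into chunks, each ending right after a punctuation mark."""
    chunks: List[str] = []
    current = ""
    for ch in text:
        current += ch
        if ch in BREAK_AFTER:
            chunks.append(current.strip())
            current = ""
    if current.strip():
        chunks.append(current.strip())  # trailing text with no final mark
    return chunks


if __name__ == "__main__":
    print(join_lines("We walked\nin the park."))
    print(join_lines("今日は公園を\n散歩しました。"))
    print(break_after_punctuation("We walked in the park, then went home."))
```

This simple splitter will also break inside abbreviations such as "e.g." and numbers such as "3.14", which is part of why a proper sentence or phrase segmenter would be the better tool for anything beyond a quick experiment.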
- Follow-Ups:
- Re: [tlug] Unicode/ICU question about joining lines
- From: Travis Cardwell
- Re: [tlug] Unicode/ICU question about joining lines
- From: Benjamin Kowarsch
- References:
- [tlug] Unicode/ICU question about joining lines
- From: Travis Cardwell