Mailing List Archive



Re: [tlug] Unicode/ICU question about joining lines



Travis Cardwell <travis.cardwell@example.com> wrote:
> My goal is to create a function that determines how to join the
> lines/fragments of text automatically, based on the content.  In my
> first post, I included some code that does this based on the Unicode
> block of neighboring characters.  This strategy works, but it requires
> classifying the many Unicode blocks, and I hoped that there is an
> easier way.

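Do you accept multilingual text? If not, a simple hack would be to look for
spaces in the input text and classify the language accordingly: text with
plenty of spaces is in a space-delimited language, so its lines are joined
with a space, while text with few or no spaces is joined directly. The
probability of misdetection should decrease exponentially with the input
length.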

Of course, even Japanese text sometimes contains spaces in practice, either
by mistake or as a kind of "scare quote" emphasis around words.
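
A rough sketch of what I mean, in Python. This is only an illustration, not
tested against real data: the names uses_spaces and join_lines are made up
here, and the 0.1 threshold is a guess on my part (English prose is usually
well above it, Japanese well below). Comparing the ratio of spaces against a
threshold, rather than checking for any space at all, is what tolerates the
occasional stray space.

```python
def uses_spaces(text: str, threshold: float = 0.1) -> bool:
    """Guess whether text is written in a space-delimited language.

    threshold is an assumed cutoff for the fraction of characters that
    are ASCII spaces; it is a guess, not a measured value.
    """
    chars = [c for c in text if c not in "\r\n"]
    if not chars:
        return True  # nothing to go on; default to joining with a space
    spaces = sum(1 for c in chars if c == " ")
    return spaces / len(chars) >= threshold


def join_lines(lines: list[str]) -> str:
    """Join line fragments, choosing the separator from the content."""
    fragments = [line.strip() for line in lines if line.strip()]
    # Classify using only spaces found inside the fragments themselves.
    sample = "".join(fragments)
    separator = " " if uses_spaces(sample) else ""
    return separator.join(fragments)


if __name__ == "__main__":
    print(join_lines(["The quick brown", "fox jumps over"]))
    print(join_lines(["これは 日本語の", "文章です。"]))
```

Very short inputs are where this breaks down: two lines that each hold a
single English word contain no spaces at all, so they would be joined without
one. Mixed-language input would still need something like the per-character
approach.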

