tlug.jp Mailing List Archive
Re: [tlug] Unicode/ICU question about joining lines
- Date: Thu, 12 Aug 2021 17:42:12 +0900
- From: Travis Cardwell <travis.cardwell@example.com>
- Subject: Re: [tlug] Unicode/ICU question about joining lines
- References: <CACaJP_QGLoO=qFPSQYUFp3PvZy7O7PTEBFvjBaom4-vPuHZLmw@mail.gmail.com> <CABHGxq5Ma58RaaruwK6x+5o_vBh66fkjHnTjWYpaMC5FYOgOTg@mail.gmail.com> <CACaJP_RQusGfT6BFPP2SDh9oHtd3Y34eEmiHNdJArhwKbg4EdQ@mail.gmail.com> <CABHGxq5YcO1HPOOA_8yuhEmKeubNUfn8cS35OuB3MLMTamjYBQ@mail.gmail.com> <CACaJP_TvuwMQ3T6zqVxmgHeqzAV+i=hdeykq7EgmDayN4GNNvQ@mail.gmail.com> <YRTRNBM86fgWvvcu@telephonic.cynic.net>
Hi Curt!

On Thu, Aug 12, 2021 at 5:01 PM Curt J. Sampson wrote:
> I've thought on this a bit and I think you're wrong that Unicode would have
> anything to say about this. The main issue is that the use of whitespace is
> a language-specific issue and Unicode _does not deal with language issues
> or even markup_, only character encoding issues.

With respect to the Unicode encoding, this is true. The Unicode Character
Database (UCD) is also part of the Unicode Standard, however, and it does
deal with languages. It defines metadata (properties) for characters, and
the International Components for Unicode (ICU) library provides an API that
uses these properties to implement functions needed for software I18N, not
just encoding. For example, ICU includes translations, dictionary lookup
functionality (required for correct segmentation in some languages), etc.

> This is most obvious in the Han unification[1] of CJK ideographs, but is
> even in western languages if you think about it: we use the same \u0065 'e'
> for all Latin-script languages, rather than having a different 'e' for
> Turkish, despite that Turkish and its related languages have their own
> unique alphabet that is both missing letters in other European alphabets
> (no 'q', 'w' or 'x') and has letters that don't exist in other European
> alphabets ('ı', 'Ş' etc.).
>
> Imagine a soft newline between every word of the following two phrases.
> Note that "com" and "org" in the two texts are *not* in the same language,
> though they are the same string:
>
>     comとorgと言うドメインは...
>     com and org domains are...
>
> Seems easy enough: just look and say that if _either_ side has a Japanese
> character, it must be Japanese language, right? But oops:
>
>     The Japanese character と is used for...

There is indeed no way to implement a function that works with such mixed
scripts, as different people have different conventions.
There are also languages that do not put spaces around words but instead
use spaces around punctuation.

I think that it would be worthwhile to implement line joining that at least
handles the simple cases, however. In cases where the lines are not joined
correctly, users can adjust the source to fix the issue.

Note that my prototype code inserts a space if either side is not a
Japanese character. I prefer to err on the side of adding spaces because
languages without spaces just look poorly formatted when spaces are added,
while languages with spaces can be very difficult to read without them.

> When considering this whole thing, it's probably also a hint that Unicode
> has (as far as I know) no character for a "soft" newline. And rightfully
> so, a soft newline sometimes isn't even a single character but instead lack
> of a sequence of characters. (E.g., in Markdown a newline that is _not_
> followed by another newline is a soft break rather than a paragraph break.)
>
> [1]: https://en.wikipedia.org/wiki/Han_unification

The soft line break is an artifact of the source markup language that I am
using and is unrelated to the core problem. The goal is to join fragments
of text:

    foldText :: [Text] -> Text

Neither the fragments nor the return value contains newlines. I used a
newline to separate the fragments in my initial implementation for
convenience, but feedback has helped me realize that it is best avoided. :)

> Well, Japanese text may have spaces in it, and not as a mistake:
>
>     「This is a pen」と言う英語は...
>
> It's not clear to me what would happen if a line break occurred before or
> after one of the spaces there, but I am suspecting that many typesetting
> systems would not remove the space but leave it at the start or end of a
> line.

This is a good example. I expect to process text that contains references
to English book titles that are formatted like this.
I have seen typesetting perform miserably with such mixed-language text,
giving lower "badness" to whitespace in the line breaking algorithm when a
break in the Japanese text would be far better.

Thank you for your thoughts on the subject!

Cheers,

Travis
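
As a rough illustration of the joining rule described in the message above
(insert a space unless the characters on both sides of the join are
Japanese), here is a minimal Haskell sketch. It is not the prototype code
from the thread. The helper `isJapanese` and its code point ranges are
assumptions made for this example only; a fuller implementation might
instead consult the Unicode Script property, for instance through ICU. Only
the signature `foldText :: [Text] -> Text` comes from the message.

```haskell
import Data.List (foldl')
import Data.Text (Text)
import qualified Data.Text as T

-- | Hypothetical helper: a rough "is this a Japanese character?" test
-- based on a few Unicode blocks. The choice of ranges is an assumption
-- for this sketch, not something specified in the thread.
isJapanese :: Char -> Bool
isJapanese c = any inRange ranges
  where
    inRange (lo, hi) = lo <= c && c <= hi
    ranges =
      [ ('\x3000', '\x303F') -- CJK Symbols and Punctuation
      , ('\x3040', '\x309F') -- Hiragana
      , ('\x30A0', '\x30FF') -- Katakana
      , ('\x4E00', '\x9FFF') -- CJK Unified Ideographs
      , ('\xFF00', '\xFFEF') -- Halfwidth and Fullwidth Forms
      ]

-- | Join text fragments. A space is inserted at each join unless the
-- characters on both sides of the join are Japanese, so the function
-- errs on the side of adding spaces. Empty fragments are skipped.
foldText :: [Text] -> Text
foldText = foldl' joinFragment T.empty . filter (not . T.null)
  where
    joinFragment acc frag
      | T.null acc = frag
      | needsSpace (T.last acc) (T.head frag) =
          T.concat [acc, T.singleton ' ', frag]
      | otherwise = T.append acc frag

    -- No space only when both neighbouring characters are Japanese.
    needsSpace l r = not (isJapanese l && isJapanese r)
```

With this sketch, the fragments "comと", "orgと", and "言うドメインは" join to
"comと orgと言うドメインは": a space is added before "org" because one side of
that join is not Japanese, which is the kind of case where the source would
need to be adjusted by hand.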
- Follow-Ups:
- Re: [tlug] Unicode/ICU question about joining lines
- From: Curt J. Sampson
- References:
- [tlug] Unicode/ICU question about joining lines
- From: Travis Cardwell
- Re: [tlug] Unicode/ICU question about joining lines
- From: Jim Breen
- Re: [tlug] Unicode/ICU question about joining lines
- From: Travis Cardwell
- Re: [tlug] Unicode/ICU question about joining lines
- From: Jim Breen
- Re: [tlug] Unicode/ICU question about joining lines
- From: Travis Cardwell
- Re: [tlug] Unicode/ICU question about joining lines
- From: Curt J. Sampson