Mailing List Archive
Re: [tlug] Unicode/ICU question about joining lines
- Date: Thu, 12 Aug 2021 16:56:33 +0900
- From: "Curt J. Sampson" <cjs@example.com>
- Subject: Re: [tlug] Unicode/ICU question about joining lines
- References: <CACaJP_QGLoO=qFPSQYUFp3PvZy7O7PTEBFvjBaom4-vPuHZLmw@mail.gmail.com> <CABHGxq5Ma58RaaruwK6x+5o_vBh66fkjHnTjWYpaMC5FYOgOTg@mail.gmail.com> <CACaJP_RQusGfT6BFPP2SDh9oHtd3Y34eEmiHNdJArhwKbg4EdQ@mail.gmail.com> <CABHGxq5YcO1HPOOA_8yuhEmKeubNUfn8cS35OuB3MLMTamjYBQ@mail.gmail.com> <CACaJP_TvuwMQ3T6zqVxmgHeqzAV+i=hdeykq7EgmDayN4GNNvQ@mail.gmail.com>
On 2021-08-12 14:39 +0900 (Thu), Travis Cardwell wrote:

> Unicode is relevant because Unicode properties (defined in the Unicode
> Character Database) of the characters on either side of a line break can
> be used to determine how the lines should be joined. I hoped that ICU
> would already provide such functionality, but it does not.

I've thought on this a bit and I think you're wrong that Unicode would have anything to say about this.

The main issue is that the use of whitespace is a language-specific issue and Unicode _does not deal with language issues or even markup_, only character encoding issues. This is most obvious in the Han unification[1] of CJK ideographs, but it is evident even in Western languages if you think about it: we use the same \u0065 'e' for all Latin-script languages, rather than having a different 'e' for Turkish, even though Turkish and its related languages have their own unique alphabet that both lacks letters found in other European alphabets (no 'q', 'w' or 'x') and has letters that don't exist in other European alphabets ('ı', 'Ş' etc.).

Imagine a soft newline between every word of the following two phrases. Note that "com" and "org" in the two texts are *not* in the same language, though they are the same string:

    comとorgと言うドメインは...
    com and org domains are...

Seems easy enough: just look and say that if _either_ side has a Japanese character, it must be Japanese language, right? But oops:

    The Japanese character と is used for...

When considering this whole thing, it's probably also a hint that Unicode has (as far as I know) no character for a "soft" newline. And rightfully so: a soft newline sometimes isn't even a single character, but rather the absence of a particular sequence of characters. (E.g., in Markdown a newline that is _not_ followed by another newline is a soft break rather than a paragraph break.)

[1]: https://en.wikipedia.org/wiki/Han_unification

On 2021-08-12 16:16 +0900 (Thu), eizietheez@example.com wrote:

> Do you accept multi-lingual text?
> If not, then a simple hack would be to just
> look for spaces in the input text and classify the language accordingly.

Well, Japanese text may have spaces in it, and not as a mistake:

    「This is a pen」と言う英語は...

It's not clear to me what would happen if a line break occurred before or after one of the spaces there, but I suspect that many typesetting systems would not remove the space but leave it at the start or end of a line.

cjs
--
Curt J. Sampson <cjs@example.com> +81 90 7737 2974

To iterate is human, to recurse divine. - L Peter Deutsch
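
To make the "either side has a Japanese character" rule from the message above concrete, here is a minimal Python sketch of it. This is only an illustration, not anything from ICU or from the thread: the function names `is_japanese_char` and `join_naive` are made up for this example, and the code point ranges are a rough assumption about what counts as a "Japanese character" (Hiragana, Katakana and the main CJK Unified Ideographs block only).

```python
def is_japanese_char(ch):
    """Rough check (an assumption for this sketch): Hiragana, Katakana,
    or a character in the main CJK Unified Ideographs block."""
    cp = ord(ch)
    return 0x3040 <= cp <= 0x30FF or 0x4E00 <= cp <= 0x9FFF


def join_naive(lines):
    """Join soft-wrapped lines using the naive rule: if the character on
    either side of a break is Japanese, join with nothing; otherwise
    join with a single space."""
    result = ""
    for line in lines:
        if not line:
            continue  # ignore empty lines in this sketch
        if result and not (is_japanese_char(result[-1])
                           or is_japanese_char(line[0])):
            result += " "
        result += line
    return result


# The two phrases from the message, broken between every word:
print(join_naive(["com", "と", "org", "と", "言う", "ドメインは"]))
print(join_naive(["com", "and", "org", "domains", "are"]))

# The counterexample: English text that mentions a Japanese character.
print(join_naive(["The", "Japanese", "character", "と", "is", "used", "for"]))
```

The first two calls give back the original phrases, but the third yields "The Japanese characterとis used for": the spaces around と are lost, which is the failure the message points out. The characters next to the break do not tell you which language the surrounding text is in.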
- Follow-Ups:
- Re: [tlug] Unicode/ICU question about joining lines
- From: Travis Cardwell
- References:
- [tlug] Unicode/ICU question about joining lines
- From: Travis Cardwell
- Re: [tlug] Unicode/ICU question about joining lines
- From: Jim Breen
- Re: [tlug] Unicode/ICU question about joining lines
- From: Travis Cardwell
- Re: [tlug] Unicode/ICU question about joining lines
- From: Jim Breen
- Re: [tlug] Unicode/ICU question about joining lines
- From: Travis Cardwell