Mailing List Archive



Re: [tlug] Unicode/ICU question about joining lines



On 2021-08-12 14:39 +0900 (Thu), Travis Cardwell wrote:

> Unicode is relevant because Unicode properties (defined in the Unicode
> Character Database) of the characters on either side of a line break can
> be used to determine how the lines should be joined.  I hoped that ICU
> would already provide such functionality, but it does not.

I've thought about this a bit and I think you're wrong that Unicode would
have anything to say about this. The main issue is that the use of
whitespace is a language-specific issue and Unicode _does not deal with
language issues or even markup_, only character encoding issues.

This is most obvious in the Han unification[1] of CJK ideographs, but it is
true even in Western languages if you think about it: we use the same
\u0065 'e' for all Latin-script languages, rather than having a different
'e' for Turkish, even though Turkish and its related languages have their
own alphabet that both lacks letters found in other European alphabets
(no 'q', 'w' or 'x') and has letters that don't exist in other European
alphabets ('ı', 'Ş' etc.).

Imagine a soft newline between every word of the following two phrases.
Note that "com" and "org" in the two texts are *not* in the same language,
though they are the same string:

  comとorgと言うドメインは...
  com and org domains are...

Seems easy enough: just look and say that if _either_ side has a Japanese
character, it must be Japanese language, right? But oops:

  The Japanese character と is used for...
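
To make the failure concrete, here is a minimal Python sketch of that
"either side is Japanese" rule. The helpers `is_japanese` and `naive_join`
are hypothetical names of my own, the script test is a rough one based on
Unicode character names, and the input is assumed to be a non-empty list of
non-empty lines:

```python
import unicodedata

def is_japanese(ch: str) -> bool:
    """Rough script test (an assumption, not a real language detector):
    hiragana, katakana, or a CJK unified ideograph, judged by Unicode name."""
    name = unicodedata.name(ch, "")
    return name.startswith(("HIRAGANA", "KATAKANA", "CJK UNIFIED IDEOGRAPH"))

def naive_join(lines):
    """Join soft-broken lines, dropping the space whenever the character on
    either side of the break looks Japanese."""
    out = lines[0]
    for line in lines[1:]:
        if is_japanese(out[-1]) or is_japanese(line[0]):
            out += line          # treat the break as Japanese: no space
        else:
            out += " " + line    # treat the break as English: one space
    return out

japanese = naive_join(["com", "と", "org", "と言うドメインは..."])
english = naive_join(["com", "and", "org", "domains", "are..."])
mixed = naive_join(["The", "Japanese", "character", "と", "is", "used", "for..."])
```

The first two inputs come out as intended, but the third becomes
"The Japanese characterとis used for...", losing the spaces on both sides of
the と even though the sentence is English.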

When considering this whole thing, it's probably also a hint that Unicode
has (as far as I know) no character for a "soft" newline. And rightly so: a
soft newline sometimes isn't even a single character but rather the absence
of a sequence of characters. (E.g., in Markdown a newline that is _not_
followed by another newline is a soft break rather than a paragraph break.)
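
As a rough illustration of that Markdown point, the sketch below classifies
each run of newlines by what follows it rather than by the character itself.
`classify_newlines` is a hypothetical name, and this is a deliberate
simplification: real Markdown also has hard breaks and treats
whitespace-only lines as blank, neither of which is handled here.

```python
import re

def classify_newlines(text):
    """Return (offset, kind) for each run of newlines in text:
    'soft' for a lone newline, 'paragraph' for two or more in a row."""
    return [
        (m.start(), "soft" if len(m.group()) == 1 else "paragraph")
        for m in re.finditer(r"\n+", text)
    ]

breaks = classify_newlines("a\nb\n\nc")
```

The same U+000A character shows up in both kinds of break; only its context
tells them apart.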

[1]: https://en.wikipedia.org/wiki/Han_unification

On 2021-08-12 16:16 +0900 (Thu), eizietheez@example.com wrote:

> Do you accept multi-lingual text? If not, then a simple hack would be to just
> look for spaces in the input text and classify the language accordingly.

Well, Japanese text may have spaces in it, and not as a mistake:

    「This is a pen」と言う英語は...

It's not clear to me what would happen if a line break occurred before or
after one of the spaces there, but I suspect that many typesetting systems
would not remove the space but leave it at the start or end of a line.

cjs
-- 
Curt J. Sampson      <cjs@example.com>      +81 90 7737 2974

To iterate is human, to recurse divine.
    - L Peter Deutsch

