Re: [tlug] Unicode/ICU question about joining lines

Date: Thu, 12 Aug 2021 17:42:12 +0900
From: Travis Cardwell <travis.cardwell@example.com>
Subject: Re: [tlug] Unicode/ICU question about joining lines
References: <CACaJP_QGLoO=qFPSQYUFp3PvZy7O7PTEBFvjBaom4-vPuHZLmw@mail.gmail.com> <CABHGxq5Ma58RaaruwK6x+5o_vBh66fkjHnTjWYpaMC5FYOgOTg@mail.gmail.com> <CACaJP_RQusGfT6BFPP2SDh9oHtd3Y34eEmiHNdJArhwKbg4EdQ@mail.gmail.com> <CABHGxq5YcO1HPOOA_8yuhEmKeubNUfn8cS35OuB3MLMTamjYBQ@mail.gmail.com> <CACaJP_TvuwMQ3T6zqVxmgHeqzAV+i=hdeykq7EgmDayN4GNNvQ@mail.gmail.com> <YRTRNBM86fgWvvcu@telephonic.cynic.net>

Hi Curt!

On Thu, Aug 12, 2021 at 5:01 PM Curt J. Sampson wrote:
> I've thought on this a bit and I think you're wrong that Unicode would have
> anything to say about this. The main issue is that the use of whitespace is
> a language-specific issue and Unicode _does not deal with language issues
> or even markup_, only character encoding issues.

With respect to the Unicode encodings, this is true.  The Unicode
Character Database (UCD) is also part of the Unicode Standard, however,
and it does deal with language-related issues.  It defines metadata
(properties) for characters, and the International Components for
Unicode (ICU) library provides an API that uses these properties to
implement functionality needed for software I18N, not just encoding.
For example, ICU includes translations, dictionary lookup functionality
(required for correct segmentation in some languages), etc.
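
As a small illustration of what "properties" means here, the sketch
below uses `generalCategory` from GHC's `Data.Char`, which exposes the
Unicode General_Category property.  It is only one of many UCD
properties; the Script property, which is probably more relevant to
line joining, is not available in `base` and would need an ICU binding.
The names `describe` and `samples` are made up for this example.

```haskell
import Data.Char (GeneralCategory (..), generalCategory)

-- | Pair a character with its Unicode General_Category property.
-- ('describe' is a hypothetical helper, not part of any library.)
describe :: Char -> (Char, GeneralCategory)
describe c = (c, generalCategory c)

-- | A Latin letter, a hiragana letter, an ideographic comma, and a space.
samples :: [(Char, GeneralCategory)]
samples = map describe "eと、 "
```

The same Latin 'e' gets the same category regardless of the language it
is used in, which is consistent with the point that properties are
attached to characters rather than to languages.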

> This is most obvious in the Han unification[1] of CJK ideographs, but is
> even in western languages if you think about it: we use the same \u0065 'e'
> for all Latin-script languages, rather than having a different 'e' for
> Turkish, despite that Turkish and its related languages have their own
> unique alphabet that is both missing letters in other European alphabets
> (no 'q', 'w' or 'x') and has letters that don't exist in other European
> alphabets ('ı', 'Ş' etc.).
>
> Imagine a soft newline between every word of the following two phrases.
> Note that "com" and "org" in the two texts are *not* in the same language,
> though they are the same string:
>
>   comとorgと言うドメインは...
>   com and org domains are...
>
> Seems easy enough: just look and say that if _either_ side has a Japanese
> character, it must be Japanese language, right? But oops:
>
>   The Japanese character と is used for...

There is indeed no way to implement a function that handles all such
mixed-script text correctly, as different people have different
conventions.  There
are also languages that do not put spaces around words but instead use
spaces around punctuation.  I think that it would be worthwhile to
implement line joining that at least handles the simple cases, however.
In cases where the lines are not joined correctly, users can adjust the
source to fix the issue.

Note that my prototype code inserts a space if the character on either
side of the join is not a Japanese character.  I prefer to err on the
side of adding spaces because languages without spaces just look poorly
formatted when spaces are added, while languages with spaces can be
very difficult to read without them.
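
The rule described above could be sketched roughly as follows.  This is
not the prototype itself: `isJapanese` and `joinPair` are hypothetical
names, and the test for "Japanese character" is a crude approximation
based on a handful of Unicode block ranges (an assumption; a real
implementation would more likely consult script properties via ICU).

```haskell
{-# LANGUAGE OverloadedStrings #-}

import Data.Char (ord)
import Data.Text (Text)
import qualified Data.Text as T

-- | Rough approximation of "Japanese character" by Unicode block.
-- The choice of blocks is an assumption made for this sketch.
isJapanese :: Char -> Bool
isJapanese c = any inRange ranges
  where
    n = ord c
    inRange (lo, hi) = lo <= n && n <= hi
    ranges =
      [ (0x3000, 0x303F) -- CJK Symbols and Punctuation
      , (0x3040, 0x309F) -- Hiragana
      , (0x30A0, 0x30FF) -- Katakana
      , (0x4E00, 0x9FFF) -- CJK Unified Ideographs
      , (0xFF00, 0xFFEF) -- Halfwidth and Fullwidth Forms
      ]

-- | Join two fragments.  A space is inserted unless the characters on
-- both sides of the join are Japanese, which errs on the side of
-- adding spaces.
joinPair :: Text -> Text -> Text
joinPair l r = case (T.unsnoc l, T.uncons r) of
  (Just (_, a), Just (b, _))
    | isJapanese a && isJapanese b -> l <> r
    | otherwise -> l <> " " <> r
  _ -> l <> r -- at least one side is empty, so there is nothing to separate
```

With this rule, joining "comと" and "orgと言う" inserts a space because
'o' is not Japanese, which is exactly the kind of case where users would
need to adjust the source.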

> When considering this whole thing, it's probably also a hint that Unicode
> has (as far as I know) no character for a "soft" newline. And rightfully
> so, a soft newline sometimes isn't even a single character but instead lack
> of a sequence of characters. (E.g., in Markdown a newline that is _not_
> followed by another newline is a soft break rather than a paragraph break.)
>
> [1]: https://en.wikipedia.org/wiki/Han_unification

The soft line break is an artifact of the source markup language that I
am using and is unrelated to the core problem.  The goal is to join
fragments of text:

    foldText :: [Text] -> Text

Neither the fragments nor the return value contain newlines.  I used a
newline to separate the fragments in my initial implementation for
convenience, but feedback has helped me realize that it is best
avoided. :)
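
To make the shape of the problem concrete, here is a minimal sketch of
a function with that signature.  The boundary rule is passed in as a
parameter so that the folding logic stays separate from the
language-specific decision; `foldTextWith` and `asciiRule` are
hypothetical names, and `asciiRule` is only a placeholder rule, not the
one used in my prototype.

```haskell
{-# LANGUAGE OverloadedStrings #-}

import Data.Char (isAscii)
import Data.List (foldl')
import Data.Text (Text)
import qualified Data.Text as T

-- | Join fragments left to right.  The 'needsSpace' argument decides,
-- from the two characters that meet at a join, whether to put a space
-- between them.  Empty fragments are dropped before joining.
foldTextWith :: (Char -> Char -> Bool) -> [Text] -> Text
foldTextWith needsSpace = foldl' step T.empty . filter (not . T.null)
  where
    step acc frag = case (T.unsnoc acc, T.uncons frag) of
      (Just (_, a), Just (b, _))
        | needsSpace a b -> acc <> " " <> frag
      _ -> acc <> frag

-- | Placeholder rule (an assumption for this sketch): insert a space
-- whenever either character at the join is ASCII.
asciiRule :: Char -> Char -> Bool
asciiRule a b = isAscii a || isAscii b

foldText :: [Text] -> Text
foldText = foldTextWith asciiRule
```

Repeated appends make this quadratic in the worst case, which should
not matter for paragraph-sized input.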

> Well, Japanese text may have spaces in it, and not as a mistake:
>
>     「This is a pen」と言う英語は...
>
> It's not clear to me what would happen if a line break occurred before or
> after one of the spaces there, but I am suspecting that many typesetting
> systems would not remove the space but leave it at the start or end of a
> line.

This is a good example.  I expect to process text that contains
references to English book titles that are formatted like this.

I have seen typesetting systems perform miserably with such
mixed-language text: the line-breaking algorithm gives lower "badness"
to a break at whitespace even when a break within the Japanese text
would be far better.

Thank you for your thoughts on the subject!

Cheers,

Travis
