Mailing List Archive



Re: [tlug] Unicode/ICU question about joining lines



On 2021-08-12 17:42 +0900 (Thu), Travis Cardwell wrote:

> The soft line break is an artifact of the source markup language that I
> am using and is unrelated to the core problem.

Ah, well, I'd argue that the whole idea of a "soft line break" _is_ the
core problem. You're introducing something new here that Unicode was never
designed to handle (and probably designed to avoid handling). That may not
be completely obvious given that Unicode offers things like Standard Annex
#14 "Unicode Line Breaking Algorithm"[tr14], but what you're doing is not
trying to figure out how to add whitespace, but the exact opposite:
figuring out when _removing_ whitespace is ok.

[tr14]: http://www.unicode.org/reports/tr14/

>     foldText :: [Text] -> Text
> ...
> Neither the fragments nor the return value contain newlines.

Right. I was just mentioning (it seems more for others than for you) that
the implementation of a soft line break is not necessarily a "character"
thing but may be a higher-level property of the text, to help clarify that
this is not something Unicode should be addressing.

> For example, ICU includes translations, dictionary lookup functionality
> (required for correct segmentation in some languages), etc.

Yes, but again: that's about adding, not removing, whitespace. (I myself
had not realized that these are such different things until now.)

> I think that it would be worthwhile to implement line joining that at
> least handles the simple cases, however.

Putting on my "software engineering" hat (the real engineering hat that
involves designing towards certain types and rates of failure), I would
definitely _not_ want to see simple but not-always-correct cases in the
standard. Better to leave them out specifically to force individual
developers to implement (or download, steal, whatever) the particular
algorithms and implementations that break least annoyingly in their
situations. We know that they're all going to fail, but the standards guys
do not know which failures are livable and which are highly damaging,
because that's different for different users.

In other words, you're going in a fine direction and should carry on: the
lack of anything addressing the removal of whitespace in Unicode itself is
to your advantage in that it lets you make tradeoffs for your situation
that will keep your costs low.
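
To make that concrete, here is a rough sketch of one such tradeoff, using
the `foldText :: [Text] -> Text` signature quoted above. Everything else is
my assumption: the helper names `isNoSpaceScript` and `joinPair` are made
up for illustration, and the code point ranges are a hand-picked
approximation, not a lookup of any real Unicode property. The rule is
simply "drop the soft break when both sides are CJK-ish characters,
otherwise replace it with a space."

```haskell
import Data.Char (ord)
import Data.Text (Text)
import qualified Data.Text as T

-- | Rough test for characters from scripts normally written without
-- inter-word spaces. ASSUMPTION: these hand-picked block ranges stand in
-- for a proper Unicode property lookup and will misclassify some text.
isNoSpaceScript :: Char -> Bool
isNoSpaceScript c = any inRange ranges
  where
    n = ord c
    inRange (lo, hi) = lo <= n && n <= hi
    ranges =
      [ (0x3000, 0x303F) -- CJK Symbols and Punctuation
      , (0x3040, 0x309F) -- Hiragana
      , (0x30A0, 0x30FF) -- Katakana
      , (0x4E00, 0x9FFF) -- CJK Unified Ideographs
      , (0xFF00, 0xFFEF) -- Halfwidth and Fullwidth Forms
      ]

-- | Join two fragments. The soft break between them is dropped only when
-- the characters on both sides are "no-space script" characters;
-- otherwise it becomes a single space. Empty fragments are ignored.
joinPair :: Text -> Text -> Text
joinPair a b
  | T.null a = b
  | T.null b = a
  | isNoSpaceScript (T.last a) && isNoSpaceScript (T.head b) = a <> b
  | otherwise = a <> T.singleton ' ' <> b

-- | Fold a list of newline-free fragments into one newline-free Text.
foldText :: [Text] -> Text
foldText = foldr joinPair T.empty
```

With this rule, the fragments "hello" and "world" join as "hello world",
while "日本" and "語" join as "日本語". It is knowingly wrong in places: a
Japanese fragment followed by an English one gets a space that some house
styles would not want, and Korean, Thai, and so on are not considered at
all. Whether those failures are livable is exactly the per-situation call
described above.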

cjs
-- 
Curt J. Sampson      <cjs@example.com>      +81 90 7737 2974

To iterate is human, to recurse divine.
    - L Peter Deutsch

