Mailing List Archive
Re: [tlug] Unicode/ICU question about joining lines
- Date: Thu, 12 Aug 2021 18:26:17 +0900
- From: "Curt J. Sampson" <cjs@example.com>
- Subject: Re: [tlug] Unicode/ICU question about joining lines
- References: <CACaJP_QGLoO=qFPSQYUFp3PvZy7O7PTEBFvjBaom4-vPuHZLmw@mail.gmail.com> <CABHGxq5Ma58RaaruwK6x+5o_vBh66fkjHnTjWYpaMC5FYOgOTg@mail.gmail.com> <CACaJP_RQusGfT6BFPP2SDh9oHtd3Y34eEmiHNdJArhwKbg4EdQ@mail.gmail.com> <CABHGxq5YcO1HPOOA_8yuhEmKeubNUfn8cS35OuB3MLMTamjYBQ@mail.gmail.com> <CACaJP_TvuwMQ3T6zqVxmgHeqzAV+i=hdeykq7EgmDayN4GNNvQ@mail.gmail.com> <YRTRNBM86fgWvvcu@telephonic.cynic.net> <CACaJP_Se7df=U0-g6oyF2-C6tL0rqmOeiSkgdG10cr99rhre9w@mail.gmail.com>
On 2021-08-12 17:42 +0900 (Thu), Travis Cardwell wrote:

> The soft line break is an artifact of the source markup language that I
> am using and is unrelated to the core problem.

Ah, well, I'd argue that the whole idea of a "soft line break" _is_ the core problem. You're introducing something new here that Unicode was never designed to handle (and was probably designed to avoid handling). That may not be completely obvious given that Unicode offers things like Standard Annex #14, "Unicode Line Breaking Algorithm"[tr14], but what you're doing is not trying to figure out how to add whitespace, but the exact opposite: figuring out when _removing_ whitespace is ok.

[tr14]: http://www.unicode.org/reports/tr14/

> foldText :: [Text] -> Text
> ...
> Neither the fragments nor the return value contain newlines.

Right. I was just mentioning (it seems more for others than for you) that the implementation of a soft line break is not necessarily a "character" thing but may be a higher-level property of the text, to help clarify that this is not something Unicode should be addressing.

> For example, ICU includes translations, dictionary lookup functionality
> (required for correct segmentation in some languages), etc.

Yes, but again: that's about adding, not removing, whitespace. (I myself had not realized that these are such different things until now.)

> I think that it would be worthwhile to implement line joining that at
> least handles the simple cases, however.

Putting on my "software engineering" hat (the real engineering hat that involves designing towards certain types and rates of failure), I would definitely want _not_ to see simple but not-always-correct cases in the standard. Instead I'd leave them out of the standard specifically to force individual developers to implement (or download, steal, whatever) the particular algorithms and implementations that break least annoyingly in their situations. We know that they're all going to fail, but the standards guys do not know which failures are livable and which are highly damaging, because that's different for different users.

In other words, you're going in a fine direction and should carry on: the lack of anything addressing the removal of whitespace in Unicode itself is to your advantage in that it lets you make tradeoffs for your situation that will keep your costs low.

cjs
-- 
Curt J. Sampson <cjs@example.com> +81 90 7737 2974

To iterate is human, to recurse divine.
    - L Peter Deutsch
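
To make the "simple cases" discussed in the message concrete, here is a minimal sketch of a line-joining function with the quoted `foldText :: [Text] -> Text` signature. Everything except that signature is an assumption made for illustration: the helper names `isCJK` and `joinPair`, the hand-picked code point ranges, and the rule itself (drop the joining space only when both neighbouring characters fall in those ranges). It is not the original poster's implementation, and it deliberately ignores cases that a real application would have to decide for itself.

```haskell
{-# LANGUAGE OverloadedStrings #-}

import Data.Char (ord)
import Data.Text (Text)
import qualified Data.Text as T

-- | Hypothetical predicate: True for characters in a few hand-picked
-- CJK blocks. This is an assumption for the sketch, not a Unicode
-- property lookup; a real application would choose its own set.
isCJK :: Char -> Bool
isCJK c = any inRange ranges
  where
    n = ord c
    inRange (lo, hi) = lo <= n && n <= hi
    ranges =
      [ (0x3000, 0x303F) -- CJK Symbols and Punctuation
      , (0x3040, 0x309F) -- Hiragana
      , (0x30A0, 0x30FF) -- Katakana
      , (0x4E00, 0x9FFF) -- CJK Unified Ideographs
      , (0xFF00, 0xFFEF) -- Halfwidth and Fullwidth Forms
      ]

-- | Join two fragments. Empty fragments are skipped; otherwise a single
-- space is inserted unless both characters at the seam are CJK.
joinPair :: Text -> Text -> Text
joinPair a b
  | T.null a = b
  | T.null b = a
  | isCJK (T.last a) && isCJK (T.head b) = a <> b
  | otherwise = a <> " " <> b

-- | Fold newline-free fragments into one newline-free line.
foldText :: [Text] -> Text
foldText = foldr joinPair T.empty

main :: IO ()
main = do
  -- 'print' shows non-ASCII characters as escapes, so this runs in any locale.
  print (foldText ["hello", "world"])
  print (foldText ["日本語の", "文章"])
  print (foldText ["Tokyo", "東京"])
```

The first example keeps a space between the two words and the second joins the fragments directly. The third, mixed-script example gets a space under this rule, which is exactly the kind of tradeoff that may be livable in one situation and wrong in another.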
- References:
  - [tlug] Unicode/ICU question about joining lines
    - From: Travis Cardwell
  - Re: [tlug] Unicode/ICU question about joining lines
    - From: Jim Breen
  - Re: [tlug] Unicode/ICU question about joining lines
    - From: Travis Cardwell
  - Re: [tlug] Unicode/ICU question about joining lines
    - From: Jim Breen
  - Re: [tlug] Unicode/ICU question about joining lines
    - From: Travis Cardwell
  - Re: [tlug] Unicode/ICU question about joining lines
    - From: Curt J. Sampson
  - Re: [tlug] Unicode/ICU question about joining lines
    - From: Travis Cardwell