Mailing List Archive
[tlug] Unicode/ICU question about joining lines
- Date: Fri, 13 Aug 2021 00:43:05 +0900
- From: "Stephen J. Turnbull" <turnbull.stephen.fw@example.com>
- Subject: [tlug] Unicode/ICU question about joining lines
- References: <CACaJP_QGLoO=qFPSQYUFp3PvZy7O7PTEBFvjBaom4-vPuHZLmw@mail.gmail.com>
Travis Cardwell writes:

> Dear TLUG,
>
> I have a Unicode question, and I am posting to this mailing list because
> it appears that the Lingo list is not used these days.

It is, it just doesn't get much traffic because all linguistic problems are straightforward. ;-)

> My problem is straightforward.

I guess the statement is. The solution is a very contingent thing though.

From your replies to various folks, it sounds like you're doing this on user input, maybe in an editor. In that case, just don't use soft line breaks in the data (or use real soft linebreaks[1]), and let the text widget do the line breaking in the UI.[2] That's more or less a solved problem (hel-lo Gecko, is that a soft linebreak in your pocket or are you disappointed to see me?), and can be done with much less guesswork, even if language isn't specified. Eg, hyphenation is no problem; do what you like, if it's wrong it's ephemeral because the user either spelled with a hyphen or she didn't, and either way the one on the screen goes away when she submits the form.

The other possibility would be to have LF be the soft line break, and insert it after any whitespace that signifies a linebreak point. In the (unlikely?) event that the user wants a paragraph break, use the TeX "empty line separates paragraphs" convention. And anybody who wants to do ASCII art, I guess they'll have to be fired. :-)

> Unicode technical reports provide information about text segmentation
> and line breaking, and ICU provides functionality for breaking strings
> on boundaries of grapheme clusters. I have not been able to find
> information or ICU functionality for joining strings, however.

Doing this well isn't in scope for ICU because it requires actual knowledge of the languages being processed, as well as the formatting of the text (indented block quotations, for example), and in some cases the author's intent. Doing a bad job isn't in scope for ICU either, since the application programmer can do it in a few LOC.
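For illustration only, here is a minimal sketch of the kind of "few LOC" bad job described above. It is not anything ICU provides, and the rules are assumptions made for the example: LF is treated as a soft line break, an empty line separates paragraphs (the TeX convention), and two lines are joined without a space only when the characters on both sides of the break are East Asian Wide or Fullwidth. The names `join_lines` and `_is_wide` are hypothetical. As the post predicts, it knows nothing about hyphenation, indentation, or the author's intent.

```python
import unicodedata


def _is_wide(ch: str) -> bool:
    # Assumption: East Asian Wide/Fullwidth characters stand in for
    # "scripts written without inter-word spaces". This is a crude proxy.
    return unicodedata.east_asian_width(ch) in ("W", "F")


def join_lines(text: str) -> str:
    """Join soft-wrapped lines; empty lines separate paragraphs."""
    paragraphs = []
    current = ""
    for line in text.split("\n"):
        line = line.strip()
        if not line:
            # Empty line: close the current paragraph, if any.
            if current:
                paragraphs.append(current)
                current = ""
            continue
        if not current:
            current = line
        elif _is_wide(current[-1]) and _is_wide(line[0]):
            # Wide character on both sides of the break: join with no space.
            current += line
        else:
            # Everything else gets a single space, right or wrong.
            current += " " + line
    if current:
        paragraphs.append(current)
    return "\n\n".join(paragraphs)
```

A word hyphenated across a line break, such as "hel-" followed by "lo", comes out as "hel- lo" here, which is exactly the sort of case that needs knowledge of the language and of what the author meant.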
It's true that the UCD knows something about languages (for example, collation orders). But it knows nothing about this.

> it would require classifying all Unicode blocks.

Welcome to natural language processing. Last I looked, the number of Unicode blocks would fit in a short, maybe a byte, easy peasy. :-)

Steve

Footnotes:
[1] Some possibilities: U+000B VERTICAL TABULATION, U+000C FORM FEED, U+000D CARRIAGE RETURN, U+0085 NEXT LINE, U+2028 LINE SEPARATOR, and U+2029 PARAGRAPH SEPARATOR, and pretty much any of the "high 16" ASCII control characters except ESC. If you don't "own" the widget you're at its mercy for the interpretation of any of those characters, though. Note that input is not a problem, because it would be automatically inserted at a line break point (after any intervening whitespace in languages like English) in order to mark the end for the display engine.

[2] This can massively suck in some text widgets, eg the ones used when creating a new README.md in GitHub.
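As a rough illustration of footnote [1], the sketch below assumes U+2028 LINE SEPARATOR is the character chosen as the "real" soft line break, and that the application owns the widget that inserts and interprets it. The helper names `display_lines` and `strip_soft_breaks` are hypothetical. Because the marker sits after any whitespace at the break point, removing it on submit leaves the user's own text untouched.

```python
SOFT_BREAK = "\u2028"  # assumption: LINE SEPARATOR chosen as the soft break


def display_lines(text: str) -> list[str]:
    # What a display engine might do: cut the text at each soft break.
    return text.split(SOFT_BREAK)


def strip_soft_breaks(text: str) -> str:
    # On submit, drop the ephemeral markers; surrounding whitespace,
    # hyphens, and hard LF characters typed by the user are kept as-is.
    return text.replace(SOFT_BREAK, "")
```

Any of the other candidates in the footnote could be substituted for `SOFT_BREAK`, subject to the same caveat about how a widget you don't control interprets it.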
- Follow-Ups:
- Re: [tlug] Unicode/ICU question about joining lines
- From: Curt J. Sampson
- Re: [tlug] Unicode/ICU question about joining lines
- From: Travis Cardwell
- Re: [tlug] Unicode/ICU question about joining lines
- From: Darren Cook
- References:
- [tlug] Unicode/ICU question about joining lines
- From: Travis Cardwell