


[tlug] Unicode/ICU question about joining lines



Travis Cardwell writes:
 > Dear TLUG,
 > 
 > I have a Unicode question, and I am posting to this mailing list because
 > it appears that the Lingo list is not used these days.

It is, it just doesn't get much traffic because all linguistic
problems are straightforward. ;-)

 > My problem is straightforward.

I guess the statement is.  The solution is a very contingent thing
though.

From your replies to various folks, it sounds like you're doing this
on user input, maybe in an editor.  In that case, just don't use soft
line breaks in the data (or use real soft linebreaks[1]), and let the
text widget do the line breaking in the UI.[2]  That's more or less a
solved problem (hel-lo Gecko, is that a soft linebreak in your pocket
or are you disappointed to see me?), and can be done with much less
guesswork, even if language isn't specified.  Eg, hyphenation is no
problem; do what you like, if it's wrong it's ephemeral because the
user either spelled with a hyphen or she didn't, and either way the
one on the screen goes away when she submits the form.

The other possibility would be to have LF be the soft line break, and
insert it after any whitespace that signifies a linebreak point.  In
the (unlikely?) event that the user wants a paragraph break, use the
TeX "empty line separates paragraphs" convention.  And anybody who wants
to do ASCII art, I guess they'll have to be fired. :-)
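A minimal sketch of that convention in Python, assuming the text was
wrapped by inserting LF after (not instead of) any whitespace at the
break point.  The function name `unwrap` is made up for illustration.

```python
import re


def unwrap(text: str) -> str:
    """Remove soft breaks (single LF) and keep paragraph breaks (empty lines)."""
    # Paragraphs are separated by one or more empty lines, i.e. two or more LFs.
    paragraphs = re.split(r"\n{2,}", text)
    # Inside a paragraph every LF is a soft break.  The whitespace before it
    # was kept when wrapping, so deleting the LF restores the original text.
    return "\n\n".join(p.replace("\n", "") for p in paragraphs)


wrapped = "The quick brown \nfox jumps.\n\n日本語の\n文章です。"
print(unwrap(wrapped))
```

Because the whitespace stays in the data, the join needs no knowledge of
the language: the English paragraph keeps its space and the Japanese one
gets none.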

 > Unicode technical reports provide information about text segmentation
 > and line breaking, and ICU provides functionality for breaking strings
 > on boundaries of grapheme clusters.  I have not been able to find
 > information or ICU functionality for joining strings, however.

Doing this well isn't in scope for ICU because it requires actual
knowledge of the languages being processed, as well as the formatting
of the text (indented block quotations, for example), and in some
cases the author's intent.  Doing a bad job isn't in scope for ICU
either, since the application programmer can do it in a few LOC.
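For concreteness, here is roughly what the "bad job" looks like in
Python.  This is only an illustration (the name `naive_join` is
hypothetical), and it deliberately knows nothing about language,
hyphenation, or formatting.

```python
def naive_join(lines):
    """Join lines with a single space, ignoring language and formatting."""
    # Strip surrounding whitespace, drop empty lines, and glue with one space.
    return " ".join(line.strip() for line in lines if line.strip())


print(naive_join(["The quick", "brown fox"]))  # fine for English
print(naive_join(["日本語の", "文章"]))          # inserts a spurious space
```

The second call shows why this counts as a bad job: a space appears in
the middle of Japanese text that had none.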

It's true that the UCD knows something about languages (for example,
collation orders).  But it knows nothing about this.

 > it would require classifying all Unicode blocks.

Welcome to natural language processing.  Last I looked, the number of
Unicode blocks would fit in a short, maybe a byte, easy peasy. :-)
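A rough sketch of what a block-based join might look like in Python.
The table below is an assumption for illustration only: it lists just
four blocks commonly used for Japanese text, and the names
`NO_SPACE_BLOCKS`, `in_no_space_block`, and `join_lines` are made up.  A
real classification would need many more blocks and still would not
handle hyphenation or formatting.

```python
# Hypothetical, deliberately incomplete table of blocks whose characters
# are normally written without spaces between them.
NO_SPACE_BLOCKS = [
    (0x3000, 0x303F),  # CJK Symbols and Punctuation
    (0x3040, 0x309F),  # Hiragana
    (0x30A0, 0x30FF),  # Katakana
    (0x4E00, 0x9FFF),  # CJK Unified Ideographs
]


def in_no_space_block(ch: str) -> bool:
    """Return True if the character falls in one of the listed blocks."""
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in NO_SPACE_BLOCKS)


def join_lines(lines) -> str:
    """Join lines, omitting the space when both sides are no-space characters."""
    result = ""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        # Insert a space unless the characters on both sides of the join
        # belong to a no-space block.
        if result and not (in_no_space_block(result[-1]) and in_no_space_block(line[0])):
            result += " "
        result += line
    return result


print(join_lines(["日本語の", "文章"]))
print(join_lines(["The quick", "brown fox"]))
```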

Steve

Footnotes: 
[1]  Some possibilities: U+000B VERTICAL TABULATION, U+000C FORM FEED,
U+000D CARRIAGE RETURN, U+0085 NEXT LINE, U+2028 LINE SEPARATOR,
U+2029 PARAGRAPH SEPARATOR, and pretty much any of the "high 16" ASCII
control characters except ESC.  If you don't "own" the widget you're at
its mercy for the interpretation of any of those characters, though.
    Note that input is not a problem, because the soft line break
would be automatically inserted at a line break point (after any
intervening whitespace in languages like English) in order to mark the
end of the line for the display engine.
[2]  This can massively suck in some text widgets, eg the ones used
when creating a new README.md in GitHub.



