
Re: [tlug] Unicode/ICU question about joining lines
Hi Steve!
On Fri, Aug 13, 2021 at 12:48 AM Stephen J. Turnbull wrote:
> Travis Cardwell writes:
> > My problem is straightforward.
>
> I guess the statement is. The solution is a very contingent thing
> though.
This is the most valuable thing that I learned from the discussion.
From this, I realized that I should implement a solution that is easy
to document/understand and avoid solutions with edge cases that require
special explanation.
> From your replies to various folks, it sounds like you're doing this
> on user input, maybe in an editor. In that case, just don't use soft
> line breaks in the data (or use real soft linebreaks[1]), and let the
> text widget do the line breaking in the UI.[2] That's more or less a
> solved problem (hel-lo Gecko, is that a soft linebreak in your pocket
> or are you disappointed to see me?), and can be done with much less
> guesswork, even if language isn't specified. Eg, hyphenation is no
> problem; do what you like, if it's wrong it's ephemeral because the
> user either spelled with a hyphen or she didn't, and either way the
> one on the screen goes away when she submits the form.
Thank you for the advice.
The input is read from text files, and I estimate that the vast majority
of users will manage the files on GitHub. Changes to text will often be
done via pull request. Git(Hub) diffs of wrapped text can be a bit
frustrating, but I think that they are much easier to read than diffs of
extremely long lines of text.
I had an idea for solving the problem in a different way and implemented
it yesterday evening. The value that I am working with is read from a
YAML block of metadata, so I can make use of the different types of YAML
block scalar syntax.
When using a language that separates words with spaces, users can use a
folded block scalar, which joins lines with a space in between.
    description: >
      This is an
      English example.
When this YAML is parsed, the value is `This is an English example.`
with no newline characters.
When using a language that does not separate words with spaces, users
can use a literal block scalar, which keeps all but the trailing
newline.
    description: |
      これは日本語の
      例です。
When this YAML is parsed, the value is `これは日本語の\n例です。` with a
newline in the middle of the text.
The software folds lines by joining them without inserting a space. The
value of the English example stays the same since it does not include
newline characters, while the value of the Japanese example becomes
`これは日本語の例です。` as desired.
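
As a minimal sketch of that folding step (Python is used here only for
illustration, and the name `fold_lines` is hypothetical rather than taken
from the actual software), assuming the parsed YAML value is already
available as a string:

```python
def fold_lines(value: str) -> str:
    """Join the lines of a parsed scalar without inserting a space."""
    # Removing every "\n" joins the lines directly. A trailing newline,
    # if the YAML parser kept one, is removed as well.
    return value.replace("\n", "")


# Folded scalar (>): the parser has already joined the lines with a space.
english = "This is an English example."
# Literal scalar (|): the newline inside the text is still present.
japanese = "これは日本語の\n例です。"

print(fold_lines(english))
print(fold_lines(japanese))
```

The choice between a space and no space is made by the author of the YAML
through the scalar style, so the folding function itself needs no
knowledge of the language.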
> The other possibility would be to have LF be the soft line break, and
> insert it after any whitespace that signifies a linebreak point. In
> the (unlikely?) event that the user wants a paragraph break, use the
> TeX empty line separates paragraphs convention. And anybody who wants
> to do ASCII art, I guess they'll have to be fired. :-)
The input that I am working with is a single paragraph of prose, so I do
not have to deal with such complications.
> Doing this well isn't in scope for ICU because it requires actual
> knowledge of the languages being processed, as well as the formatting
> of the text (indented block quotations, for example), and in some
> cases the author's intent. Doing a bad job isn't in scope for ICU
> either, since the application programmer can do it in a few LOC.
My point of view was that the `BreakIterator` is just as "bad" as a
function that joins fragments of text based on the Unicode block of
neighboring characters. (It does a decent job when dealing with
languages that separate words with spaces.) I shall concede, however.
https://unicode-org.github.io/icu/userguide/boundaryanalysis/
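
For reference, a function of the kind described above might look roughly
like the following Python sketch. It is an illustration under stated
assumptions, not an existing implementation: the names `NO_SPACE_BLOCKS`,
`in_no_space_block`, and `join_lines` are hypothetical, and only a handful
of blocks used by Japanese text are classified, which is exactly the
incompleteness discussed in this thread.

```python
# Assumption: only these blocks are treated as "no space between words".
# A real classification would have to cover every Unicode block.
NO_SPACE_BLOCKS = [
    (0x3000, 0x303F),  # CJK Symbols and Punctuation
    (0x3040, 0x309F),  # Hiragana
    (0x30A0, 0x30FF),  # Katakana
    (0x4E00, 0x9FFF),  # CJK Unified Ideographs
    (0xFF00, 0xFFEF),  # Halfwidth and Fullwidth Forms
]


def in_no_space_block(ch: str) -> bool:
    """Return True if the character falls in one of the listed blocks."""
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in NO_SPACE_BLOCKS)


def join_lines(lines: list[str]) -> str:
    """Join lines, omitting the space only between two no-space characters."""
    out = ""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines; the input is a single paragraph
        # Insert a space unless both neighboring characters are in a
        # block classified as not separating words with spaces.
        if out and not (in_no_space_block(out[-1]) and in_no_space_block(line[0])):
            out += " "
        out += line
    return out


print(join_lines(["This is an", "English example."]))
print(join_lines(["これは日本語の", "例です。"]))
```

Mixed-script boundaries (a Japanese line followed by an English one, for
example) get a space under this rule, which may or may not be what the
author intended. That is one of the edge cases that would need special
explanation.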
> > it would require classifying all Unicode blocks.
>
> Welcome to natural language processing. Last I looked, the number of
> Unicode blocks would fit in a short, maybe a byte, easy peasy. :-)
I was able to work around the issue this time, but for many years I have
been frustrated with software that always inserts a space when joining
lines, so I will likely revisit the problem in the future and classify
those blocks! :)
Thanks again!
Cheers,
Travis