Re: [tlug] Unicode/ICU question about joining lines

Date: Fri, 13 Aug 2021 07:34:28 +0900
From: Travis Cardwell <travis.cardwell@example.com>
Subject: Re: [tlug] Unicode/ICU question about joining lines
References: <CACaJP_QGLoO=qFPSQYUFp3PvZy7O7PTEBFvjBaom4-vPuHZLmw@mail.gmail.com> <24853.16777.99356.678602@turnbull.sk.tsukuba.ac.jp>

Hi Steve!

On Fri, Aug 13, 2021 at 12:48 AM Stephen J. Turnbull wrote:
> Travis Cardwell writes:
> > My problem is straightforward.
>
> I guess the statement is.  The solution is a very contingent thing
> though.

This is the most valuable thing that I learned from the discussion.
From this, I realized that I should implement a solution that is easy
to document/understand and avoid solutions with edge cases that require
special explanation.

> From your replies to various folks, it sounds like you're doing this
> on user input, maybe in an editor.  In that case, just don't use soft
> line breaks in the data (or use real soft linebreaks[1]), and let the
> text widget do the line breaking in the UI.[2]  That's more or less a
> solved problem (hel-lo Gecko, is that a soft linebreak in your pocket
> or are you disappointed to see me?), and can be done with much less
> guesswork, even if language isn't specified.  Eg, hyphenation is no
> problem; do what you like, if it's wrong it's ephemeral because the
> user either spelled with a hyphen or she didn't, and either way the
> one on the screen goes away when she submits the form.

Thank you for the advice.

The input is read from text files, and I estimate that the vast majority
of users will manage the files on GitHub.  Changes to text will often be
done via pull request.  Git(Hub) diffs of wrapped text can be a bit
frustrating, but I think that they are much easier to read than diffs of
extremely long lines of text.

I had an idea for solving the problem in a different way and implemented
it yesterday evening.  The value that I am working with is read from a
YAML block of metadata, so I can make use of the different types of YAML
block scalar syntax.

When using a language that separates words with spaces, users can use a
folded block scalar, which joins lines with a space in between.

    description: >
      This is an
      English example.

When this YAML is parsed, the value is `This is an English example.`
with no newline characters between the words.

When using a language that does not separate words with spaces, users
can use a literal block scalar, which preserves the newlines between
lines.

    description: |
      これは日本語の
      例です。

When this YAML is parsed, the value is `これは日本語の\n例です。` with a
newline in the middle of the text.

The software folds lines by joining them without inserting a space.  The
value of the English example stays the same since it does not include
newline characters, while the value of the Japanese example becomes
`これは日本語の例です。` as desired.
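
As a minimal sketch of that folding step, assuming Python and a
hypothetical `fold_lines` helper (an illustration only, not the actual
implementation), and taking the two values above as they come out of
the YAML parser:

```python
def fold_lines(value: str) -> str:
    """Join the lines of value without inserting any separator."""
    # splitlines() drops the line terminators, including a final one
    # that a YAML parser may leave at the end of a block scalar.
    return "".join(value.splitlines())


# Folded (>) example: the parser already joined the lines with a space.
english = "This is an English example."
# Literal (|) example: the newline between the lines is still present.
japanese = "これは日本語の\n例です。"

print(fold_lines(english))
print(fold_lines(japanese))
```

The choice between `>` and `|` is what carries the "space or no space"
decision, so the folding code itself needs no knowledge of the language.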

> The other possibility would be to have LF be the soft line break, and
> insert it after any whitespace that signifies a linebreak point.  In
> the (unlikely?) event that the user wants a paragraph break, use the
> TeX empty line separates paragraphs convention.  And anybody who wants
> to do ASCII art, I guess they'll have to be fired. :-)

The input that I am working with is a single paragraph of prose, so I do
not have to deal with such complications.

> Doing this well isn't in scope for ICU because it requires actual
> knowledge of the languages being processed, as well as the formatting
> of the text (indented block quotations, for example), and in some
> cases the author's intent.  Doing a bad job isn't in scope for ICU
> either, since the application programmer can do it in a few LOC.

My point of view was that the `BreakIterator` is just as "bad" as a
function that joins fragments of text based on the Unicode block of
neighboring characters.  (It does a decent job when dealing with
languages that separate words with spaces.)  I shall concede, however.

https://unicode-org.github.io/icu/userguide/boundaryanalysis/
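
To make the block-based idea concrete, here is a rough sketch in Python.
Everything in it is an assumption for illustration: the names
`NO_SPACE_BLOCKS`, `in_no_space_block`, and `join_lines` are
hypothetical, the table lists only a handful of blocks relevant to
Japanese, and the rule "omit the space only when both neighboring
characters are in a listed block" is just one possible choice.

```python
# Hypothetical, deliberately incomplete table of Unicode blocks whose
# scripts do not separate words with spaces (ranges are inclusive).
NO_SPACE_BLOCKS = [
    (0x3000, 0x303F),  # CJK Symbols and Punctuation
    (0x3040, 0x309F),  # Hiragana
    (0x30A0, 0x30FF),  # Katakana
    (0x4E00, 0x9FFF),  # CJK Unified Ideographs
    (0xFF00, 0xFFEF),  # Halfwidth and Fullwidth Forms
]


def in_no_space_block(char: str) -> bool:
    """Return True if char falls in one of the listed blocks."""
    code = ord(char)
    return any(lo <= code <= hi for lo, hi in NO_SPACE_BLOCKS)


def join_lines(lines: list[str]) -> str:
    """Join lines, inserting a space unless both neighbors are no-space."""
    result = ""
    for line in lines:
        fragment = line.strip()
        if not fragment:
            continue  # skip blank lines
        if result:
            both_no_space = (
                in_no_space_block(result[-1])
                and in_no_space_block(fragment[0])
            )
            if not both_no_space:
                result += " "
        result += fragment
    return result


print(join_lines(["This is an", "English example."]))
print(join_lines(["これは日本語の", "例です。"]))
```

Mixed-script boundaries and blocks missing from the table are exactly
the kind of edge cases that would need special explanation.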

> > it would require classifying all Unicode blocks.
>
> Welcome to natural language processing.  Last I looked, the number of
> Unicode blocks would fit in a short, maybe a byte, easy peasy. :-)

I was able to work around the issue this time, but I have been
frustrated for many years with software that always inserts a space
when joining lines, so I will likely revisit the problem in the future
and classify those blocks! :)

Thanks again!

Cheers,

Travis
