Mailing List Archive



Re: [tlug] Unicode/ICU question about joining lines



On Thu, Aug 12, 2021 at 1:13 PM Michael Paddon wrote:
> Most notably, a newline is *not* a soft break in Unicode. So you may
> need to elide or replace them in your source text.

In my use case, I am indeed processing source text that includes
formatting.  The source text is parsed into an AST.  The content is used
in multiple ways, and one requirement is to transform the text into a
string that does not include formatting or newlines.  I traverse the
AST, in which soft line breaks are represented as `SOFTBREAK` nodes, and
accumulate the return value.  Joining lines correctly is the only part
of the implementation that is problematic.
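
For illustration, here is a minimal Python sketch of that kind of traversal.
The `Node` class and the `PARA`, `TEXT`, and `EMPH` kinds are hypothetical
stand-ins; only `SOFTBREAK` comes from the description above.  It drops
formatting and starts a new line at each soft break, leaving the joining
step to a separate function.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    # Hypothetical AST node: "kind" names the node type, "text" holds
    # literal content for TEXT nodes, "children" holds nested nodes.
    kind: str
    text: str = ""
    children: list = field(default_factory=list)


def collect_lines(root):
    """Return the plain-text lines of the tree, split at SOFTBREAK nodes."""
    lines = [""]

    def visit(node):
        if node.kind == "SOFTBREAK":
            lines.append("")            # a soft break starts a new line
        elif node.kind == "TEXT":
            lines[-1] += node.text      # literal text extends the current line
        # Formatting nodes (e.g. EMPH) contribute only their children.
        for child in node.children:
            visit(child)

    visit(root)
    return lines


doc = Node("PARA", children=[
    Node("TEXT", "Hello"),
    Node("SOFTBREAK"),
    Node("EMPH", children=[Node("TEXT", "wide")]),
    Node("TEXT", " world"),
])
lines = collect_lines(doc)
```

Keeping the traversal separate from the joining rule means the rule can be
changed or tested without touching the tree walk.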

> This is necessarily
> language aware, but a simple heuristic will cover many cases. If the
> codepoint on either side is Latin or Cyrillic, then turn them into
> spaces, otherwise elide them.

The prototype code that I included in my original question uses this
strategy, where lines are joined according to the Unicode block of the
characters on either side of a line break.  If I use this strategy, I
will do some research to make the function work correctly for major
languages.  Note that I prefer to err on the side of adding spaces:
text in languages that do not use spaces merely looks poorly formatted
when spaces are added, while text in languages that do use spaces
(Korean, for example) can be very difficult to read without them.
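
A minimal Python sketch of such a join, written to err on the side of adding
spaces.  This is not the prototype from my original question: the function
names and the range table are assumptions for illustration, and the table is
deliberately incomplete.  A space is elided only when the characters on both
sides of the break fall in a listed range; every other break becomes a space.

```python
# Code point ranges treated as "no spaces between words".
# Illustrative assumption only; not a complete survey of scripts.
NO_SPACE_RANGES = (
    (0x3000, 0x303F),  # CJK Symbols and Punctuation
    (0x3040, 0x309F),  # Hiragana
    (0x30A0, 0x30FF),  # Katakana
    (0x4E00, 0x9FFF),  # CJK Unified Ideographs
)


def is_no_space(ch):
    """True if ch falls in one of the ranges listed above."""
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in NO_SPACE_RANGES)


def join_lines(lines):
    """Join lines, eliding the break only between two no-space characters."""
    result = ""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip empty lines entirely
        if result and not (is_no_space(result[-1]) and is_no_space(line[0])):
            result += " "  # default: when in doubt, add a space
        result += line
    return result
```

With this table, Hangul is not listed, so Korean lines are joined with a
space, and a break between a Japanese character and a Latin one also gets a
space.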

> Then you can apply the Unicode line breaking algorithm with likely
> good results.

I just need to join lines, not break them.

Cheers,

Travis

