Mailing List Archive



Re: [tlug] Unicode/ICU question about joining lines



On Thu, Aug 12, 2021 at 1:32 PM Jim Breen wrote:
> OK, we now have some context. It's text with embedded markup
> sequences, which are aimed towards some downstream process to
> interpret for line-folding purposes. It might be "\n" or it might be
> "<br>" in the case of HTML. Nothing really to do with Unicode itself.

Indeed, the input that I am working with is markup, with formatting
instructions.  The input is used in multiple ways, one of which requires
plain text (no formatting) without newlines.  First, I parse the input
into an AST and traverse the AST to accumulate a string that lacks
formatting but still contains newlines.  The only part of the problem
that remains is the function that I am asking about.

Unicode is relevant because Unicode properties (defined in the Unicode
Character Database) of the characters on either side of a line break can
be used to determine how the lines should be joined.  I hoped that ICU
would already provide such functionality, but it does not.
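
As a rough sketch of the idea (not something ICU provides), the East
Asian Width property can be read with Python's standard `unicodedata`
module.  The rule below is an assumption chosen for illustration: omit
the space only when the characters on both sides of the break are Wide
or Fullwidth.  The name `needs_space` is hypothetical.  Note that Hangul
is also Wide even though Korean separates words with spaces, so a real
rule would need more than this one property.

```python
import unicodedata

# East Asian Width values treated as "no space needed" in this sketch.
# This set is an assumption, not something ICU or the UCD prescribes.
NO_SPACE_WIDTHS = {"W", "F"}


def needs_space(before: str, after: str) -> bool:
    """Decide whether a space belongs between two characters at a break.

    `before` is the last character of one line and `after` is the
    first character of the next line.
    """
    wide_before = unicodedata.east_asian_width(before) in NO_SPACE_WIDTHS
    wide_after = unicodedata.east_asian_width(after) in NO_SPACE_WIDTHS
    # Omit the space only when both sides are wide or fullwidth.
    return not (wide_before and wide_after)
```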

> Quite easy to do, but it would need to be told the details of the
> sequence to handle (<br>, \n, etc.) and what to do with it (for
> English replace with a space, for Japanese append to preceding
> characters.)

The specific input source and format that I am using is completely
separate from this function, which just needs to deal with plain text
containing newline characters (`\n`).

As for determining how to join lines (with or without a space), one of
the solutions that I am considering is to expose a configuration option
and let users specify what to do, so that I do not need to use ICU at
all.  I would prefer to use ICU to do the correct thing automatically,
however, as that is more user-friendly.
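
The configuration-option alternative could look roughly like the
following.  The names `JoinMode` and `join_lines` are hypothetical, and
the sketch assumes a single global setting rather than a per-language
one.

```python
from enum import Enum


class JoinMode(Enum):
    """Hypothetical user-facing option for how lines are joined."""

    SPACE = " "   # e.g. English: replace each newline with a space
    NOTHING = ""  # e.g. Japanese: simply drop each newline


def join_lines(text: str, mode: JoinMode) -> str:
    """Replace each newline in `text` according to the configured mode."""
    return mode.value.join(text.split("\n"))
```

This needs no Unicode data at all, at the cost of pushing the decision
onto the user and handling mixed-script text poorly.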

> Yes, it's nothing really to do with ICU, in fact Unicode generally
> tries to get as far away as possible from markup or text presentation
> issues. It does have some ancillary information about the
> line-breaking properties of characters to help downstream processes,
> but that's about all.

In my implementation, I use newline characters in the input of the
function only because it is convenient in the markup AST traversal
implementation.  It would be more accurately represented as a list of
text fragments.  (The markup AST traversal would then use two
accumulators: one for the current line and one for complete lines.  I
did not implement it like this because it makes the code longer and
marginally more difficult to understand.  Perhaps I will change my
implementation, however, to make it clear that the result is not
markup.)
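
The two-accumulator traversal described above might be sketched like
this.  The node types `Text`, `LineBreak`, and `Styled` are hypothetical
stand-ins for a markup AST, not the types of any particular library.

```python
from dataclasses import dataclass, field
from typing import List, Union


@dataclass
class Text:
    """Hypothetical leaf node holding literal text."""

    content: str


@dataclass
class LineBreak:
    """Hypothetical leaf node marking a line break in the source."""


@dataclass
class Styled:
    """Hypothetical formatting node (bold, emphasis, ...) with children."""

    children: List["Node"] = field(default_factory=list)


Node = Union[Text, LineBreak, Styled]


def collect_fragments(nodes: List[Node]) -> List[str]:
    """Strip formatting and return one text fragment per line."""
    complete: List[str] = []  # accumulator for finished lines
    current: List[str] = []   # accumulator for the line in progress

    def visit(node: Node) -> None:
        if isinstance(node, Text):
            current.append(node.content)
        elif isinstance(node, LineBreak):
            complete.append("".join(current))
            current.clear()
        else:  # Styled: drop the formatting, keep the text inside
            for child in node.children:
                visit(child)

    for node in nodes:
        visit(node)
    complete.append("".join(current))  # flush the last line
    return complete
```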

The function in question would then take a list of text fragments as
input and return the joined text.  This representation makes it clear
that the function is unrelated to markup.

Haskell:

    foldText :: [Text] -> Text

Python type annotation for those who do not (yet?) grok Haskell type
signatures:

    def fold_text(fragments: Sequence[str]) -> str: ...

IMHO, this is very much in the purview of ICU.
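
To make the shape of such a function concrete, here is a minimal sketch
in Python.  It assumes the decision about each boundary is supplied as a
hypothetical `needs_space` callback, so the joining logic stays separate
from whichever Unicode rule is eventually used.  It also looks only at
single code points on each side of the boundary, which ignores combining
sequences.

```python
from typing import Callable, List, Sequence


def fold_text(
    fragments: Sequence[str],
    needs_space: Callable[[str, str], bool],
) -> str:
    """Join fragments, asking `needs_space` about each boundary."""
    parts: List[str] = []
    previous = ""  # last non-empty fragment seen so far
    for fragment in fragments:
        if not fragment:
            continue  # empty lines contribute no boundary in this sketch
        if previous and needs_space(previous[-1], fragment[0]):
            parts.append(" ")
        parts.append(fragment)
        previous = fragment
    return "".join(parts)


# Toy policy for demonstration only: add a space between ASCII characters.
def ascii_only(before: str, after: str) -> bool:
    return before.isascii() and after.isascii()


print(fold_text(["hello", "world"], ascii_only))  # hello world
print(fold_text(["日本", "語"], ascii_only))        # 日本語
```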

Cheers,

Travis

