
Mailing List Archive
Re: [tlug] Unicode/ICU question about joining lines
On Thu, Aug 12, 2021 at 1:32 PM Jim Breen wrote:
> OK, we now have some context. It's text with embedded markup
> sequences, which are aimed towards some downstream process to
> interpret for line-folding purposes. It might be "\n" or it might be
> "<br>" in the case of HTML. Nothing really to do with Unicode itself.
Indeed, the input that I am working with is markup, with formatting
instructions. The input is used in multiple ways, one of which requires
plain text (no formatting) without newlines. First, I parse the input
into an AST and traverse the AST to accumulate a string that lacks
formatting but still contains newlines. The only part of the problem
that remains is the function that I am asking about.
Unicode is relevant because Unicode properties (defined in the Unicode
Character Database) of the characters on either side of a line break can
be used to determine how the lines should be joined. I hoped that ICU
would already provide such functionality, but it does not.
> Quite easy to do, but it would need to be told the details of the
> sequence to handle (<br>, \n, etc.) and what to do with it (for
> English replace with a space, for Japanese append to preceding
> characters.)
The specific input source and format that I am using is completely
separate from this function, which just needs to deal with plain text
containing newline characters (`\n`).
As for determining how to join lines (with or without a space), one of
the solutions that I am considering is to expose a configuration option
and let users specify what to do, so that I do not need to use ICU at
all. I would prefer to use ICU to do the correct thing automatically,
however, as that is more user-friendly.
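A minimal sketch of the configuration-option approach, for illustration
only. The names `JoinStyle` and `join_lines` are hypothetical, and
dropping empty lines is an assumption made here rather than a stated
requirement:

```python
from enum import Enum


class JoinStyle(Enum):
    """Hypothetical user-facing option; the names are illustrative."""

    SPACE = " "  # e.g. English: replace each newline with a space
    NONE = ""    # e.g. Japanese: join lines directly


def join_lines(text: str, style: JoinStyle) -> str:
    # Assumption: empty lines are dropped so that consecutive newlines
    # do not produce doubled separators.
    lines = [line for line in text.split("\n") if line]
    return style.value.join(lines)
```

With this, `join_lines("hello\nworld", JoinStyle.SPACE)` gives
`"hello world"`, while `JoinStyle.NONE` concatenates the lines as-is.
The cost is that the user has to know which style suits the text.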
> Yes, it's nothing really to do with ICU, in fact Unicode generally
> tries to get as far away as possible from markup or text presentation
> issues. It does have some ancillary information about the
> line-breaking properties of characters to help downstream processes,
> but that's about all.
In my implementation, I use newline characters in the input of the
function only because it is convenient in the markup AST traversal
implementation. The input would be more accurately represented as a list of
text fragments. (The markup AST traversal would then use two
accumulators: one for the current line and one for complete lines. I
did not implement it like this because it makes the code longer and
marginally more difficult to understand. Perhaps I will change my
implementation, however, to make it clear that the result is not
markup.)
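The two-accumulator traversal could look roughly like the sketch below.
The node types (`Text`, `LineBreak`, `Element`) and the function name
`collect_fragments` are hypothetical stand-ins, not the actual AST of
the markup in question:

```python
from dataclasses import dataclass, field
from typing import List, Union


@dataclass
class Text:
    value: str


@dataclass
class LineBreak:
    pass


@dataclass
class Element:
    # Hypothetical formatting node (bold, link, ...); only its children
    # matter when the formatting is discarded.
    children: List["Node"] = field(default_factory=list)


Node = Union[Text, LineBreak, Element]


def collect_fragments(root: Node) -> List[str]:
    complete: List[str] = []  # accumulator for finished lines
    current: List[str] = []   # accumulator for pieces of the line in progress

    def visit(node: Node) -> None:
        if isinstance(node, Text):
            current.append(node.value)
        elif isinstance(node, LineBreak):
            # A break closes the current line and starts a new one.
            complete.append("".join(current))
            current.clear()
        else:
            for child in node.children:
                visit(child)

    visit(root)
    if current:
        # Flush the last line if the input did not end with a break.
        complete.append("".join(current))
    return complete
```

The result is a plain list of strings with no newline characters in it,
which is what makes it clear that the output is no longer markup.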
The function in question would then take a list of text fragments as
input and return the joined text. This representation makes it clear
that the function is unrelated to markup.
Haskell:
foldText :: [Text] -> Text
Python type annotation for those who do not (yet?) grok Haskell type
signatures:
def fold_text(fragments: Sequence[str]) -> str: ...
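To make the idea concrete, here is a rough sketch of such a function
using only the East Asian Width property from Python's standard
`unicodedata` module. This is a crude heuristic chosen for illustration,
not what ICU does and not the UAX #14 line breaking algorithm: it omits
the space only when the characters on both sides of the join are Wide or
Fullwidth. It gets Korean wrong (Hangul is Wide but Korean uses spaces)
and does not trim whitespace at fragment edges. The helper `_is_wide` is
a hypothetical name.

```python
import unicodedata
from typing import Sequence


def _is_wide(ch: str) -> bool:
    # East Asian Width "W" (Wide) or "F" (Fullwidth); a rough proxy for
    # scripts that are written without spaces between words.
    return unicodedata.east_asian_width(ch) in ("W", "F")


def fold_text(fragments: Sequence[str]) -> str:
    result = ""
    for fragment in fragments:
        if not fragment:
            continue  # assumption: empty fragments contribute nothing
        # Insert a space unless both neighbouring characters are wide.
        if result and not (_is_wide(result[-1]) and _is_wide(fragment[0])):
            result += " "
        result += fragment
    return result
```

For example, `fold_text(["Hello", "world"])` yields `"Hello world"`,
while two Japanese fragments are concatenated without a space. A real
solution would need to consult more properties than this one.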
IMHO, this is very much in the purview of ICU.
Cheers,
Travis