Re: [tlug] Unicode/ICU question about joining lines
- Date: Thu, 12 Aug 2021 14:39:43 +0900
- From: Travis Cardwell <travis.cardwell@example.com>
- Subject: Re: [tlug] Unicode/ICU question about joining lines
- References: <CACaJP_QGLoO=qFPSQYUFp3PvZy7O7PTEBFvjBaom4-vPuHZLmw@mail.gmail.com> <CABHGxq5Ma58RaaruwK6x+5o_vBh66fkjHnTjWYpaMC5FYOgOTg@mail.gmail.com> <CACaJP_RQusGfT6BFPP2SDh9oHtd3Y34eEmiHNdJArhwKbg4EdQ@mail.gmail.com> <CABHGxq5YcO1HPOOA_8yuhEmKeubNUfn8cS35OuB3MLMTamjYBQ@mail.gmail.com>

On Thu, Aug 12, 2021 at 1:32 PM Jim Breen wrote:
> OK, we now have some context. It's text with embedded markup
> sequences, which are aimed towards some downstream process to
> interpret for line-folding purposes. It might be "\n" or it might be
> "<br>" in the case of HTML. Nothing really to do with Unicode itself.

Indeed, the input that I am working with is markup, with formatting instructions. The input is used in multiple ways, one of which requires plain text (no formatting) without newlines. First, I parse the input into an AST and traverse the AST to accumulate a string that lacks formatting but still contains newlines. The only part of the problem that remains is the function that I am asking about.

Unicode is relevant because Unicode properties (defined in the Unicode Character Database) of the characters on either side of a line break can be used to determine how the lines should be joined. I hoped that ICU would already provide such functionality, but it does not.

> Quite easy to do, but it would need to be told the details of the
> sequence to handle (<br>, \n, etc.) and what to do with it (for
> English replace with a space, for Japanese append to preceding
> characters.)

The specific input source and format that I am using is completely separate from this function, which just needs to deal with plain text containing newline characters (`\n`).

As for determining how to join lines (with or without a space), one of the solutions that I am considering is to expose a configuration option and let users specify what to do, so that I do not need to use ICU at all. I would prefer to use ICU to do the correct thing automatically, however, as that is more user-friendly.

> Yes, it's nothing really to do with ICU, in fact Unicode generally
> tries to get as far away as possible from markup or text presentation
> issues. It does have some ancillary information about the
> line-breaking properties of characters to help downstream processes,
> but that's about all.

In my implementation, I use newline characters in the input of the function only because it is convenient in the markup AST traversal implementation. It would be more accurately represented as a list of text fragments. (The markup AST traversal would then use two accumulators: one for the current line and one for complete lines. I did not implement it like this because it makes the code longer and marginally more difficult to understand. Perhaps I will change my implementation, however, to make it clear that the result is not markup.)

The function in question would then take a list of text fragments as input and return the joined text. This representation makes it clear that the function is unrelated to markup.

Haskell:

    foldText :: [Text] -> Text

Python type annotation for those who do not (yet?) grok Haskell type signatures:

    def fold_text(fragments: Sequence[str]) -> str: ...

IMHO, this is very much in the purview of ICU.

Cheers,

Travis
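
For illustration, here is a minimal sketch of how a function with the `fold_text` signature above might use a Unicode character property to decide whether to insert a space. It is not the implementation discussed in this thread and it does not use ICU. It assumes a simple heuristic based on the East Asian Width property from Python's standard `unicodedata` module: fragments are joined without a space only when the characters on both sides of the join are Wide or Fullwidth. The helper `_is_wide`, the stripping of surrounding whitespace, and the skipping of empty fragments are all assumptions made for the example.

```python
import unicodedata
from typing import Sequence

# East Asian Width classes treated as "no space needed" at a join.
# Assumption: Wide (W) and Fullwidth (F) only; a real solution may need
# the line-breaking properties of the characters instead.
_WIDE_CLASSES = {"W", "F"}


def _is_wide(ch: str) -> bool:
    """Hypothetical helper: True if ch is a Wide or Fullwidth character."""
    return unicodedata.east_asian_width(ch) in _WIDE_CLASSES


def fold_text(fragments: Sequence[str]) -> str:
    """Join text fragments into a single line.

    A space is inserted between two fragments unless the last character
    of the text so far and the first character of the next fragment are
    both wide (for example, kanji or kana).
    """
    result = ""
    for fragment in fragments:
        # Assumption: surrounding whitespace is not significant, and
        # empty fragments (blank lines) are skipped.
        fragment = fragment.strip()
        if not fragment:
            continue
        if result and not (_is_wide(result[-1]) and _is_wide(fragment[0])):
            result += " "
        result += fragment
    return result


if __name__ == "__main__":
    print(fold_text(["Hello", "world"]))
    print(fold_text(["日本語の", "テキスト"]))
```

Under this heuristic, a join between Japanese and Latin text still gets a space, which may or may not be the desired behaviour. That kind of edge case is one reason a configuration option, as mentioned earlier in the message, could remain useful alongside any automatic rule.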