Mailing List Archive
Re: [tlug] Unicode/ICU question about joining lines
- Date: Thu, 12 Aug 2021 13:55:30 +0900
- From: Travis Cardwell <travis.cardwell@example.com>
- Subject: Re: [tlug] Unicode/ICU question about joining lines
- References: <CACaJP_RQusGfT6BFPP2SDh9oHtd3Y34eEmiHNdJArhwKbg4EdQ@mail.gmail.com> <20559dcf-395a-46ec-ab2c-4cba05dad310@email.android.com>
On Thu, Aug 12, 2021 at 1:13 PM Michael Paddon wrote:
> Most notably, a newline is *not* a soft break in Unicode. So you may
> need to elide or replace them in your source text.

In my use case, I am indeed processing source text that includes
formatting. The source text is parsed into an AST. The content is used
in multiple ways, and one requirement is to transform the text into a
string that does not include formatting or newlines. I traverse the AST,
in which soft line breaks are represented as `SOFTBREAK` nodes, and
accumulate the return value. Joining lines correctly is the only part of
the implementation that is problematic.

> This is necessarily
> language aware, but a simple heuristic will cover many cases. If the
> codepoint on either side is Latin or Cyrillic, then turn them into
> spaces, otherwise elide them.

The prototype code that I included in my original question uses this
strategy, where lines are joined according to the Unicode block of the
characters on either side of a line break. If I use this strategy, I
will do some research to make the function work correctly for major
languages. Note that I prefer to err on the side of adding spaces
because languages without spaces just look poorly formatted when spaces
are added, while languages with spaces can be very difficult to read
without them (example: Korean).

> Then you can apply the Unicode line breaking algorithm with likely
> good results.

I just need to join lines, not break them.

Cheers,

Travis
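
For illustration, the block-based joining strategy discussed above might
look roughly like the following Python sketch. This is not the prototype
code from the original question, which is not reproduced in this message.
The names `NO_SPACE_RANGES`, `in_no_space_range`, and `join_lines` are
hypothetical, and the list of code point ranges is a deliberately small
assumption that would need research before being used for real text. In
keeping with the stated preference to err on the side of adding spaces, a
break is elided only when the characters on both sides fall in a range
assumed to be written without spaces.

```python
# Hypothetical sketch: the ranges below are an illustrative assumption,
# not a complete or authoritative list of scripts written without spaces.
NO_SPACE_RANGES = (
    (0x3000, 0x303F),  # CJK Symbols and Punctuation
    (0x3040, 0x309F),  # Hiragana
    (0x30A0, 0x30FF),  # Katakana
    (0x3400, 0x4DBF),  # CJK Unified Ideographs Extension A
    (0x4E00, 0x9FFF),  # CJK Unified Ideographs
    (0xFF01, 0xFF5E),  # Fullwidth ASCII variants
)


def in_no_space_range(ch: str) -> bool:
    """Return True if ch falls in one of the assumed no-space ranges."""
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in NO_SPACE_RANGES)


def join_lines(lines):
    """Join lines, eliding a break only when both neighbours are no-space.

    Every other combination (including mixed scripts and Hangul) gets a
    single space, which errs on the side of adding spaces.
    """
    result = ""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines rather than emitting double spaces
        if result:
            both_no_space = (in_no_space_range(result[-1])
                             and in_no_space_range(line[0]))
            if not both_no_space:
                result += " "
        result += line
    return result
```

With these assumed ranges, Hangul is intentionally left out, so Korean
lines are joined with a space, while Japanese lines are joined directly.
A line of Japanese followed by a line of Latin text also gets a space.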