tlug.jp Mailing List Archive
Re: [tlug] Unicode/ICU question about joining lines
- Date: Fri, 13 Aug 2021 07:34:28 +0900
- From: Travis Cardwell <travis.cardwell@example.com>
- Subject: Re: [tlug] Unicode/ICU question about joining lines
- References: <CACaJP_QGLoO=qFPSQYUFp3PvZy7O7PTEBFvjBaom4-vPuHZLmw@mail.gmail.com> <24853.16777.99356.678602@turnbull.sk.tsukuba.ac.jp>
Hi Steve!

On Fri, Aug 13, 2021 at 12:48 AM Stephen J. Turnbull wrote:
> Travis Cardwell writes:
> > My problem is straightforward.
>
> I guess the statement is. The solution is a very contingent thing
> though.

This is the most valuable thing that I learned from the discussion.
From this, I realized that I should implement a solution that is easy to
document/understand and avoid solutions with edge cases that require
special explanation.

> From your replies to various folks, it sounds like you're doing this
> on user input, maybe in an editor. In that case, just don't use soft
> line breaks in the data (or use real soft linebreaks[1]), and let the
> text widget do the line breaking in the UI.[2] That's more or less a
> solved problem (hel-lo Gecko, is that a soft linebreak in your pocket
> or are you disappointed to see me?), and can be done with much less
> guesswork, even if language isn't specified. Eg, hyphenation is no
> problem; do what you like, if it's wrong it's ephemeral because the
> user either spelled with a hyphen or she didn't, and either way the
> one on the screen goes away when she submits the form.

Thank you for the advice. The input is read from text files, and I
estimate that the vast majority of users will manage the files on
GitHub. Changes to text will often be done via pull request. Git(Hub)
diffs of wrapped text can be a bit frustrating, but I think that they
are much easier to read than diffs of extremely long lines of text.

I had an idea for solving the problem in a different way and implemented
it yesterday evening. The value that I am working with is read from a
YAML block of metadata, so I can make use of the different types of YAML
block scalar syntax.

When using a language that separates words with spaces, users can use a
folding block scalar, which joins lines with a space in between.

    description: >
      This is an English
      example.

When this YAML is parsed, the value is `This is an English example.`
with no newline characters.
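As a rough illustration of the folding rule, here is a small Python sketch. It is not a YAML parser: the name `fold_simple` is made up for this example, and it assumes every line of the block is non-empty and equally indented. The YAML specification also covers blank lines, more-indented lines, and chomping indicators; with the default "clip" chomping a real parser keeps one final line break, which this sketch leaves out to match the value shown above.

```python
def fold_simple(block_lines):
    """Approximate YAML folded-scalar (`>`) handling for simple input.

    Assumption: all lines are non-empty and share the same indentation,
    so each line break inside the block folds into a single space.
    The final line break that a real parser would keep is omitted.
    """
    return " ".join(line.strip() for line in block_lines)


# The two content lines of the folded block scalar in the example.
value = fold_simple(["  This is an English", "  example."])
print(value)  # This is an English example.
```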
When using a language that does not separate words with spaces, users
can use a literal block scalar, which keeps all but the trailing
newline.

    description: |
      これは日本語の
      例です。

When this YAML is parsed, the value is `これは日本語の\n例です。` with a
newline in the middle of the text.

The software folds lines by joining them without inserting a space. The
value of the English example stays the same since it does not include
newline characters, while the value of the Japanese example becomes
`これは日本語の例です。` as desired.

> The other possibility would be to have LF be the soft line break, and
> insert it after any whitespace that signifies a linebreak point. In
> the (unlikely?) event that the user wants a paragraph break, use the
> TeX empty line separates paragraphs convention. And anybody who wants
> to do ASCII art, I guess they'll have to be fired. :-)

The input that I am working with is a single paragraph of prose, so I do
not have to deal with such complications.

> Doing this well isn't in scope for ICU because it requires actual
> knowledge of the languages being processed, as well as the formatting
> of the text (indented block quotations, for example), and in some
> cases the author's intent. Doing a bad job isn't in scope for ICU
> either, since the application programmer can do it in a few LOC.

My point of view was that the `BreakIterator` is equally "bad" as a
function that joins fragments of text based on the Unicode block of
neighboring characters. (It does a decent job when dealing with
languages that separate words with spaces.) I shall concede, however.

https://unicode-org.github.io/icu/userguide/boundaryanalysis/

> > it would require classifying all Unicode blocks.
>
> Welcome to natural language processing. Last I looked, the number of
> Unicode blocks would fit in a short, maybe a byte, easy peasy.
:-)

I was able to work around the issue this time, but I have been
frustrated for many years with software that always inserts a space when
joining lines, so I will likely revisit the problem in the future and
classify those blocks! :)

Thanks again!

Cheers,

Travis
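The two approaches discussed in this message can be sketched in Python. This is an illustration under stated assumptions, not the software described above. The names `join_lines` and `join_heuristic` are hypothetical. `join_lines` is the workaround: it removes line breaks from an already-parsed value without inserting anything. `join_heuristic` is a stand-in for the "classify the neighboring characters" idea. Instead of classifying Unicode blocks, it uses the East Asian Width property from the standard `unicodedata` module, which is a swapped-in simplification.

```python
import unicodedata


def join_lines(value: str) -> str:
    """Join the lines of a parsed scalar without inserting a space."""
    return "".join(value.splitlines())


def _is_wide(ch: str) -> bool:
    # Assumption: "wide" or "fullwidth" characters belong to scripts
    # that do not separate words with spaces. This is only a heuristic.
    return unicodedata.east_asian_width(ch) in ("W", "F")


def join_heuristic(lines):
    """Join lines, inserting a space unless both neighbors are wide."""
    out = ""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines in this single-paragraph sketch
        if out and not (_is_wide(out[-1]) and _is_wide(line[0])):
            out += " "
        out += line
    return out


print(join_lines("This is an English example."))  # unchanged
print(join_lines("これは日本語の\n例です。"))        # これは日本語の例です。
print(join_heuristic(["This is an English", "example."]))
print(join_heuristic(["これは日本語の", "例です。"]))
```

The heuristic has exactly the kind of edge cases mentioned earlier in the thread. For example, Hangul is also classified as wide, yet Korean separates words with spaces, so the explicit choice of block scalar style is easier to document.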
- Follow-Ups:
- Re: [tlug] Unicode/ICU question about joining lines
- From: eizietheez
- References:
- [tlug] Unicode/ICU question about joining lines
- From: Travis Cardwell
- [tlug] Unicode/ICU question about joining lines
- From: Stephen J. Turnbull