tlug.jp Mailing List Archive
Re: [tlug] Unicode/ICU question about joining lines
- Date: Wed, 11 Aug 2021 20:46:35 +0900
- From: Travis Cardwell <travis.cardwell@example.com>
- Subject: Re: [tlug] Unicode/ICU question about joining lines
- References: <CACaJP_QGLoO=qFPSQYUFp3PvZy7O7PTEBFvjBaom4-vPuHZLmw@mail.gmail.com> <157b93c3-0827-66bd-58f0-30225f5601b1@dcook.org>
Hi Darren!

On Wed, Aug 11, 2021 at 6:01 PM Darren Cook wrote:
> The [unicode] and/or [icu] tags on StackOverflow might help. A quick
> scan looks there is more questions about going the other way, though.

Thank you for the idea! I was unable to find any relevant information, unfortunately.

> Do you need to handle all possible languages? I'd probably start by
> making the "" rule for lines that end in the CJK block characters, as
> well as "-", and use " " as the default rule for all other characters,
> and adapt that as people complain.
>
> But if your source text contains hyphens at the end of lines, people
> will start complaining very quickly. Knowing the difference between a
> non-hyphenated word that got split with a hyphen, and a hyphenated word
> that got split at its hyphen, is a big jump in complexity.

One of the features of the software is that it should indeed handle any language. Fortunately, line folding can be regarded as a convenience. Text is easier to edit when broken up into multiple lines, but users can put the whole value in a single, long line in cases when line folding is not done correctly. (Values typically range between three and twenty lines when wrapped at 80 characters.)

The software does not need to handle hyphenation. That would indeed greatly complicate things.

Starting with rules for languages that I am certain about and then adding rules on user request/complaint (or pull request!) would probably be acceptable. In the prototype, adding a rule just involves adding a line to the `noSpaceBlocks` list. I have found some information about which (major) languages do not use spaces, such as the following:

https://linguistics.stackexchange.com/questions/6131/is-there-a-long-list-of-languages-whose-writing-systems-dont-use-spaces

Another option that is still under consideration is exposing a new configuration value and not using ICU at all. The software will only deal with a single language at a time, so static configuration should be sufficient. (Note that I *may* need to do this anyway due to linking. I would like the software to support building as a static executable, and I have not yet tried building a static executable with ICU. I am hopeful that it will work without issue, though.)

Thank you!

Travis
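
For readers following the thread, here is a rough Python sketch of the two approaches discussed above: a block-based rule (join with "" when the previous line ends in a character from a no-space script, " " otherwise) and a statically configured separator. It is not the prototype's code. The names `NO_SPACE_BLOCKS`, `in_no_space_block`, and `join_lines` are hypothetical, and the block table is a small assumed starting set rather than a complete list. Hyphenation is deliberately ignored, as in the message.

```python
from typing import List, Optional, Tuple

# Hypothetical table of Unicode block ranges for scripts written without
# spaces. This is an assumed starting set; adding a rule means adding a line.
NO_SPACE_BLOCKS: List[Tuple[int, int]] = [
    (0x3000, 0x303F),  # CJK Symbols and Punctuation
    (0x3040, 0x309F),  # Hiragana
    (0x30A0, 0x30FF),  # Katakana
    (0x4E00, 0x9FFF),  # CJK Unified Ideographs
]


def in_no_space_block(ch: str) -> bool:
    """Return True if the character falls in one of the listed blocks."""
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in NO_SPACE_BLOCKS)


def join_lines(lines: List[str], separator: Optional[str] = None) -> str:
    """Join folded lines into a single value.

    If `separator` is given, it is used everywhere (the static
    configuration option). Otherwise the separator is chosen per line
    boundary from the last character of the text joined so far.
    """
    # Drop surrounding whitespace and skip empty lines.
    parts = [p for p in (line.strip() for line in lines) if p]
    if not parts:
        return ""
    result = parts[0]
    for nxt in parts[1:]:
        if separator is not None:
            sep = separator
        elif in_no_space_block(result[-1]):
            sep = ""
        else:
            sep = " "
        result += sep + nxt
    return result
```

With this sketch, `join_lines(["hello", "world"])` yields `"hello world"`, while `join_lines(["日本語の", "文章です。"])` joins without a space. Passing `separator=""` forces the no-space behavior regardless of script, which is roughly what a single static configuration value would do.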