Mailing List Archive


Re: [tlug] Unicode/ICU question about joining lines



Hi Darren!

On Wed, Aug 11, 2021 at 6:01 PM Darren Cook wrote:
> The [unicode] and/or [icu] tags on StackOverflow might help. A quick
> scan suggests there are more questions about going the other way, though.

Thank you for the idea!  I was unable to find any relevant information,
unfortunately.

> Do you need to handle all possible languages? I'd probably start by
> making the "" rule for lines that end in the CJK block characters, as
> well as "-", and use " " as the default rule for all other characters,
> and adapt that as people complain.
>
> But if your source text contains hyphens at the end of lines, people
> will start complaining very quickly. Knowing the difference between a
> non-hyphenated word that got split with a hyphen, and a hyphenated word
> that got split at its hyphen, is a big jump in complexity.

One of the features of the software is that it should indeed handle any
language.  Fortunately, line folding can be regarded as a convenience.
Text is easier to edit when broken up into multiple lines, but users can
always put the whole value on a single long line if folding is not
handled correctly for their language.  (Values typically range between
three and twenty lines when wrapped at 80 characters.)

The software does not need to handle hyphenation.  That would indeed
greatly complicate things.

Starting with rules for languages that I am certain about and then
adding rules on user request/complaint (or pull request!) would probably
be acceptable.  In the prototype, adding a rule just involves adding a
line to the `noSpaceBlocks` list.  I have found some information about
which (major) languages do not use spaces, such as the following:

https://linguistics.stackexchange.com/questions/6131/is-there-a-long-list-of-languages-whose-writing-systems-dont-use-spaces
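
As a rough illustration of the rule-list approach described above, here
is a minimal sketch in Python.  The language, the `NO_SPACE_BLOCKS` table
and its contents, and the function names are all assumptions made for
this example; they are not the prototype's actual `noSpaceBlocks` list.
The sketch follows the suggestion of choosing the separator from the
last character of the preceding line, and it deliberately ignores
hyphenation.

```python
from typing import Iterable

# Hypothetical rule table: inclusive code point ranges (Unicode blocks)
# for scripts assumed to be written without spaces between words.
# Supporting another script would mean adding one line here.
NO_SPACE_BLOCKS = [
    (0x0E00, 0x0E7F),  # Thai
    (0x3000, 0x303F),  # CJK Symbols and Punctuation
    (0x3040, 0x309F),  # Hiragana
    (0x30A0, 0x30FF),  # Katakana
    (0x4E00, 0x9FFF),  # CJK Unified Ideographs
]


def in_no_space_block(ch: str) -> bool:
    """Return True if the character falls in one of the listed blocks."""
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in NO_SPACE_BLOCKS)


def join_lines(lines: Iterable[str]) -> str:
    """Join folded lines into one value.

    A space is inserted between two lines unless the text joined so far
    ends with a character from a no-space block.
    """
    result = ""
    for line in lines:
        piece = line.strip()
        if not piece:
            continue  # skip blank lines
        if result and not in_no_space_block(result[-1]):
            result += " "
        result += piece
    return result
```

With this table, `join_lines(["hello", "world"])` gives `"hello world"`,
while two lines of Japanese text are joined with nothing in between.
Mixed-script boundaries are decided only by the end of the preceding
line, which is a simplification.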

Another option that is still under consideration is exposing a new
configuration value and not using ICU at all.  The software will only
deal with a single language at a time, so static configuration should
be sufficient.  (Note that I *may* need to do this anyway due to
linking.  I would like the software to support building as a static
executable, and I have not yet tried building a static executable with
ICU.  I am hopeful that it will work without issue, though.)
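
The static-configuration alternative could look something like the
following Python sketch.  The `FoldConfig` type, its `join_separator`
field, and the `unfold` function are hypothetical names chosen for
illustration; the point is only that a single configured separator
replaces any per-character lookup, so no Unicode library is needed.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FoldConfig:
    # Hypothetical setting: the string placed between joined lines.
    # " " suits languages that separate words with spaces; "" suits
    # languages that do not.
    join_separator: str = " "


def unfold(value: str, config: FoldConfig) -> str:
    """Join the lines of a folded value using the configured separator."""
    parts = [line.strip() for line in value.splitlines()]
    return config.join_separator.join(p for p in parts if p)
```

A project whose text is in Japanese would set the separator to the
empty string once, for example `unfold(text, FoldConfig(join_separator=""))`,
and every value would be joined the same way.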

Thank you!

Travis

