Mailing List Archive



Re: [tlug] Unicode/ICU question about joining lines



Hi Brandon!

On Fri, Aug 13, 2021 at 10:25 AM <eizietheez@example.com> wrote:
> Admittedly, I am out of my comfort zone here, but isn't orthography an
> orthogonal issue to the script itself? For a fully cross-lingual solution, I
> would suspect that you at least need metadata about what language you are
> processing, in addition to the characters themselves.

Indeed it is.  For example, even European languages with a shared Latin
script have different rules for punctuation (including spacing).  I
think that writing software to normalize text according to the
orthography of the language would be difficult.  My humble function does
not attempt to do this.

The (much more approachable) problem that I would like to work on is
that fragments of text are always joined with spaces, even when the
script does not call for one.  It is impossible to write a function
that works for everybody, for the reasons that have been discussed, but
my idea is to at least handle the common cases.

For example, fragments `例` and `です。` generally do not join with a
space.  Fragments `主にHaskell` and `で実装しました。` are an example
that would not join correctly using the prototype code.  In
the case of dealing with markup, it is up to the user to format the
input appropriately (by choosing where to wrap lines).  Fragments
`主にHaskellで` and `実装しました。` join fine.  (Note that particles
are not 行頭禁則, but human editors would wrap after the particle
anyway.)
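
To make the three cases above concrete, here is a minimal sketch in
Python.  It is not the prototype code itself.  It assumes that the
Unicode East Asian Width property is a good enough proxy for "script
that does not use word-boundary spaces", which is a simplification, and
the names `is_wide` and `join_fragments` are made up for illustration.

```python
import unicodedata


def is_wide(ch: str) -> bool:
    """True if ch is East Asian Wide or Fullwidth (assumed proxy for CJK text)."""
    return unicodedata.east_asian_width(ch) in ("W", "F")


def join_fragments(left: str, right: str) -> str:
    """Join two fragments, omitting the space only when both sides of the seam are wide."""
    left = left.rstrip()
    right = right.lstrip()
    if not left or not right:
        # Nothing to separate when one side is empty.
        return left + right
    if is_wide(left[-1]) and is_wide(right[0]):
        # Wide character on both sides of the seam: no space.
        return left + right
    # Any other combination falls back to a single space.
    return left + " " + right


print(join_fragments("例", "です。"))
print(join_fragments("主にHaskellで", "実装しました。"))
print(join_fragments("主にHaskell", "で実装しました。"))
```

The first two calls join without a space.  The third inserts a space
between `Haskell` and `で`, which is the incorrect result described
above: the check only looks at the two characters at the seam, so the
fix is left to whoever chooses where to wrap the line.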

> What are some examples of scripts that differ in their usage of word-boundary
> spaces depending on language? A dumb example might be classical Latin and Greek
> vs their modern equivalents.

Since languages change over time, there are indeed many examples of
languages that change word-boundary orthography while keeping the same
script.

I saw an interesting example on Stack Overflow: Japanese does not
generally separate words with spaces, but many books for children do!
My daughter's books are written in only ひらがな, and those spaces sure
do help parse the words from long strings of ひらがな!

I do not know of any other examples, though.  Any linguists on the list?

Cheers,

Travis

