> I was able to work around the issue this time, but I have been
> frustrated with software that always inserts a space when joining lines
> for many years, so I will likely revisit the problem in the future and
> classify those blocks! :)

Admittedly, I am out of my comfort zone here, but isn't orthography an
orthogonal issue to the script itself? For a fully cross-lingual solution, I
would suspect that you at least need metadata about what language you are
processing, in addition to the characters themselves.
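
To make that concrete, here is a rough sketch of what I have in mind, not a
claim about how any existing tool does it. The function name `join_lines`, the
`NO_SPACE_LANGS` set, and the use of BCP 47-style language tags are all my own
assumptions for illustration, and the set is certainly incomplete:

```python
# Hypothetical, incomplete set of primary language tags whose orthography
# does not normally separate words with spaces.
NO_SPACE_LANGS = {"zh", "ja", "th", "lo", "km", "my"}


def join_lines(lines, lang):
    """Join wrapped lines, picking the separator from language metadata.

    `lang` is assumed to be a BCP 47-style tag such as "en" or "zh-CN";
    only the primary subtag is consulted.
    """
    primary = lang.split("-")[0].lower()
    sep = "" if primary in NO_SPACE_LANGS else " "
    return sep.join(line.strip() for line in lines)
```

The point is only that the separator is chosen from the language tag, so the
same characters could in principle be joined differently under a different
tag.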

What are some examples of scripts that differ in their usage of word-boundary
spaces depending on language? A dumb example might be classical Latin and Greek
vs their modern equivalents.

