Mailing List Archive



[tlug] Unicode/ICU question about joining lines



Dear TLUG,

I have a Unicode question, and I am posting to this mailing list because
it appears that the Lingo list is not used these days.

My problem is straightforward.  Given a string containing a paragraph of
text with "soft" line breaks, I want to output a string containing the
same text without line breaks.  How lines should be joined depends on
the language.  Languages such as English require a space where two lines
are joined, while languages such as Japanese do not put spaces between
words, so their lines should be joined directly.

Unicode technical reports provide information about text segmentation
and line breaking, and ICU provides functionality for breaking strings
on boundaries of grapheme clusters.  I have not been able to find
information or ICU functionality for joining strings, however.

One idea for a solution is to join lines based on the Unicode blocks of
the characters immediately before and after the line break.  I do not
know if this strategy would handle all cases, however, and it would
require classifying all Unicode blocks.  I am pasting the code for a
prototype below.

It seems like such functionality should exist.  Perhaps there is a
different ICU property that can be used?  What am I missing?  Does
anybody know of a better and/or easier way to solve this problem?

Thanks!

Travis

----

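{-# LANGUAGE OverloadedStrings #-}  -- needed for the " " Text literal below

-- https://hackage.haskell.org/package/text
import qualified Data.Text.Lazy as TL

-- https://hackage.haskell.org/package/text-icu
import qualified Data.Text.ICU.Char as TIC

-- Join the lines of a paragraph.  A space is inserted at each line break
-- unless the characters on both sides are in a "no space" Unicode block.
-- Folding with an empty seed (instead of foldr1) avoids an error on
-- empty input.
foldLines :: TL.Text -> TL.Text
foldLines = foldr go TL.empty . TL.lines
  where
    -- tL is the current line; tR is the already-joined remainder.
    go :: TL.Text -> TL.Text -> TL.Text
    go tL tR = case (lastCharBlock tL, firstCharBlock tR) of
      (Just blockL, Just blockR)
        | blockL `elem` noSpaceBlocks &&
          blockR `elem` noSpaceBlocks -> tL <> tR
        | otherwise -> tL <> " " <> tR
      -- An empty side contributes nothing, so no separator is added.
      (Nothing, _r) -> tR
      (_l, Nothing) -> tL

    -- Unicode block of the last/first character, or Nothing if empty.
    lastCharBlock, firstCharBlock :: TL.Text -> Maybe TIC.BlockCode
    lastCharBlock = fmap (TIC.blockCode . snd) . TL.unsnoc
    firstCharBlock = fmap (TIC.blockCode . fst) . TL.uncons

    -- Blocks whose characters are joined without a space (incomplete).
    noSpaceBlocks :: [TIC.BlockCode]
    noSpaceBlocks =
      [ TIC.CJKSymbolsAndPunctuation
      , TIC.Hiragana
      , TIC.Katakana
      -- TODO
      ]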
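
----

For anyone who wants to try the block-based idea without installing
text-icu, the following is a minimal sketch of the same strategy using
only base.  It is not the prototype above: it works on String rather
than lazy Text, and it replaces the ICU block lookup with a hard-coded
code point range.  The names inNoSpaceBlock, joinPair, and joinLines are
made up for this illustration.  The range U+3000..U+30FF covers only the
three contiguous blocks listed in the prototype (CJK Symbols and
Punctuation, Hiragana, Katakana), so it is just as incomplete.

```haskell
import Data.Char (ord)

-- Assumption: only the three blocks from the prototype count as
-- "no space" blocks.  They are contiguous:
--   CJK Symbols and Punctuation  U+3000..U+303F
--   Hiragana                     U+3040..U+309F
--   Katakana                     U+30A0..U+30FF
inNoSpaceBlock :: Char -> Bool
inNoSpaceBlock c = cp >= 0x3000 && cp <= 0x30FF
  where
    cp = ord c

-- Join two pieces of text.  An empty side contributes nothing; otherwise
-- a space is inserted unless both neighbouring characters are in a
-- "no space" block.
joinPair :: String -> String -> String
joinPair [] r = r
joinPair l [] = l
joinPair l r@(c : _)
  | inNoSpaceBlock (last l) && inNoSpaceBlock c = l ++ r
  | otherwise = l ++ " " ++ r

-- Join all lines of a paragraph.  Folding with an empty seed keeps the
-- function total: empty input yields the empty string.
joinLines :: String -> String
joinLines = foldr joinPair "" . lines
```

With these definitions, joinLines "soft\nbreak" gives "soft break" and
joinLines "ひらがな\nカタカナ" gives "ひらがなカタカナ".  Kanji show the
limitation of the incomplete list: joinLines "日本\n語" gives "日本 語",
because those characters are in the CJK Unified Ideographs block
(starting at U+4E00), which is not classified here.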

