Mailing List Archive
[tlug] Unicode/ICU question about joining lines
- Date: Wed, 11 Aug 2021 17:24:57 +0900
- From: Travis Cardwell <travis.cardwell@example.com>
- Subject: [tlug] Unicode/ICU question about joining lines
Dear TLUG,

I have a Unicode question, and I am posting to this mailing list because it appears that the Lingo list is not used these days.

My problem is straightforward. Given a string containing a paragraph of text with "soft" line breaks, I want to output a string containing the text without line breaks. The way that lines are joined depends on the language. Many languages such as English require spaces, while many languages such as Japanese do not use spaces.

Unicode technical reports provide information about text segmentation and line breaking, and ICU provides functionality for breaking strings on boundaries of grapheme clusters. I have not been able to find information or ICU functionality for joining strings, however.

One idea for a solution is to join lines based on the Unicode blocks of the characters immediately before and after the line break. I do not know if this strategy would handle all cases, however, and it would require classifying all Unicode blocks. I am pasting the code for a prototype below.

It seems like such functionality should exist. Perhaps there is a different ICU property that can be used? What am I missing? Does anybody know of a better and/or easier way to solve this problem?

Thanks!

Travis

----

-- Needed so that the " " literal below is typed as TL.Text
{-# LANGUAGE OverloadedStrings #-}

-- https://hackage.haskell.org/package/text
import qualified Data.Text.Lazy as TL

-- https://hackage.haskell.org/package/text-icu
import qualified Data.Text.ICU.Char as TIC

-- Join the lines of a paragraph, inserting a space unless the characters
-- on both sides of the break are in a "no space" block.
-- Note: foldr1 is partial, so empty input raises an error.
foldLines :: TL.Text -> TL.Text
foldLines = foldr1 go . TL.lines
  where
    go :: TL.Text -> TL.Text -> TL.Text
    go tL tR = case (lastCharBlock tL, firstCharBlock tR) of
      (Just blockL, Just blockR)
        | blockL `elem` noSpaceBlocks && blockR `elem` noSpaceBlocks ->
            tL <> tR
        | otherwise -> tL <> " " <> tR
      (Nothing, _r) -> tR  -- left side is empty
      (_l, Nothing) -> tL  -- right side is empty

    lastCharBlock, firstCharBlock :: TL.Text -> Maybe TIC.BlockCode
    lastCharBlock = fmap (TIC.blockCode . snd) . TL.unsnoc
    firstCharBlock = fmap (TIC.blockCode . fst) . TL.uncons

-- Blocks whose characters are joined without a space
noSpaceBlocks :: [TIC.BlockCode]
noSpaceBlocks =
    [ TIC.CJKSymbolsAndPunctuation
    , TIC.Hiragana
    , TIC.Katakana
    -- TODO
    ]
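For readers who want to try the block-based joining idea without installing text-icu, the following is a minimal sketch that uses only base. It is not the prototype above: the names joinLines, inNoSpaceBlock, and noSpaceRanges are hypothetical, the ICU block lookup is replaced by hard-coded code point ranges, and the CJK Unified Ideographs range is an addition that does not appear in the original list. Like the prototype, the range list is incomplete. Unlike the prototype, it uses foldr with an empty starting value, so empty input yields an empty string instead of an error.

```haskell
import Data.Char (ord)

-- Code point ranges treated as "no space" blocks (an incomplete,
-- illustrative list; the last entry is an assumption added here).
noSpaceRanges :: [(Int, Int)]
noSpaceRanges =
    [ (0x3000, 0x303F)  -- CJK Symbols and Punctuation
    , (0x3040, 0x309F)  -- Hiragana
    , (0x30A0, 0x30FF)  -- Katakana
    , (0x4E00, 0x9FFF)  -- CJK Unified Ideographs
    ]

-- True if the character falls in one of the ranges above.
inNoSpaceBlock :: Char -> Bool
inNoSpaceBlock c = any inRange noSpaceRanges
  where
    n = ord c
    inRange (lo, hi) = lo <= n && n <= hi

-- Join lines, inserting a space unless the characters on both sides of
-- the break are in a "no space" range. Empty lines are dropped first,
-- so every left-hand line passed to 'go' is non-empty.
joinLines :: String -> String
joinLines = foldr go "" . filter (not . null) . lines
  where
    go l [] = l
    go l r@(c : _)
      | inNoSpaceBlock (last l) && inNoSpaceBlock c = l ++ r
      | otherwise = l ++ " " ++ r
```

In GHCi, joinLines "hello\nworld" evaluates to "hello world", while two lines that end and begin with kana or kanji are concatenated directly. A line break between Japanese and Latin text still gets a space, which may or may not be what is wanted.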
- Follow-Ups:
- Re: [tlug] Unicode/ICU question about joining lines
- From: Darren Cook
- Re: [tlug] Unicode/ICU question about joining lines
- From: Jim Breen
- Re: [tlug] Unicode/ICU question about joining lines
- From: Raymond Wan
- [tlug] Unicode/ICU question about joining lines
- From: Stephen J. Turnbull