Mailing List Archive
Re: [tlug] Unicode/ICU question about joining lines
- Date: Thu, 12 Aug 2021 11:58:08 +1000
- From: Jim Breen <jimbreen@example.com>
- Subject: Re: [tlug] Unicode/ICU question about joining lines
- References: <CACaJP_QGLoO=qFPSQYUFp3PvZy7O7PTEBFvjBaom4-vPuHZLmw@mail.gmail.com>
Pardon the top-posting in the response, but I am struggling a little
with your basic question.

> Given a string containing a paragraph of
> text with "soft" line breaks,

What exactly do you mean by a _"soft" line break_? Is it a specific
character?

> I want to output a string containing the
> text without line breaks.

Output to what? Write it to a file (as in fprintf() in C), display it
on a screen, chisel it on stone, ...?

> The way that lines are joined depends on the
> language. Many languages such as English require spaces, while many
> languages such as Japanese do not use spaces.

Don't you really mean "[t]he way that lines are *broken* when
displaying, printing, etc. depends ....."?

Sorry if this is being difficult or pedantic, but I can't get my head
around the question itself.

Cheers

Jim

On Wed, 11 Aug 2021 at 18:25, Travis Cardwell
<travis.cardwell@example.com> wrote:
>
> Dear TLUG,
>
> I have a Unicode question, and I am posting to this mailing list because
> it appears that the Lingo list is not used these days.
>
> My problem is straightforward. Given a string containing a paragraph of
> text with "soft" line breaks, I want to output a string containing the
> text without line breaks. The way that lines are joined depends on the
> language. Many languages such as English require spaces, while many
> languages such as Japanese do not use spaces.
>
> Unicode technical reports provide information about text segmentation
> and line breaking, and ICU provides functionality for breaking strings
> on boundaries of grapheme clusters. I have not been able to find
> information or ICU functionality for joining strings, however.
>
> One idea for a solution is to join lines based on the Unicode blocks of
> the characters immediately before and after the line break. I do not
> know if this strategy would handle all cases, however, and it would
> require classifying all Unicode blocks. I am pasting the code for a
> prototype below.
>
> It seems like such functionality should exist. Perhaps there is a
> different ICU property that can be used? What am I missing? Does
> anybody know of a better and/or easier way to solve this problem?
>
> Thanks!
>
> Travis
>
> ----
>
> -- https://hackage.haskell.org/package/text
> import qualified Data.Text.Lazy as TL
>
> -- https://hackage.haskell.org/package/text-icu
> import qualified Data.Text.ICU.Char as TIC
>
> foldLines :: TL.Text -> TL.Text
> foldLines = foldr1 go . TL.lines
>   where
>     go :: TL.Text -> TL.Text -> TL.Text
>     go tL tR = case (lastCharBlock tL, firstCharBlock tR) of
>       (Just blockL, Just blockR)
>         | blockL `elem` noSpaceBlocks &&
>           blockR `elem` noSpaceBlocks -> tL <> tR
>         | otherwise -> tL <> " " <> tR
>       (Nothing, _r) -> tR
>       (_l, Nothing) -> tL
>
>     lastCharBlock, firstCharBlock :: TL.Text -> Maybe TIC.BlockCode
>     lastCharBlock = fmap (TIC.blockCode . snd) . TL.unsnoc
>     firstCharBlock = fmap (TIC.blockCode . fst) . TL.uncons
>
>     noSpaceBlocks :: [TIC.BlockCode]
>     noSpaceBlocks =
>       [ TIC.CJKSymbolsAndPunctuation
>       , TIC.Hiragana
>       , TIC.Katakana
>       -- TODO
>       ]
>

--
Jim Breen
Adjunct Snr Research Fellow, Japanese Studies Centre, Monash University
http://www.jimbreen.org/
http://nihongo.monash.edu/
- Follow-Ups:
- Re: [tlug] Unicode/ICU question about joining lines
- From: Travis Cardwell
- References:
- [tlug] Unicode/ICU question about joining lines
- From: Travis Cardwell