Re: [tlug] Unicode/ICU question about joining lines

Pardon the top-posting in the response, but I am struggling a little
with your basic question.

> Given a string containing a paragraph of
> text with "soft" line breaks,

What exactly do you mean by a _"soft" line break_? Is it a specific character?

> I want to output a string containing the
> text without line breaks.

Output to what? Write it to a file (as in fprintf() in C), display it
on a screen, chisel it on stone, ...?

> The way that lines are joined depends on the
> language.  Many languages such as English require spaces, while many
> languages such as Japanese do not use spaces.

Don't you really mean "[t]he way that lines are *broken* when
displaying, printing, etc. depends ..."?

Sorry if this is being difficult or pedantic, but I can't get my head
around the question itself.



> ----
> --
> import qualified Data.Text.Lazy as TL
> --
> import qualified Data.Text.ICU.Char as TIC
> foldLines :: TL.Text -> TL.Text
> foldLines t = case TL.lines t of
>     [] -> TL.empty  -- foldr1 would throw on empty input
>     ls -> foldr1 go ls
>   where
>     go :: TL.Text -> TL.Text -> TL.Text
>     go tL tR = case (lastCharBlock tL, firstCharBlock tR) of
>       (Just blockL, Just blockR)
>         | blockL `elem` noSpaceBlocks &&
>           blockR `elem` noSpaceBlocks -> tL <> tR
>         | otherwise -> tL <> TL.singleton ' ' <> tR  -- no OverloadedStrings needed
>       (Nothing, _r) -> tR
>       (_l, Nothing) -> tL
>     lastCharBlock, firstCharBlock :: TL.Text -> Maybe TIC.BlockCode
>     lastCharBlock = fmap (TIC.blockCode . snd) . TL.unsnoc
>     firstCharBlock = fmap (TIC.blockCode . fst) . TL.uncons
>     noSpaceBlocks :: [TIC.BlockCode]
>     noSpaceBlocks =
>       [ TIC.CJKSymbolsAndPunctuation
>       , TIC.Hiragana
>       , TIC.Katakana
>       -- TODO
>       ]
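
For what it's worth, if I read the quoted code literally, the rule it
applies can be tried without text-icu. The following is only a rough
sketch under my own assumptions: every U+000A is treated as a "soft"
break, plain String is used instead of lazy Text, and the names
noSpaceChar, joinPair and joinLines are mine, not from the original.
The three code point ranges stand in for the three blocks listed in
noSpaceBlocks.

```haskell
-- Assumption: a "no-space" character is one in these three Unicode blocks,
-- mirroring the (unfinished, marked TODO) noSpaceBlocks list in the quoted code.
noSpaceChar :: Char -> Bool
noSpaceChar c =
     (c >= '\x3000' && c <= '\x303F')  -- CJK Symbols and Punctuation
  || (c >= '\x3040' && c <= '\x309F')  -- Hiragana
  || (c >= '\x30A0' && c <= '\x30FF')  -- Katakana

-- Join two lines: no separator if both boundary characters are "no-space",
-- otherwise a single space. An empty side contributes nothing.
joinPair :: String -> String -> String
joinPair [] r = r
joinPair l [] = l
joinPair l r@(b:_)
  | noSpaceChar (last l) && noSpaceChar b = l ++ r
  | otherwise                             = l ++ " " ++ r

-- Assumption: every newline is a soft break. Empty input gives empty output.
joinLines :: String -> String
joinLines = foldr joinPair "" . lines
```

With this sketch, joinLines "hello\nworld" gives "hello world" and
joinLines "これは\nテスト" gives "これはテスト". Note that
joinLines "日本\n語" gives "日本 語", because kanji fall outside the
three listed blocks, which is presumably what the TODO is about.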

Jim Breen
Adjunct Snr Research Fellow, Japanese Studies Centre, Monash University

