Mailing List Archive



Re: [tlug] Unicode/ICU question about joining lines



Pardon the top-posting in the response, but I am struggling a little
with your basic question.

> Given a string containing a paragraph of
> text with "soft" line breaks,

What exactly do you mean by a _"soft" line break_? Is it a specific character?

> I want to output a string containing the
> text without line breaks.

Output to what? Write it to a file (as in fprintf() in C), display it
on a screen, chisel it on stone, ...?

> The way that lines are joined depends on the
> language.  Many languages such as English require spaces, while many
> languages such as Japanese do not use spaces.

Don't you really mean "[t]he way that lines are *broken* when
displaying, printing, etc. depends ....."?

Sorry if this is being difficult or pedantic, but I can't get my head
around the question itself.

Cheers

Jim


On Wed, 11 Aug 2021 at 18:25, Travis Cardwell
<travis.cardwell@example.com> wrote:
>
> Dear TLUG,
>
> I have a Unicode question, and I am posting to this mailing list because
> it appears that the Lingo list is not used these days.
>
> My problem is straightforward.  Given a string containing a paragraph of
> text with "soft" line breaks, I want to output a string containing the
> text without line breaks.  The way that lines are joined depends on the
> language.  Many languages such as English require spaces, while many
> languages such as Japanese do not use spaces.
>
> Unicode technical reports provide information about text segmentation
> and line breaking, and ICU provides functionality for breaking strings
> on boundaries of grapheme clusters.  I have not been able to find
> information or ICU functionality for joining strings, however.
>
> One idea for a solution is to join lines based on the Unicode blocks of
> the characters immediately before and after the line break.  I do not
> know if this strategy would handle all cases, however, and it would
> require classifying all Unicode blocks.  I am pasting the code for a
> prototype below.
>
> It seems like such functionality should exist.  Perhaps there is a
> different ICU property that can be used?  What am I missing?  Does
> anybody know of a better and/or easier way to solve this problem?
>
> Thanks!
>
> Travis
>
> ----
>
> {-# LANGUAGE OverloadedStrings #-} -- needed for the " " Text literal below
>
> -- https://hackage.haskell.org/package/text
> import qualified Data.Text.Lazy as TL
>
> -- https://hackage.haskell.org/package/text-icu
> import qualified Data.Text.ICU.Char as TIC
>
> foldLines :: TL.Text -> TL.Text
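> -- foldr with an empty seed gives the same result as foldr1 on non-empty
> -- input and does not fail when the input has no lines.
> foldLines = foldr go TL.empty . TL.lines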
>   where
>     go :: TL.Text -> TL.Text -> TL.Text
>     go tL tR = case (lastCharBlock tL, firstCharBlock tR) of
>       (Just blockL, Just blockR)
>         | blockL `elem` noSpaceBlocks &&
>           blockR `elem` noSpaceBlocks -> tL <> tR
>         | otherwise -> tL <> " " <> tR
>       (Nothing, _r) -> tR
>       (_l, Nothing) -> tL
>
>     lastCharBlock, firstCharBlock :: TL.Text -> Maybe TIC.BlockCode
>     lastCharBlock = fmap (TIC.blockCode . snd) . TL.unsnoc
>     firstCharBlock = fmap (TIC.blockCode . fst) . TL.uncons
>
>     noSpaceBlocks :: [TIC.BlockCode]
>     noSpaceBlocks =
>       [ TIC.CJKSymbolsAndPunctuation
>       , TIC.Hiragana
>       , TIC.Katakana
>       -- TODO
>       ]
>
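
For readers who want to try the joining idea from the quoted message without
installing text-icu, here is a minimal sketch that uses only base. It is not
the prototype above and not an ICU facility: it replaces the ICU block lookup
with a hand-written table of code point ranges, which is an assumption made
for illustration and is far from complete. The names isNoSpaceChar,
noSpaceRanges, joinPair and joinLines are hypothetical.

```haskell
import Data.Char (ord)

-- Code point ranges treated as "no space needed" (illustrative, incomplete).
noSpaceRanges :: [(Int, Int)]
noSpaceRanges =
  [ (0x3000, 0x303F) -- CJK Symbols and Punctuation
  , (0x3040, 0x309F) -- Hiragana
  , (0x30A0, 0x30FF) -- Katakana
  , (0x4E00, 0x9FFF) -- CJK Unified Ideographs
  , (0xFF00, 0xFFEF) -- Halfwidth and Fullwidth Forms
  ]

-- True when the character falls inside one of the ranges above.
isNoSpaceChar :: Char -> Bool
isNoSpaceChar c = any inRange noSpaceRanges
  where
    n = ord c
    inRange (lo, hi) = lo <= n && n <= hi

-- Join two lines. A space is inserted unless the characters on both
-- sides of the break are no-space characters. Empty lines are dropped.
joinPair :: String -> String -> String
joinPair l r = case (reverse l, r) of
  ([], _) -> r
  (_, []) -> l
  (a : _, b : _)
    | isNoSpaceChar a && isNoSpaceChar b -> l ++ r
    | otherwise -> l ++ " " ++ r

-- Join every line of a paragraph.
joinLines :: String -> String
joinLines = foldr joinPair "" . lines
```

With this table, joinLines "Hello\nworld" gives "Hello world" and
joinLines "日本語の\nテキスト" gives "日本語のテキスト". A break between Latin
and Japanese text, as in "Linux\nは楽しい", still gets a space, which may or
may not be what is wanted; that choice mirrors the quoted prototype.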


-- 
Jim Breen
Adjunct Snr Research Fellow, Japanese Studies Centre, Monash University
http://www.jimbreen.org/
http://nihongo.monash.edu/

