Mailing List Archive


Re: [tlug] Unicode/ICU question about joining lines



On Thu, Aug 12, 2021 at 2:42 PM Benjamin Kowarsch wrote:
> Does the language need to be detected by analysing the text?

My goal is to create a function that automatically determines how to
join lines/fragments of text, based on their content.  In my first
post, I included some code that does this based on the Unicode block
of neighboring characters.  This strategy works, but it requires
classifying the many Unicode blocks, and I was hoping that there would
be an easier way.
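
The block-based strategy can be sketched roughly as follows.  This is a
minimal illustration, not the code from the first post: the names
isUnspacedScript, separator, and joinAuto are hypothetical, the list of
blocks is deliberately incomplete, and omitting the space whenever
either neighboring character belongs to an unspaced script is an
assumption.

```haskell
{-# LANGUAGE OverloadedStrings #-}

import Data.Char (ord)
import qualified Data.Text.Lazy as TL

-- | True for characters in a few Unicode blocks whose scripts are normally
-- written without spaces between words.  Illustrative only: a real
-- classifier would need to cover many more blocks.
isUnspacedScript :: Char -> Bool
isUnspacedScript c = any inRange ranges
  where
    n = ord c
    inRange (lo, hi) = lo <= n && n <= hi
    ranges =
      [ (0x3000, 0x303F) -- CJK Symbols and Punctuation
      , (0x3040, 0x309F) -- Hiragana
      , (0x30A0, 0x30FF) -- Katakana
      , (0x4E00, 0x9FFF) -- CJK Unified Ideographs
      , (0xFF00, 0xFFEF) -- Halfwidth and Fullwidth Forms
      ]

-- | Separator to place between two neighboring fragments, decided by the
-- last character of the first and the first character of the second.
-- Assumption: no space if either neighbor is in an unspaced script.
separator :: TL.Text -> TL.Text -> TL.Text
separator prev next
  | TL.null prev || TL.null next = ""
  | isUnspacedScript (TL.last prev) || isUnspacedScript (TL.head next) = ""
  | otherwise = " "

-- | Join fragments, choosing the separator at each boundary.
-- Empty fragments are dropped first so they cannot hide a boundary.
joinAuto :: [TL.Text] -> TL.Text
joinAuto fragments = case filter (not . TL.null) fragments of
  [] -> ""
  (x : xs) -> TL.concat (x : concat (zipWith glue (x : xs) xs))
  where
    glue prev next = [separator prev next, next]
```

With these definitions, joinAuto ["日本語の", "例です。"] joins without a
space, while joinAuto ["Hello", "world"] inserts one.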

> Or can the language be an input parameter?

I am also considering this option.  In that case, I would expose a
configuration option to let users decide how text is joined.

> If it can be an input parameter this is quite trivial.
>
> For languages without whitespace separation between words:
> Simply process the input to delete the soft line breaks.
>
> For languages with whitespace separation between words:
> Simply process the input to replace the soft line breaks with whitespace
> UNLESS that soft-break is preceded or followed by a whitespace, in which
> case you can simply delete the soft-break.

In my case, the input is already parsed, so I do not need to worry about
cases such as extra whitespace.
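
For input that has not been parsed yet, the quoted rule could be
sketched as below.  This is only an illustration under assumptions: a
soft line break is taken to be a single '\n', blank lines are dropped
rather than treated as paragraph breaks, and the name unwrap is
hypothetical.  JoinType mirrors the type used in the implementation in
this email.

```haskell
{-# LANGUAGE OverloadedStrings #-}

import Data.Char (isSpace)
import qualified Data.Text.Lazy as TL

data JoinType = JoinWithSpace | JoinWithoutSpace

-- | Remove soft line breaks ('\n') from raw text.
-- JoinWithoutSpace: every break is simply deleted.
-- JoinWithSpace: a break becomes one space, unless it is already
-- preceded or followed by whitespace, in which case it is deleted.
unwrap :: JoinType -> TL.Text -> TL.Text
unwrap joinType = go . filter (not . TL.null) . TL.lines
  where
    go [] = ""
    go (x : xs) = TL.concat (x : concat (zipWith glue (x : xs) xs))

    glue prev next = [sep prev next, next]

    -- prev and next are non-empty here because empty lines were filtered.
    sep prev next = case joinType of
      JoinWithoutSpace -> ""
      JoinWithSpace
        | isSpace (TL.last prev) || isSpace (TL.head next) -> ""
        | otherwise -> " "
```

For example, unwrap JoinWithSpace applied to "trailing \nspace" keeps the
existing space instead of doubling it.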

Since I am always trying to get you interested in Haskell, here is an
implementation! ;)

    data JoinType = JoinWithSpace | JoinWithoutSpace

    foldText :: JoinType -> [TL.Text] -> TL.Text
    foldText = \case
        JoinWithSpace    -> TL.unwords
        JoinWithoutSpace -> mconcat

(I am pasting the full source at the bottom of this email in case you
want to try a demo.)

> If you need to detect the language, you might want to do this as a
> preparatory step before doing the above.

Indeed; separation of concerns is great advice!

Cheers,

Travis

----

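{-# LANGUAGE LambdaCase #-}
{-# LANGUAGE OverloadedStrings #-}

module Main where

-- Requires the "text" package.
import qualified Data.Text.Lazy as TL
import qualified Data.Text.Lazy.IO as TLIO

-- | How neighboring fragments of text are joined.
data JoinType = JoinWithSpace | JoinWithoutSpace

-- | Join already-parsed fragments according to the given 'JoinType'.
foldText :: JoinType -> [TL.Text] -> TL.Text
foldText = \case
    JoinWithSpace    -> TL.unwords  -- one space between fragments
    JoinWithoutSpace -> mconcat     -- plain concatenation

-- Demo: Japanese text is joined without a space.
main :: IO ()
main = TLIO.putStrLn $ foldText JoinWithoutSpace ["日本語の", "例です。"]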
