
Re: [tlug] Unicode/ICU question about joining lines
On Thu, Aug 12, 2021 at 2:42 PM Benjamin Kowarsch wrote:
> Does the language need to be detected by analysing the text?
My goal is to create a function that determines how to join the
lines/fragments of text automatically, based on the content. In my
first post, I included some code that does this based on the Unicode
block of neighboring characters. This strategy works, but it requires
classifying the many Unicode blocks, and I had hoped that there would be
an easier way.
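
A minimal sketch of what a block-based strategy could look like, for illustration only. It is not the code from the first post. The names `isUnspacedChar`, `joinTypeBetween`, and `joinAuto` are hypothetical, and the block ranges listed are a small, incomplete sample, which is exactly the classification burden described above. It also assumes that a boundary next to an empty fragment needs no space.

```haskell
import Data.Char (ord)
import qualified Data.Text.Lazy as TL

data JoinType = JoinWithSpace | JoinWithoutSpace
  deriving (Eq, Show)

-- | Hypothetical helper: True for characters in a few Unicode blocks whose
-- scripts are normally written without spaces between words.
-- The list is illustrative and far from complete.
isUnspacedChar :: Char -> Bool
isUnspacedChar c = any inRange ranges
  where
    n = ord c
    inRange (lo, hi) = lo <= n && n <= hi
    ranges =
      [ (0x3000, 0x303F) -- CJK Symbols and Punctuation
      , (0x3040, 0x309F) -- Hiragana
      , (0x30A0, 0x30FF) -- Katakana
      , (0x4E00, 0x9FFF) -- CJK Unified Ideographs
      , (0xFF00, 0xFFEF) -- Halfwidth and Fullwidth Forms
      ]

-- | Decide how to join two neighboring fragments by looking at the
-- characters on either side of the boundary.
joinTypeBetween :: TL.Text -> TL.Text -> JoinType
joinTypeBetween left right
  -- Assumption: nothing needs separating next to an empty fragment.
  | TL.null left || TL.null right = JoinWithoutSpace
  | isUnspacedChar (TL.last left) && isUnspacedChar (TL.head right) =
      JoinWithoutSpace
  | otherwise = JoinWithSpace

-- | Join a list of fragments, choosing the separator at each boundary.
joinAuto :: [TL.Text] -> TL.Text
joinAuto [] = TL.empty
joinAuto (x : xs) = TL.concat (x : go x xs)
  where
    go _ [] = []
    go prev (y : ys) = sep prev y : y : go y ys
    sep a b = case joinTypeBetween a b of
      JoinWithSpace    -> TL.singleton ' '
      JoinWithoutSpace -> TL.empty
```

With these definitions, two Japanese fragments are concatenated directly, while two Latin-script fragments get a single space between them. A boundary between mixed scripts falls back to a space, which is a design choice of this sketch rather than a rule.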
> Or can the language be an input parameter?
I am also considering this option. In this case, I will expose a
configuration option to let users decide how text is joined.
> If it can be an input parameter this is quite trivial.
>
> For languages without whitespace separation between words:
> Simply process the input to delete the soft line breaks.
>
> For languages with whitespace separation between words:
> Simply process the input to replace the soft line breaks with whitespace
> UNLESS that soft-break is preceded or followed by a whitespace, in which
> case you can simply delete the soft-break.
In my case, the input is already parsed, so I do not need to worry about
cases such as extra whitespace.
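
For input that has not been parsed already, the whitespace rule quoted above could be sketched as follows. This assumes the text has already been split into fragments at the soft line breaks, so "deleting the soft break" becomes plain concatenation. The names `joinSoftBreak` and `joinSoftBreaks` are hypothetical.

```haskell
import Data.Char (isSpace)
import qualified Data.Text.Lazy as TL

-- | Join two fragments that were separated by a soft line break, for a
-- language that separates words with whitespace. A space is inserted
-- unless whitespace is already present on either side of the break.
joinSoftBreak :: TL.Text -> TL.Text -> TL.Text
joinSoftBreak left right
  -- Assumption: an empty fragment contributes nothing, so no space is added.
  | TL.null left || TL.null right = left <> right
  | isSpace (TL.last left) || isSpace (TL.head right) = left <> right
  | otherwise = left <> TL.singleton ' ' <> right

-- | Join a whole list of fragments using the rule above.
joinSoftBreaks :: [TL.Text] -> TL.Text
joinSoftBreaks = foldr joinSoftBreak TL.empty
```

For languages without whitespace between words, `mconcat` already does the job, as in `foldText` below.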
Since I am always trying to get you interested in Haskell, here is an
implementation! ;)

data JoinType = JoinWithSpace | JoinWithoutSpace

foldText :: JoinType -> [TL.Text] -> TL.Text
foldText = \case
  JoinWithSpace    -> TL.unwords
  JoinWithoutSpace -> mconcat

(I am pasting the full source at the bottom of this email in case you
want to try a demo.)
> If you need to detect the language, you might want to do this as a
> preparatory step before doing the above.
Indeed; separation of concerns is great advice!
Cheers,
Travis
----
{-# LANGUAGE LambdaCase #-}
{-# LANGUAGE OverloadedStrings #-}

module Main where

import qualified Data.Text.Lazy as TL
import qualified Data.Text.Lazy.IO as TLIO

data JoinType = JoinWithSpace | JoinWithoutSpace

foldText :: JoinType -> [TL.Text] -> TL.Text
foldText = \case
  JoinWithSpace    -> TL.unwords
  JoinWithoutSpace -> mconcat

main :: IO ()
main = TLIO.putStrLn $ foldText JoinWithoutSpace ["日本語の", "例です。"]