Mailing List Archive
Re: [tlug] Unicode/ICU question about joining lines
- Date: Thu, 12 Aug 2021 15:09:03 +0900
- From: Travis Cardwell <travis.cardwell@example.com>
- Subject: Re: [tlug] Unicode/ICU question about joining lines
- References: <CACaJP_QGLoO=qFPSQYUFp3PvZy7O7PTEBFvjBaom4-vPuHZLmw@mail.gmail.com> <CAAhy3dufsFDgNaF0V5yq0-VxKSyC5kkF4kF7dLabnYKk8o67rQ@mail.gmail.com> <CADR0rncr_fhnuKBzk1qqx=3niZnHBEQp31k5XrjnuFzdNwR-Vg@mail.gmail.com>
On Thu, Aug 12, 2021 at 2:42 PM Benjamin Kowarsch wrote:
> Does the language need to be detected by analysing the text?

My goal is to create a function that determines how to join the lines/fragments of text automatically, based on the content. In my first post, I included some code that does this based on the Unicode block of neighboring characters. This strategy works, but it requires classifying the many Unicode blocks, and I hoped that there is an easier way.

> Or can the language be an input parameter?

I am also considering this option. In this case, I will expose a configuration option to let users decide how text is joined.

> If it can be an input parameter this is quite trivial.
>
> For languages without whitespace separation between words:
> Simply process the input to delete the soft line breaks.
>
> For languages with whitespace separation between words:
> Simply process the input to replace the soft line breaks with whitespace
> UNLESS that soft-break is preceded or followed by a whitespace, in which
> case you can simply delete the soft-break.

In my case, the input is already parsed, so I do not need to worry about cases such as extra whitespace. Since I am always trying to get you interested in Haskell, here is an implementation! ;)

    data JoinType
      = JoinWithSpace
      | JoinWithoutSpace

    foldText :: JoinType -> [TL.Text] -> TL.Text
    foldText = \case
      JoinWithSpace    -> TL.unwords
      JoinWithoutSpace -> mconcat

(I am pasting the full source at the bottom of this email in case you want to try a demo.)

> If you need to detect the language, you might want to do this as a
> preparatory step before doing the above.

Indeed; separation of concerns is great advice!
Cheers,

Travis

----

{-# LANGUAGE LambdaCase #-}
{-# LANGUAGE OverloadedStrings #-}

module Main where

import qualified Data.Text.Lazy as TL
import qualified Data.Text.Lazy.IO as TLIO

data JoinType
  = JoinWithSpace
  | JoinWithoutSpace

foldText :: JoinType -> [TL.Text] -> TL.Text
foldText = \case
  JoinWithSpace    -> TL.unwords
  JoinWithoutSpace -> mconcat

main :: IO ()
main = TLIO.putStrLn $ foldText JoinWithoutSpace ["日本語の", "例です。"]
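For comparison, the content-based option discussed above could be sketched roughly as follows. This is not the code from the first post (which is not reproduced in this message), only an illustration of the idea: look at the two characters on either side of each boundary and decide whether a space is needed. The names isUnspaced, separator and foldTextAuto are hypothetical, and the handful of code point ranges is an assumption that covers common Japanese text only, so a real implementation would need a much fuller classification. The whitespace check follows the "UNLESS ... preceded or followed by a whitespace" rule quoted above.

```haskell
{-# LANGUAGE OverloadedStrings #-}

import Data.Char (isSpace, ord)
import qualified Data.Text.Lazy as TL

-- | Hypothetical helper: True for characters from a few blocks whose text
-- is normally written without spaces between words. The ranges are a rough
-- assumption for illustration, not a complete classification.
isUnspaced :: Char -> Bool
isUnspaced c = any inRange ranges
  where
    n = ord c
    inRange (lo, hi) = lo <= n && n <= hi
    ranges =
      [ (0x3000, 0x303F) -- CJK Symbols and Punctuation
      , (0x3040, 0x309F) -- Hiragana
      , (0x30A0, 0x30FF) -- Katakana
      , (0x4E00, 0x9FFF) -- CJK Unified Ideographs
      , (0xFF00, 0xFFEF) -- Halfwidth and Fullwidth Forms
      ]

-- | Separator to place between two adjacent fragments, based on the last
-- character of the left one and the first character of the right one.
separator :: TL.Text -> TL.Text -> TL.Text
separator l r = case (TL.unsnoc l, TL.uncons r) of
  (Just (_, a), Just (b, _))
    | isSpace a || isSpace b       -> "" -- whitespace is already present
    | isUnspaced a || isUnspaced b -> "" -- no space next to unspaced scripts
    | otherwise                    -> " "
  _ -> "" -- one side is empty, so there is nothing to separate

-- | Join fragments, choosing the separator at each boundary.
-- Empty fragments are dropped first so they do not hide a boundary.
foldTextAuto :: [TL.Text] -> TL.Text
foldTextAuto fragments = case filter (not . TL.null) fragments of
  [] -> ""
  (x : xs) -> TL.concat (x : go x xs)
  where
    go _ [] = []
    go prev (y : ys) = separator prev y : y : go y ys
```

With these definitions, foldTextAuto ["日本語の", "例です。"] yields "日本語の例です。", while foldTextAuto ["joined with", "a space"] yields "joined with a space". Mixed boundaries (for example, a Latin word followed by kana) are joined without a space, which is a simplification rather than a typographic rule.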