Mailing List Archive


Re: [tlug] Unicode/ICU question about joining lines



On Thu, 12 Aug 2021 at 13:22, Travis Cardwell
<travis.cardwell@example.com> wrote:
> On Thu, Aug 12, 2021 at 11:07 AM Jim Breen wrote:
> > > Given a string containing a paragraph of
> > > text with "soft" line breaks,
> > What exactly do you mean by a _"soft" line break_? Is it a specific
> > character?
>
> In document/markup languages, a soft line break is a line break in the
> source code that does not represent a line break in the actual content.
> The line breaks are the usual (`\n`), not a special character.

OK, we now have some context. It's text with embedded markup
sequences, which some downstream process is meant to interpret for
line-folding purposes. The sequence might be "\n", or it might be
"<br>" in the case of HTML. Nothing really to do with Unicode itself.

> For example, (La)TeX allows you to write a single paragraph by
> "wrapping" text across multiple lines (using "soft line breaks").  The
> soft line breaks in the source do not determine where lines are broken
> in the output.  The term "soft" is used to distinguish this type of line
> break from "hard" line breaks in the output (using `\\`, `\newline`, or
> `\hfill \break` for example).

Yes, LaTeX is a good example of such a downstream process.

> > > I want to output a string containing the
> > > text without line breaks.
> >
> > Output to what? Write it to a file (as in fprintf() in C), display it
> > on a screen, chisel it on stone, ...?
>
> I wrote that with a function in mind.  The input of the function is a
> string that may contain newlines, and the output of the function is a
> string that does not contain newlines.  Such a function could be used
> with input read from a file (or `STDIN`/API/database), and the output
> could be written to a file (or `STDOUT`/API/database).

Again, context is king.

> > > The way that lines are joined depends on the
> > > language.  Many languages such as English require spaces, while many
> > > languages such as Japanese do not use spaces.
> >
> > Don't you really mean "[t]he way that lines are *broken* when
> > displaying, printing, etc. depends ....."?
>
> I think that examples may best illustrate the motivation.  Consider the
> following English sentence, which is split into two lines using a soft
> line break:
>
>     This is an
>     English example.
>
> The input string is `This is an\nEnglish example.` (which could have a
> trailing line break, but that is unrelated to this problem).  The
> function should return `This is an English example.` in this case
> because English uses spaces to separate words.

Quite easy to do, but the function would need to be told which
sequence to handle (<br>, \n, etc.) and what to do with it (for
English, replace it with a space; for Japanese, drop it so the text
joins directly onto the preceding characters).
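
As a rough illustration of how little is involved once the caller
supplies those details, here is a minimal Python sketch. The names
`join_soft_breaks` and `JOINERS` are made up for this example, and the
per-language table is an assumption that the caller would have to
maintain; nothing here comes from ICU.

```python
# Hypothetical per-language table: what to put where a soft break was.
JOINERS = {"en": " ", "ja": ""}


def join_soft_breaks(text: str, lang: str, marker: str = "\n") -> str:
    """Join soft-wrapped lines using the joiner for the given language.

    `marker` is the soft-break sequence to handle ("\n", "<br>", ...).
    Whitespace around each break is trimmed and empty segments (for
    example from a trailing break) are dropped.
    """
    joiner = JOINERS[lang]
    parts = [part.strip() for part in text.split(marker)]
    return joiner.join(part for part in parts if part)


if __name__ == "__main__":
    print(join_soft_breaks("This is an\nEnglish example.", "en"))
    print(join_soft_breaks("これは\n日本語の例です。", "ja"))
```

The caller has to know the language (or at least the joiner) up front,
which is exactly the context the function cannot work out from the
break sequence alone.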
[...]
> ICU provides an API for breaking text, but I do not know of a good way
> to "join" lines of text like this.

Yes, it's nothing really to do with ICU. In fact, Unicode generally
tries to get as far away as possible from markup or text presentation
issues. It does have some ancillary information about the
line-breaking properties of characters to help downstream processes,
but that's about all.
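
For what it's worth, a downstream process can lean on character
properties for a rough guess when the language is not known. The
sketch below is only a heuristic, not anything Unicode or ICU
prescribes: Python's standard `unicodedata` module does not expose the
Line_Break property, so it uses East_Asian_Width as a stand-in, and
the function names are invented for illustration. It drops the break
when the characters on both sides are wide or fullwidth and inserts a
space otherwise. It will get Korean wrong, for instance, since Hangul
is wide but Korean separates words with spaces.

```python
import unicodedata


def is_wide(ch: str) -> bool:
    """True if the character's East_Asian_Width is Wide or Fullwidth."""
    return unicodedata.east_asian_width(ch) in ("W", "F")


def join_lines_heuristic(text: str) -> str:
    """Join lines, guessing the joiner from the characters at each break.

    Heuristic (an assumption for this sketch): no space between two
    wide characters, a single space in every other case.
    """
    lines = [line.strip() for line in text.split("\n")]
    lines = [line for line in lines if line]
    if not lines:
        return ""
    result = lines[0]
    for line in lines[1:]:
        if is_wide(result[-1]) and is_wide(line[0]):
            result += line
        else:
            result += " " + line
    return result


if __name__ == "__main__":
    print(join_lines_heuristic("This is an\nEnglish example."))
    print(join_lines_heuristic("これは\n日本語の例です。"))
```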

Cheers

Jim

-- 
Jim Breen
Adjunct Snr Research Fellow, Japanese Studies Centre, Monash University
http://www.jimbreen.org/
http://nihongo.monash.edu/

