TLUG Mailing List

This has useful info about line breaking in Unicode: http://www.unicode.org/reports/tr14/#Properties

Most notably, a newline is *not* a soft break in Unicode: the line breaking algorithm treats it as a mandatory break. So you may need to elide or replace the newlines in your source text. Doing this properly is necessarily language-aware, but a simple heuristic will cover many cases: if the code point on either side of a newline is Latin or Cyrillic, turn the newline into a space; otherwise elide it.

Then you can apply the Unicode line breaking algorithm with likely good results.
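
A minimal sketch of that heuristic in Python, using only the standard library. The names `is_latin_or_cyrillic` and `join_soft_breaks` are made up for illustration, and detecting the script from the Unicode character name is my assumption about a simple way to do it, not something ICU or TR14 prescribes:

```python
import re
import unicodedata


def is_latin_or_cyrillic(ch: str) -> bool:
    """Rough script test (assumption): a letter whose Unicode name
    starts with LATIN or CYRILLIC. Empty strings return False."""
    if not ch.isalpha():
        return False
    return unicodedata.name(ch, "").startswith(("LATIN ", "CYRILLIC "))


def join_soft_breaks(text: str) -> str:
    """Replace each newline with a space if a Latin or Cyrillic letter
    is on either side of it; otherwise drop the newline."""
    text = text.replace("\r\n", "\n")  # normalize Windows line endings

    def replace_newline(match: re.Match) -> str:
        before = text[match.start() - 1] if match.start() > 0 else ""
        after = text[match.end()] if match.end() < len(text) else ""
        if is_latin_or_cyrillic(before) or is_latin_or_cyrillic(after):
            return " "
        return ""

    return re.sub(r"\n", replace_newline, text)


print(join_soft_breaks("This is an\nEnglish example."))
print(join_soft_breaks("これは日本語の\n例です。"))
```

This only looks at the single code point on each side, so a line that ends in punctuation and is followed by a line that starts with punctuation gets joined without a space even in English. It is a starting point, not a complete solution.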

On 12 Aug 2021 12:22, Travis Cardwell <travis.cardwell@example.com> wrote:

On Thu, Aug 12, 2021 at 11:07 AM Jim Breen wrote:
> > Given a string containing a paragraph of
> > text with "soft" line breaks,
>
> What exactly do you mean by a _"soft" line break_? Is it a specific
> character?

In document/markup languages, a soft line break is a line break in the
source code that does not represent a line break in the actual content.
The line breaks are the usual newline character (`\n`), not a special
character.

For example, (La)TeX allows you to write a single paragraph by
"wrapping" text across multiple lines (using "soft line breaks"). The
soft line breaks in the source do not determine where lines are broken
in the output. The term "soft" is used to distinguish this type of line
break from "hard" line breaks in the output (using `\\`, `\newline`, or
`\hfill \break` for example).

> > I want to output a string containing the
> > text without line breaks.
>
> Output to what? Write it to a file (as in fprintf() in C), display it
> on a screen, chisel it on stone, ...?

I wrote that with a function in mind. The input of the function is a
string that may contain newlines, and the output of the function is a
string that does not contain newlines. Such a function could be used
with input read from a file (or `STDIN`/API/database), and the output
could be written to a file (or `STDOUT`/API/database).

> > The way that lines are joined depends on the
> > language. Many languages such as English require spaces, while many
> > languages such as Japanese do not use spaces.
>
> Don't you really mean "[t]he way that lines are *broken* when
> displaying, printing, etc. depends ....."?

I think that examples may best illustrate the motivation. Consider the
following English sentence, which is split into two lines using a soft
line break:

    This is an
    English example.

The input string is `This is an\nEnglish example.` (which could have a
trailing line break, but that is unrelated to this problem). The
function should return `This is an English example.` in this case
because English uses spaces to separate words.

Here is a Japanese sentence, which is split into two lines using a soft
line break:

    これは日本語の
    例です。

The input string is `これは日本語の\n例です。`, and the function should
return `これは日本語の例です。` in this case because Japanese does not
use spaces to separate words.
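
To make the problem concrete, here is a small Python illustration of why joining with one fixed separator is not enough. The helper `join_with` is hypothetical and only serves to show the four combinations:

```python
def join_with(text: str, separator: str) -> str:
    """Join the lines of text using one fixed separator (hypothetical helper)."""
    return separator.join(text.splitlines())


english = "This is an\nEnglish example."
japanese = "これは日本語の\n例です。"

print(join_with(english, " "))   # This is an English example.
print(join_with(english, ""))    # This is anEnglish example.  (wrong)
print(join_with(japanese, ""))   # これは日本語の例です。
print(join_with(japanese, " "))  # これは日本語の 例です。  (wrong)
```

Each separator is correct for one language and wrong for the other, so the function has to decide per line break.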

ICU provides an API for breaking text, but I do not know of a good way
to "join" lines of text like this.

> Sorry if this is being difficult or pedantic, but I can't get my head
> around the question itself.

No problem at all! In my attempt to keep my question concise, I was not
very clear. Sorry about that!

Cheers,

Travis

Re: [tlug] Unicode/ICU question about joining lines