Re: UTF-8 [was: Re: tlug: A couple of questions about Unicode]

To: tlug@example.com
Subject: Re: UTF-8 [was: Re: tlug: A couple of questions about Unicode]
From: Gaspar Sinai <gsinai@example.com>
Date: Mon, 19 Jan 1998 20:45:39 +0000 (Local time zone must be set--see zic manual page`)
cc: gaspar.sinai@example.com
Content-Type: TEXT/PLAIN; charset=US-ASCII
In-Reply-To: <m0xuCEX-00012bC@example.com>
Reply-To: tlug@example.com
Sender: owner-tlug@example.com

Hi All,

I feel that UTF8 is the best transformation format ever invented. And the
reason why I think Linux only gains with it is that it is able to encode
the whole UCS4 space, unlike NT (UCS2). If I want to use Egyptian
scripts I cannot use NT unless I use my own proprietary format in the
user defined space, which is not portable. Please read on.

On Mon, 19 Jan 1998, Stephen J. Turnbull wrote:

> >>>>> "Gaspar" == Gaspar Sinai <gsinai@example.com> writes:
> 
>     Gaspar> Hi,
>     Gaspar> I feel compelled to contribute to this thread. So here are
>     Gaspar> my thoughts:
> 
>     Gaspar>   I think linux only gains if it uses utf8 instead of ucs2.
> 
> I don't see this.  UTF-8 works like this, as I recall.  First of all,
> it's modal (which is bad in itself, but not terrible).  

I don't feel bad about modality. After all, UCS2 itself is modal in some
places (even though they don't advertise this fact...).

> In the start  state,
> 
> if (0x80 & byte) == 0x00, it's a single-byte character to be
>                           interpreted as GL of ISO-8859-1 (= US-ASCII?)
> else it's multibyte and
>   if (0xC0 & byte) == 0x80, it's a two-byte character with Unicode
>                           value == 256 * (0x3F & byte) + next-byte + 128
>   else it's two or more bytes
> 
> and it continues from there using the top bits to identify the length
> of a multi-byte sequence.  (What I meant by "modal" is that picking up
> a byte stream at an arbitrary place, trailing bytes in the range
> 0x00-0x7F can't be distinguished from ASCII unless you backtrack 8 (?
> or so) bytes, the longest multibyte sequence, or to the previous
> multibyte leader byte.)  Now, at best this can encode 256*64 + 256, or
> somewhat over 16K characters.  If I remember correctly, none of these
> are kanji or Devanagari (I could be wrong).  Definitely none of them
> are private space.

ASCII goes through intact, as you said: if (ucs4 < 0x80) the character
is copied.

But if (0xc0 & byte) == 0xc0 it is the beginning of a multibyte sequence.
If (0xc0 & byte) == 0x80 this is the middle of a multibyte sequence.
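
The two bit tests above can be sketched in a few lines of Python. This is
only an illustration: the function name classify is made up here, and it
does not reject octets such as 0xFE or 0xFF that never appear in UTF8.

```python
def classify(byte):
    """Classify one UTF-8 octet by its top bits (hypothetical helper)."""
    if (byte & 0x80) == 0x00:
        return "ascii"          # 0xxxxxxx: copied as is
    if (byte & 0xC0) == 0x80:
        return "continuation"   # 10xxxxxx: middle of a sequence
    return "lead"               # 11xxxxxx: beginning of a sequence


# "A" is the single octet 41; U+00E9 (e with acute) is C3 A9.
print([classify(b) for b in "A\u00e9".encode("utf-8")])
```

Because the three classes do not overlap, any single octet tells you
whether you are at the start of a character or inside one.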

Here is our UCS4 space:
======================
Octet           Format          No. of          Maximum
usage           (binary)        free bits       UCS-4 value

1st of 1        0xxxxxxx        7               0000 007F
1st of 2        110xxxxx        5               0000 07FF
1st of 3        1110xxxx        4               0000 FFFF // Most JP chars
1st of 4        11110xxx        3               001F FFFF
1st of 5        111110xx        2               03FF FFFF
1st of 6        1111110x        1               7FFF FFFF

then

continuing )    10xxxxxx        6
2nd .. 6th )
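
The table can be turned into an encoder more or less row by row. The
sketch below is an assumption about how one might write it, not code from
any particular library; ROWS and encode_ucs4 are names invented for this
example. Python's own codec stops at 10FFFF, so only the shorter forms
can be cross-checked against it.

```python
# One entry per table row: (maximum UCS-4 value, octet count, lead marker bits).
ROWS = [
    (0x7F,       1, 0x00),
    (0x7FF,      2, 0xC0),
    (0xFFFF,     3, 0xE0),
    (0x1FFFFF,   4, 0xF0),
    (0x3FFFFFF,  5, 0xF8),
    (0x7FFFFFFF, 6, 0xFC),
]


def encode_ucs4(value):
    """Encode one UCS-4 value using the 1 to 6 octet scheme in the table."""
    if value < 0 or value > 0x7FFFFFFF:
        raise ValueError("outside the 31-bit UCS4 space")
    for limit, length, marker in ROWS:
        if value <= limit:
            break
    octets = []
    # Peel off 6 bits at a time for the continuation octets (10xxxxxx).
    for _ in range(length - 1):
        octets.append(0x80 | (value & 0x3F))
        value >>= 6
    # Whatever is left fits in the free bits of the lead octet.
    octets.append(marker | value)
    return bytes(reversed(octets))


# Hiragana "a" (U+3042) needs the 3-octet form.
print(encode_ucs4(0x3042).hex())
```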

As you can see it can encode the whole UCS4 space, and with recovery: if
you jump into the middle of a string you can always determine where the
current character started.
The penalty is not high - UCS2 values from 0080 to 07FF take 2 bytes, the
rest of UCS2 (0800 to FFFF) takes 3 bytes.
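
Recovery can be sketched like this: step backwards while the octet looks
like 10xxxxxx. The helper name char_start is made up for the example, and
the input is assumed to be well-formed UTF8.

```python
def char_start(data, pos):
    """Return the index of the first octet of the character covering data[pos]."""
    # Continuation octets look like 10xxxxxx, so back up past them.
    while pos > 0 and (data[pos] & 0xC0) == 0x80:
        pos -= 1
    return pos


# Three kanji (U+65E5 U+672C U+8A9E), 3 octets each, 9 octets in total.
text = "\u65e5\u672c\u8a9e".encode("utf-8")
print([char_start(text, i) for i in range(len(text))])
```

For a 6-octet sequence the loop steps back at most 5 octets.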

> That means that in UTF-8 the majority of human beings on the planet
> require 3 bytes or more to write the vast majority of their text. 

I have just received a JIS encoded email from one of my providers
(not paying moans...) in full Japanese. Here are the results of the
conversions.

With email header:
Encoding  Size
========  ====
JIS       3178
EUC       2836
UTF7      3432
UTF8      3519
MSoft TXT 4444

Without email header - full Japanese:
Encoding  Size
========  ====
JIS       2311
EUC       1969
UTF7      2562
UTF8      2652
MSoft TXT 2678

* MSoft TXT is basically the dump of the UCS2 buffer.
As you can see, for this message UTF8 comes out about the same size as
the 2-bytes-per-character UCS2 dump (slightly less, because newlines etc.
take only 1 byte).
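
A comparison of this kind can be repeated with Python's standard codecs.
This is a sketch under assumptions: SAMPLE is a made-up three-kanji line,
not the email measured above, and the codec names are Python's, with
utf_16_le (no byte order mark) standing in for the raw UCS2 dump.

```python
# Hypothetical sample text: three kanji and a newline, not the measured email.
SAMPLE = "\u65e5\u672c\u8a9e\n"

# Table label -> Python codec name (the mapping is an assumption).
CODECS = [
    ("JIS", "iso2022_jp"),
    ("EUC", "euc_jp"),
    ("UTF7", "utf_7"),
    ("UTF8", "utf_8"),
    ("UCS2", "utf_16_le"),
]


def encoded_sizes(text):
    """Return the encoded size in bytes of text for each codec in CODECS."""
    return {label: len(text.encode(codec)) for label, codec in CODECS}


for label, size in encoded_sizes(SAMPLE).items():
    print(f"{label:<8} {size}")
```

On a real message the ratio between the rows depends on how much of the
text is ASCII (spaces, newlines, digits), since those stay at 1 byte in
EUC and UTF8 but take 2 bytes in UCS2.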

> I think that in fact UTF-8 fixes the modality partially by requiring
> that trailing bytes be in the range 0x00-0x7F (this guarantees at most
> one corrupt character per error as you scan forward in the stream,
> although you don't know whether error results in one-for-one
> substitution---if a 2-byte leading byte gets dropped, the trailer
> becomes ASCII, or many-for-one substitution, or one-for-many, if an
> ASCII byte is corrupted to a leading byte), but that reduces the
> number of code points expressible in 2 bytes by nearly 1/2.

I hope you don't receive your ELF binaries the way you expect to receive
Unicode messages :).

> That's an oops in my opinion, one which is going to make people like
> Ohta ("Now, Japanese is in Danger") even less happy than Unicode
> itself.

I am personally happy that Japan received something that will make it
possible to eliminate a lot of confusion.

To Craig:
========
I know I owe you something about utf8. I can get to it when I am back
from New York - next week.

Cheers,
gaspar

PS:
===
Those who want to test Netscape Communicator's utf8 encoding please jump
to my Hungarian Grammar pages in utf8:

    http://www2.gol.com/users/gsinai/Hungarian/ 

---------------------------------------------------------------
Next Saturday Meeting: 14 February 1998 12:30 Tokyo Station
Yaesu Chuo ticket gate.
---------------------------------------------------------------
a word from the sponsor:
TWICS - Japan's First Public-Access Internet System
www.twics.com  info@example.com  Tel:03-3351-5977  Fax:03-3353-6096
