TLUG Mailing List

Mailing List Archive
UTF-8 [was: Re: tlug: A couple of questions about Unicode]

To: tlug@example.com

Subject: UTF-8 [was: Re: tlug: A couple of questions about Unicode]

From: "Stephen J. Turnbull" <turnbull@example.com>

Date: Mon, 19 Jan 1998 17:07:01 +0900 (JST)

Content-Transfer-Encoding: 7bit

Content-Type: text/plain; charset=us-ascii

In-Reply-To: <Pine.LNX.3.95.980112000142.2645C-100000@example.com>

References: <199801091717.CAA03920@example.com><Pine.LNX.3.95.980112000142.2645C-100000@example.com>

Reply-To: tlug@example.com

Sender: owner-tlug@example.com
>>>>> "Gaspar" == Gaspar Sinai <gsinai@example.com> writes:

    Gaspar> Hi,
    Gaspar> I feel compelled to contribute to this thread. So here are
    Gaspar> my thoughts:

    Gaspar>   I think linux only gains if it uses utf8 instead of ucs2.

I don't see this.  UTF-8 works like this, as I recall.  First of all,
it's modal (which is bad in itself, but not terrible).  In the start
state,

if (0x80 & byte) == 0x00, it's a single-byte character to be
                          interpreted as GL of ISO-8859-1 (= US-ASCII?)
else it's multibyte and
  if (0xC0 & byte) == 0x80, it's a two-byte character with Unicode
                          value == 256 * (0x3F & byte) + next-byte + 128
  else it's two or more bytes

and it continues from there using the top bits to identify the length
of a multi-byte sequence.  (What I mean by "modal" is that if you pick
up a byte stream at an arbitrary place, trailing bytes in the range
0x00-0x7F can't be distinguished from ASCII unless you backtrack 8 (?
or so) bytes, the length of the longest multibyte sequence, or to the
previous multibyte leader byte.)  Now, at best this can encode
256*64 + 256, or
somewhat over 16K characters.  If I remember correctly, none of these
are kanji or Devanagari (I could be wrong).  Definitely none of them
are private space.
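
For comparison with the recollection above, here is a minimal Python
sketch of the bit layout given in RFC 2279 (the UTF-8 definition
published in January 1998), limited to sequences of one to three bytes.
It is an illustration, not a validating decoder: the name decode_one is
made up for this example, and overlong forms are not rejected.  In that
layout a two-byte sequence carries 11 payload bits and covers U+0080
through U+07FF (1920 code points), so kanji and Devanagari do land in
the three-byte range.

```python
from typing import Tuple


def decode_one(data: bytes, i: int = 0) -> Tuple[int, int]:
    """Decode one UTF-8 sequence (1 to 3 bytes) starting at data[i].

    Returns (code_point, sequence_length). Hypothetical helper for
    illustration; longer sequences and overlong checks are omitted.
    """
    lead = data[i]
    if (lead & 0x80) == 0x00:        # 0xxxxxxx: plain US-ASCII
        return lead, 1
    if (lead & 0xE0) == 0xC0:        # 110xxxxx 10xxxxxx
        length, value = 2, lead & 0x1F
    elif (lead & 0xF0) == 0xE0:      # 1110xxxx 10xxxxxx 10xxxxxx
        length, value = 3, lead & 0x0F
    else:
        raise ValueError("trailing byte or longer sequence; not handled here")
    if i + length > len(data):
        raise ValueError("truncated sequence")
    for b in data[i + 1:i + length]:
        if (b & 0xC0) != 0x80:       # trailing bytes must be 10xxxxxx
            raise ValueError("trailing byte outside 0x80-0xBF")
        value = (value << 6) | (b & 0x3F)
    return value, length


# Number of code points that fit in exactly two bytes: U+0080..U+07FF.
TWO_BYTE_CODE_POINTS = 0x07FF - 0x0080 + 1

# Size of a short Japanese sample (three kanji) in each encoding.
# For these characters UCS-2 and big-endian UTF-16 are byte-identical.
sample = "\u65e5\u672c\u8a9e"
utf8_size = len(sample.encode("utf-8"))
ucs2_size = len(sample.encode("utf-16-be"))
```

Under these assumptions the sample costs three bytes per character in
UTF-8 against two in UCS-2, which is the size penalty at issue here.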

That means that in UTF-8 the majority of human beings on the planet
require 3 bytes or more to write the vast majority of their text.  I
think that in fact UTF-8 fixes the modality partially by requiring
that trailing bytes be in the range 0x00-0x7F.  This guarantees at most
one corrupt character per error as you scan forward in the stream,
although you don't know what kind of substitution the error produces:
one-for-one (if a 2-byte leading byte gets dropped, the trailer
becomes ASCII), many-for-one, or one-for-many (if an ASCII byte is
corrupted to a leading byte).  But that restriction reduces the
number of code points expressible in 2 bytes by nearly 1/2.
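
On the question of picking up a stream at an arbitrary place: in UTF-8
as specified in RFC 2279, trailing bytes occupy 0x80-0xBF (top bits 10)
rather than 0x00-0x7F, so they cannot be mistaken for ASCII or for a
leading byte.  A hedged sketch of resynchronization under that
definition follows; is_trailing and sequence_start are hypothetical
helper names, and the input is assumed to be well-formed.  Since the
longest sequence in that RFC is six bytes, the loop steps back at most
five times on valid data.

```python
def is_trailing(byte: int) -> bool:
    """True if the byte has the form 10xxxxxx (a UTF-8 trailing byte)."""
    return (byte & 0xC0) == 0x80


def sequence_start(data: bytes, pos: int) -> int:
    """Return the index of the first byte of the sequence containing data[pos].

    Assumes well-formed input; walks backwards over trailing bytes until
    it reaches an ASCII byte or a leading byte.
    """
    start = pos
    while start > 0 and is_trailing(data[start]):
        start -= 1
    return start


# "a", one three-byte kanji (U+6F22), then "b": bytes 61 E6 BC A2 62.
stream = "a\u6f22b".encode("utf-8")
starts = [sequence_start(stream, p) for p in range(len(stream))]
```

Here every position inside the kanji maps back to index 1, its leading
byte, while the two ASCII bytes map to themselves.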

That's an oops in my opinion, one which is going to make people like
Ohta ("Now, Japanese is in Danger") even less happy than Unicode
itself.
Follow-Ups:

Re: UTF-8 [was: Re: tlug: A couple of questions about Unicode]
From: Gaspar Sinai <gsinai@example.com>

References:

tlug: A couple of questions about Unicode
From: "Jonathan Byrne" <jbyrne@example.com>

Re: tlug: A couple of questions about Unicode
From: Gaspar Sinai <gsinai@example.com>
