UTF-8 [was: Re: tlug: A couple of questions about Unicode]



>>>>> "Gaspar" == Gaspar Sinai <gsinai@example.com> writes:

    Gaspar> Hi,
    Gaspar> I feel compelled to contribute to this thread. So here are
    Gaspar> my thoughts:

    Gaspar>   I think linux only gains if it uses utf8 instead of ucs2.

I don't see this.  UTF-8 works like this, as I recall.  First of all,
it's modal (which is bad in itself, but not terrible).  In the start
state,

if (0x80 & byte) == 0x00, it's a single-byte character to be
                          interpreted as GL of ISO-8859-1 (= US-ASCII?)
else it's multibyte and
  if (0xC0 & byte) == 0x80, it's a two-byte character with Unicode
                          value == 256 * (0x3F & byte) + next-byte + 128
  else it's two or more bytes

and it continues from there, using the top bits to identify the length
of a multibyte sequence.  (What I meant by "modal" is this: if you
pick up a byte stream at an arbitrary place, trailing bytes in the
range 0x00-0x7F can't be distinguished from ASCII unless you backtrack
by the length of the longest multibyte sequence, 8 bytes or so, or to
the previous multibyte leader byte.)  Now, at best this can encode
256*64 + 256 = 16640, or somewhat over 16K, characters.  If I remember
correctly, none of these are kanji or Devanagari (I could be wrong).
Definitely none of them are in the private use area.
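
The description above is from memory, so it is worth checking against
the published definition.  Below is a minimal sketch of byte
classification, assuming the bit patterns given in RFC 2279 (January
1998).  The names classify() and decode2() are made up for
illustration.  Under those patterns trailing bytes fall in 0x80-0xBF
rather than 0x00-0x7F, and a two-byte sequence tops out at U+07FF, so
the figures recalled above may not hold.

```python
# Sketch only: bit patterns as given in RFC 2279; classify() and
# decode2() are illustrative names, not part of any library.

def classify(byte: int) -> str:
    """Say what role a single byte plays in a UTF-8 stream."""
    if (byte & 0x80) == 0x00:
        return "ascii"    # 0xxxxxxx: one-byte character
    if (byte & 0xC0) == 0x80:
        return "trail"    # 10xxxxxx: continuation byte
    if (byte & 0xE0) == 0xC0:
        return "lead2"    # 110xxxxx: starts a two-byte sequence
    if (byte & 0xF0) == 0xE0:
        return "lead3"    # 1110xxxx: starts a three-byte sequence
    if byte >= 0xFE:
        return "invalid"  # 0xFE and 0xFF never appear in UTF-8
    return "lead4+"       # 11110xxx and the longer RFC 2279 forms


def decode2(lead: int, trail: int) -> int:
    """Combine a two-byte sequence into its code point."""
    return ((lead & 0x1F) << 6) | (trail & 0x3F)


if __name__ == "__main__":
    for b in (0x41, 0xA9, 0xC3, 0xE6):
        print(hex(b), classify(b))
    print(hex(decode2(0xC3, 0xA9)))
```

Since every byte announces its own role, a reader dropped into the
middle of a stream only has to skip "trail" bytes until it reaches an
"ascii" or "lead" byte.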

That means that in UTF-8 the majority of human beings on the planet
require 3 bytes or more to write the vast majority of their text.  I
think that in fact UTF-8 fixes the modality partially by requiring
that trailing bytes be in the range 0x00-0x7F.  This guarantees at
most one corrupt character per error as you scan forward in the
stream, although you don't know what kind of substitution the error
produces: one-for-one (if a 2-byte leading byte gets dropped, the
trailer becomes ASCII), many-for-one, or one-for-many (if an ASCII
byte is corrupted to a leading byte).  But that restriction reduces
the number of code points expressible in 2 bytes by nearly 1/2.
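
The bytes-per-character claim at the start of the previous paragraph
is easy to check.  Here is a small sketch using Python's built-in
codecs.  The sample characters and the helper names utf8_len() and
ucs2_len() are my own choices for illustration, and "utf-16-be"
stands in for UCS-2, which is the same thing for characters in the
Basic Multilingual Plane.

```python
# Sketch only: compares encoded sizes for a few sample characters.
# utf8_len() and ucs2_len() are illustrative helper names.

def utf8_len(ch: str) -> int:
    """Number of bytes the character takes in UTF-8."""
    return len(ch.encode("utf-8"))


def ucs2_len(ch: str) -> int:
    """Number of bytes in big-endian UTF-16 (same as UCS-2 in the BMP)."""
    return len(ch.encode("utf-16-be"))


SAMPLES = {
    "Latin small a": "\u0061",
    "Latin small e with acute": "\u00e9",
    "Devanagari letter ka": "\u0915",
    "CJK ideograph U+6F22": "\u6f22",
}

if __name__ == "__main__":
    for name, ch in SAMPLES.items():
        print(f"{name}: UTF-8 {utf8_len(ch)} bytes, UCS-2 {ucs2_len(ch)} bytes")
```

The Devanagari and kanji samples come out at 3 bytes in UTF-8 against
2 in UCS-2, while plain ASCII goes the other way at 1 against 2.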

That's an oops in my opinion, one which is going to make people like
Ohta ("Now, Japanese is in Danger") even less happy than Unicode
itself.
---------------------------------------------------------------
Next Saturday Meeting: 14 February 1998 12:30 Tokyo Station
Yaesu Chuo ticket gate.
---------------------------------------------------------------