UTF-8 [was: Re: tlug: A couple of questions about Unicode]
- To: tlug@example.com
- Subject: UTF-8 [was: Re: tlug: A couple of questions about Unicode]
- From: "Stephen J. Turnbull" <turnbull@example.com>
- Date: Mon, 19 Jan 1998 17:07:01 +0900 (JST)
- Content-Transfer-Encoding: 7bit
- Content-Type: text/plain; charset=us-ascii
- In-Reply-To: <Pine.LNX.3.95.980112000142.2645C-100000@example.com>
- References: <199801091717.CAA03920@example.com><Pine.LNX.3.95.980112000142.2645C-100000@example.com>
- Reply-To: tlug@example.com
- Sender: owner-tlug@example.com
>>>>> "Gaspar" == Gaspar Sinai <gsinai@example.com> writes:

    Gaspar> Hi,
    Gaspar> I feel compelled to contribute to this thread. So here are
    Gaspar> my thoughts:

    Gaspar> I think linux only gains if it uses utf8 instead of ucs2.

I don't see this.

UTF-8 works like this, as I recall. First of all, it's modal (which is bad in itself, but not terrible). In the start state,

    if (0x80 & byte) == 0x00,
        it's a single-byte character to be interpreted as GL of
        ISO-8859-1 (= US-ASCII?)
    else it's multibyte and
        if (0xC0 & byte) == 0x80,
            it's a two-byte character with
            Unicode value == 256 * (0x3F & byte) + next-byte + 128
        else it's three or more bytes

and it continues from there, using the top bits to identify the length of a multi-byte sequence.

(What I meant by "modal" is that, picking up a byte stream at an arbitrary place, trailing bytes in the range 0x00-0x7F can't be distinguished from ASCII unless you backtrack 8 (? or so) bytes, the longest multibyte sequence, or to the previous multibyte leader byte.)

Now, at best this can encode 256*64 + 256, or somewhat over 16K characters. If I remember correctly, none of these are kanji or Devanagari (I could be wrong). Definitely none of them are private space. That means that in UTF-8 the majority of human beings on the planet require 3 bytes or more to write the vast majority of their text.

I think that in fact UTF-8 fixes the modality partially by requiring that trailing bytes be in the range 0x00-0x7F. This guarantees at most one corrupt character per error as you scan forward in the stream, although you don't know whether an error results in a one-for-one substitution (if a 2-byte leading byte gets dropped, the trailer becomes ASCII), a many-for-one substitution, or a one-for-many substitution (if an ASCII byte is corrupted to a leading byte). But that reduces the number of code points expressible in 2 bytes by nearly 1/2.
That's an oops in my opinion, one which is going to make people like Ohta ("Now, Japanese is in Danger") even less happy than Unicode itself.
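
As a cross-check on the recollection above, here is a minimal Python sketch of the UTF-8 byte layout as specified in RFC 3629. (The RFC 2279 definition of January 1998 uses the same layout for these ranges, but also permitted 5- and 6-byte forms.) In that layout the details differ from the description recalled in this post: continuation bytes always fall in 0x80-0xBF and so never overlap ASCII, lead bytes have their top two bits set, and two bytes cover U+0080 through U+07FF. The names utf8_encode, byte_kind, and resync are hypothetical helpers introduced only for illustration; they are not part of any library, and byte_kind does no validation.

```python
def utf8_encode(cp: int) -> bytes:
    """Encode one Unicode scalar value by hand, following the RFC 3629 layout."""
    if cp < 0 or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError("not a Unicode scalar value")
    if cp < 0x80:        # 0xxxxxxx
        return bytes([cp])
    if cp < 0x800:       # 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp < 0x10000:     # 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    # 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0xF0 | (cp >> 18),
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])


def byte_kind(b: int) -> str:
    """Classify a single byte by its top bits only (no validity checks)."""
    if b & 0x80 == 0x00:
        return "ascii"
    if b & 0xC0 == 0x80:
        return "continuation"
    return "lead"


def resync(data: bytes, pos: int) -> int:
    """Return the index of the first character boundary at or after pos."""
    while pos < len(data) and byte_kind(data[pos]) == "continuation":
        pos += 1
    return pos


# Number of code points that fit in exactly two bytes: U+0080..U+07FF.
TWO_BYTE_CAPACITY = 0x800 - 0x80

if __name__ == "__main__":
    kanji = utf8_encode(0x6F22)        # U+6F22, a kanji
    devanagari = utf8_encode(0x0915)   # U+0915, DEVANAGARI LETTER KA
    print(len(kanji), len(devanagari), TWO_BYTE_CAPACITY)

    # Picking up a stream mid-character: skip continuation bytes to resync.
    stream = "a\u00e9\u6f22b".encode("utf-8")
    print(resync(stream, 4))
```

Under this layout both the kanji and the Devanagari letter come out at three bytes, which agrees with the post's point that much of the world's text needs three bytes per character in UTF-8. The two-byte range, on the other hand, holds 1920 code points rather than 16K, and resynchronising from an arbitrary offset only requires skipping forward past at most three continuation bytes.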
- Follow-Ups:
- Re: UTF-8 [was: Re: tlug: A couple of questions about Unicode]
- From: Gaspar Sinai <gsinai@example.com>
- References:
- tlug: A couple of questions about Unicode
- From: "Jonathan Byrne" <jbyrne@example.com>
- Re: tlug: A couple of questions about Unicode
- From: Gaspar Sinai <gsinai@example.com>