Re: UTF-8 [was: Re: tlug: A couple of questions about Unicode]
- To: tlug@example.com
- Subject: Re: UTF-8 [was: Re: tlug: A couple of questions about Unicode]
- From: Gaspar Sinai <gsinai@example.com>
- Date: Mon, 19 Jan 1998 20:45:39 +0000 (Local time zone must be set--see zic manual page`)
- cc: gaspar.sinai@example.com
- Content-Type: TEXT/PLAIN; charset=US-ASCII
- In-Reply-To: <m0xuCEX-00012bC@example.com>
- Reply-To: tlug@example.com
- Sender: owner-tlug@example.com
Hi All,

I feel that UTF8 is the best transformation format ever invented.
And the reason why I think Linux only gains with it is that it is
able to encode the whole UCS4 space, unlike NT (UCS2). If I want to
use Egyptian scripts I cannot use NT unless I use my own proprietary
format in the user-defined space, which is not portable.

Please read on.

On Mon, 19 Jan 1998, Stephen J. Turnbull wrote:

> >>>>> "Gaspar" == Gaspar Sinai <gsinai@example.com> writes:
>
>     Gaspar> Hi,
>     Gaspar> I feel compelled to contribute to this thread. So here are
>     Gaspar> my thoughts:
>
>     Gaspar> I think linux only gains if it uses utf8 instead of ucs2.
>
> I don't see this. UTF-8 works like this, as I recall. First of all,
> it's modal (which is bad in itself, but not terrible).

I don't feel bad about modality. After all UCS2 itself is modal
at some places (even though they don't advertise this fact...).

> In the start state,
>
>     if (0x80 & byte) == 0x00, it's a single-byte character to be
>         interpreted as GL of ISO-8859-1 (= US-ASCII?)
>     else it's multibyte and
>         if (0xC0 & byte) == 0x80, it's a two-byte character with Unicode
>             value == 256 * (0x3F & byte) + next-byte + 128
>         else it's two or more bytes
>
> and it continues from there using the top bits to identify the length
> of a multi-byte sequence. (What I meant by "modal" is that picking up
> a byte stream at an arbitrary place, trailing bytes in the range
> 0x00-0x7F can't be distinguished from ASCII unless you backtrack 8 (?
> or so) bytes, the longest multibyte sequence, or to the previous
> multibyte leader byte.) Now, at best this can encode 256*64 + 256, or
> somewhat over 16K characters. If I remember correctly, none of these
> are kanji or Devanagari (I could be wrong). Definitely none of them
> are private space.

ASCII goes through intact, as you said: if (ucs4 < 0x80) the character
is copied as is. But in the encoded stream, if (0xc0 & byte) == 0xc0
the byte is the beginning of a multibyte sequence, and if
(0xc0 & byte) == 0x80 it is in the middle of one. Trailing bytes are
never in the range 0x00-0x7F, so they cannot be mistaken for ASCII.
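
The three byte classes above, and the way they let a reader find the
start of a character after landing at an arbitrary offset, can be
sketched roughly as follows. This is only an illustration in Python;
the names byte_kind and char_start are made up for this example and
do not come from any library.

```python
def byte_kind(b: int) -> str:
    """Classify one octet of a UTF-8 stream by its top bits."""
    if (b & 0x80) == 0x00:
        return "ascii"         # 0xxxxxxx: a complete one-octet character
    if (b & 0xC0) == 0x80:
        return "continuation"  # 10xxxxxx: middle of a multibyte sequence
    return "lead"              # 11xxxxxx: first octet of a multibyte sequence


def char_start(data: bytes, pos: int) -> int:
    """Back up from pos to the first octet of the character containing it."""
    while pos > 0 and byte_kind(data[pos]) == "continuation":
        pos -= 1
    return pos


# "a", e-acute (2 octets) and a kanji (3 octets), written as escapes.
sample = "a\u00e9\u65e5".encode("utf-8")
print([byte_kind(b) for b in sample])
print(char_start(sample, 5))  # offset 5 is the last octet of the kanji
```

For well-formed input the loop in char_start never has to step back
more than five octets, since the longest sequence is six octets long.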

Here is our UCS4 space:
======================

Octet          Format      No. of      Maximum
usage          (binary)    free bits   UCS-4 value

1st of 1       0xxxxxxx    7           0000 007F
1st of 2       110xxxxx    5           0000 07FF
1st of 3       1110xxxx    4           0000 FFFF   // Most JP chars
1st of 4       11110xxx    3           001F FFFF
1st of 5       111110xx    2           03FF FFFF
1st of 6       1111110x    1           7FFF FFFF
2nd .. 6th     10xxxxxx    6           (continuation octets)

As you can see, it can map the whole UCS4 space, and with recovery:
if you jump into the middle of a string you can always determine
where the current character started. The penalty is not high: UCS2
values from 0800 up take 3 bytes, and everything below that takes
1 or 2 bytes.

> That means that in UTF-8 the majority of human beings on the planet
> require 3 bytes or more to write the vast majority of their text.

I have just received a JIS-encoded email from one of my providers
(not paying moans...), in full Japanese. Here are the results of
the conversions.

With email header:

Encoding     Size
========     ====
JIS          3178
EUC          2836
UTF7         3432
UTF8         3519
MSoft TXT    4444

Without email header - full Japanese:

Encoding     Size
========     ====
JIS          2311
EUC          1969
UTF7         2562
UTF8         2652
MSoft TXT    2678

* MSoft TXT is basically a dump of the UCS2 buffer.

As you can see, the UTF8 version of this Japanese text is no bigger
than the UCS2 dump, so it averages out at about 2 bytes per character
(kana and kanji take 3 bytes each, but ASCII, newlines etc. take
only 1).

> I think that in fact UTF-8 fixes the modality partially by requiring
> that trailing bytes be in the range 0x00-0x7F (this guarantees at most
> one corrupt character per error as you scan forward in the stream,
> although you don't know whether error results in one-for-one
> substitution---if a 2-byte leading byte gets dropped, the trailer
> becomes ASCII, or many-for-one substitution, or one-for-many, if an
> ASCII byte is corrupted to a leading byte), but that reduces the
> number of code points expressible in 2 bytes by nearly 1/2.

I hope you don't receive your ELF binaries the way you expect
to receive Unicode messages :).
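
The octet table earlier in this message can be turned into an encoder
fairly directly. The following is a minimal Python sketch, assuming the
six-octet form of UTF-8 shown in that table; the names LIMITS,
LEAD_MARKS and encode_ucs4 are invented for this example. Current
UTF-8 as defined in RFC 3629 stops at four octets and U+10FFFF, so
the five- and six-octet branches are of historical interest only.

```python
# Largest UCS-4 value that fits in 1..6 octets (last column of the table).
LIMITS = [0x7F, 0x7FF, 0xFFFF, 0x1FFFFF, 0x3FFFFFF, 0x7FFFFFFF]
# Marker bits of the first octet for sequences of 2..6 octets.
LEAD_MARKS = [0xC0, 0xE0, 0xF0, 0xF8, 0xFC]


def encode_ucs4(value: int) -> bytes:
    """Encode one UCS-4 value using the octet layout from the table."""
    if not 0 <= value <= LIMITS[-1]:
        raise ValueError("outside the 31-bit UCS-4 space")
    if value <= LIMITS[0]:
        return bytes([value])  # ASCII is copied unchanged
    # Shortest sequence length whose limit covers the value.
    length = next(n for n, limit in enumerate(LIMITS, start=1) if value <= limit)
    tail = []
    for _ in range(length - 1):
        tail.append(0x80 | (value & 0x3F))  # 10xxxxxx, six payload bits each
        value >>= 6
    lead = LEAD_MARKS[length - 2] | value   # remaining bits go in the first octet
    return bytes([lead] + tail[::-1])


for code in (0x41, 0xE9, 0x65E5, 0x7FFFFFFF):
    print(f"{code:08X} -> {encode_ucs4(code).hex(' ')}")
```

For values up to U+10FFFF (surrogates aside) the result should match
what Python's own str.encode("utf-8") produces, which gives an easy
cross-check.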

> That's an oops in my opinion, one which is going to make people like
> Ohta ("Now, Japanese is in Danger") even less happy than Unicode
> itself.

I am personally happy that Japan received something that will make
it possible to eliminate a lot of confusion.

To Craig:
========
I know I owe you something about utf8. I can get to it when I am
back from New York, next week.

Cheers,
gaspar

PS:
===
Those who want to test Netscape Communicator's utf8 encoding, please
jump to my Hungarian Grammar pages in utf8:

http://www2.gol.com/users/gsinai/Hungarian/

---------------------------------------------------------------
Next Saturday Meeting: 14 February 1998 12:30
Tokyo Station Yaesu Chuo ticket gate.
---------------------------------------------------------------
a word from the sponsor:
TWICS - Japan's First Public-Access Internet System
www.twics.com  info@example.com  Tel:03-3351-5977  Fax:03-3353-6096
- Follow-Ups:
- Re: UTF-8 [was: Re: tlug: A couple of questions about Unicode]
- From: Craig Oda <craig@example.com>
- References:
- UTF-8 [was: Re: tlug: A couple of questions about Unicode]
- From: "Stephen J. Turnbull" <turnbull@example.com>