Re: tlug: A couple of questions about Unicode
- To: tlug@example.com
- Subject: Re: tlug: A couple of questions about Unicode
- From: "Stephen J. Turnbull" <turnbull@example.com>
- Date: Mon, 19 Jan 1998 13:31:48 +0900 (JST)
- Content-Transfer-Encoding: 7bit
- Content-Type: text/plain; charset=us-ascii
- In-Reply-To: <Pine.LNX.3.96.980115112345.186A-100000@example.com>
- References: <Pine.LNX.3.95.980112000142.2645C-100000@example.com><Pine.LNX.3.96.980115112345.186A-100000@example.com>
- Reply-To: tlug@example.com
- Sender: owner-tlug@example.com
>>>>> "David" == J David Beutel <jdb@example.com> writes:

    David> On Mon, 12 Jan 1998, Gaspar Sinai wrote:

    > o there is a codespace where two 16-bit characters are used to map a
    >   portion of the UCS4 space into UCS2.
    > o if you want to process some Indian or Arabic scripts you need to
    >   combine two 16-bit unicode character to form a single glyph.

Don't forget that various scripts, even Latin, have accents and so on
that can be used to compose characters, and that there are all those
Hangul jamo.  (The voicing marks used with half-width katakana are
certainly available, and I imagine it's possible to write French using
"ASCII" plus accents, though no one would want to.)  These are
optional from the point of view of composing or transliterating
documents for storage or transmission as Unicode without semantic
loss, but essential for round-tripping.

    David> I still have difficulty believing that such a nasty hack
    David> was done.  Can we pretend that it wasn't?

_You_ can _do_ anything you want.  You just can't say you're
supporting Unicode.  You can get rid of a lot of the functionality you
wish would go away by refusing to support UCS-2 level 2 or 3.

You can call it a "nasty hack" if that makes you feel better.  But I
find it very difficult to imagine that at this stage of linguistic
science we know enough to create a single Unicode that handles all
natural languages without some nasty hacks.

And remember, Unicode is not just for display, but also for input and
transmission, which have problems of their own.  For example, suppose
you implement a French input method for US keyboards with the internal
code being Unicode.  Then it makes sense for an <accented a> to be
represented in the input stream as <'a>.  It may make sense for that
accent to be a separate code point so that both <accented a> and "'a"
can be typed in two keystrokes.  I don't know, I'm not a specialist.

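To make the two mechanisms in this part of the thread concrete, here is
a minimal sketch.  Everything in it is an assumption for illustration
and not something from the original messages: the choice of Python 3,
its standard unicodedata module, and the helper names to_surrogates and
from_surrogates.  The first half shows the arithmetic UTF-16 uses to
express a code point beyond U+FFFF as two 16-bit units; the second half
shows how a base letter plus a combining mark relates to the
precomposed character under the standard normalization forms.

```python
import unicodedata
from typing import Tuple


def to_surrogates(code_point: int) -> Tuple[int, int]:
    """Split a code point above U+FFFF into a (high, low) surrogate pair."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only U+10000..U+10FFFF need a surrogate pair")
    offset = code_point - 0x10000          # 20 significant bits remain
    high = 0xD800 + (offset >> 10)         # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)        # bottom 10 bits
    return high, low


def from_surrogates(high: int, low: int) -> int:
    """Recombine a (high, low) surrogate pair into one code point."""
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


# Composition: 'e' followed by COMBINING ACUTE ACCENT (U+0301) is
# canonically equivalent to the precomposed U+00E9.
decomposed = "e\u0301"
composed = unicodedata.normalize("NFC", decomposed)

# Half-width katakana KA (U+FF76) plus the half-width voicing mark
# (U+FF9E) folds to full-width GA (U+30AC) under compatibility
# normalization (NFKC).
katakana_ga = unicodedata.normalize("NFKC", "\uff76\uff9e")

print([hex(unit) for unit in to_surrogates(0x10000)])
print(len(decomposed), len(composed))
print(unicodedata.name(katakana_ga))
```

Note that NFKC is lossy by design: after folding, the half-width
original cannot be recovered, which is exactly the round-tripping
concern raised above.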
I don't want to restrict what specialists can do, however, especially
in complicated scripts like Hangul and Devanagari.  Watching
Devanagari being input reminds me of ants at a picnic: the characters
go back and forth in the display, yet there is a definite sense of
forward progress.

Unicode V.1 and V.2 do very well at handling almost everything that is
useful to a systems programmer in a two-byte widechar format, and add
some escape clauses that make Unicode useful in 99% of the remaining
linguistic situations, at the expense of admitting multi-byte
representations.

    > o Unicode is not consistent to the rules it set to itself. You would
    >   expect that the wide ASCII characters would have the ASCII values just
    >   like wide Cyrillic or Greek but this is not the case. For some strange
    >   reason they kept the wide ASCII.

As David says, I believe this is due to the "source separation rule".
Source separation does not apply to Cyrillic and Greek because there's
only one copy of each in JIS X 0208.

    David> I agree that wide ASCII does not deserve its own
    David> encoding--it should be a font thing.  I don't know about
    David> wide Cyrillic nor Greek, but my understanding about the
    David> wide ASCII is that Unicode includes distinct encodings for
    David> all chars of all national character sets, including JIS X
    David> 0208, which has both normal and wide ASCII.  No chars
    David> within a single national char set were unified, so that
    David> encoding translations of any such set to Unicode and back
    David> will be identical.  I think it's a good rule.  Of course,
    David> Shift-JIS is not a national char set.

Be careful.  Shift-JIS is an encoding which contains code points for
several national character sets.  As far as I know, round-tripping to
Unicode and back to SJIS is possible.
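The source separation point can be checked with a short sketch.  It
assumes Python 3 and the "shift_jis" codec that ships with it, neither
of which comes from the original thread; that codec is only one
particular mapping table, and other vendors' Shift-JIS tables differ in
places.  The idea is that the one-byte "A" and the two-byte wide "A"
are distinct in Shift-JIS, so they get distinct Unicode code points and
survive a trip to Unicode and back, while Greek has a single source
copy and simply maps to the ordinary Greek block.

```python
import unicodedata

ascii_a = "A"             # U+0041, the one-byte letter
fullwidth_a = "\uff21"    # U+FF21, the wide copy from JIS X 0208
greek_alpha = "\u0391"    # U+0391, the only alpha JIS X 0208 has

for ch in (ascii_a, fullwidth_a, greek_alpha):
    print("U+%04X" % ord(ch), unicodedata.name(ch))

original = ascii_a + fullwidth_a + greek_alpha

# Unicode -> Shift-JIS: one byte for "A", two bytes each for the others.
sjis_bytes = original.encode("shift_jis")

# Shift-JIS -> Unicode: the two kinds of "A" stay distinct.
round_tripped = sjis_bytes.decode("shift_jis")

print(sjis_bytes)
print(round_tripped == original)
```

Had the wide letters been unified with ASCII, both kinds of "A" would
decode to U+0041 and the return trip could not tell which bytes to
emit.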
- References:
- Re: tlug: A couple of questions about Unicode
- From: Gaspar Sinai <gsinai@example.com>
- Re: tlug: A couple of questions about Unicode
- From: "J. David Beutel" <jdb@example.com>