Re: tlug: A couple of questions about Unicode
- To: tlug@example.com
- Subject: Re: tlug: A couple of questions about Unicode
- From: "Stephen J. Turnbull" <turnbull@example.com>
- Date: Mon, 12 Jan 1998 12:19:58 +0900 (JST)
- Content-Transfer-Encoding: 7bit
- Content-Type: text/plain; charset=us-ascii
- In-Reply-To: <Pine.LNX.3.96.980110054231.305A-100000@example.com>
- References: <199801091753.CAA04104@example.com><Pine.LNX.3.96.980110054231.305A-100000@example.com>
- Reply-To: tlug@example.com
- Sender: owner-tlug@example.com

>>>>> "David" == J David Beutel <jdb@example.com> writes:

    David> Wow!  Everything I've read so far has said that Unicode is

That's because you haven't read the Unicode standard or ISO-10646, and
probably 90% of the authors you've read haven't either, and the
remainder are keeping silent because it would just confuse you and
weaken the technical case for Unicode.

    David> fixed-width.  Where have you read that those Spanish chars

These aren't chars, "char" is an integral C datatype of indeterminate
length and signedness.  :-)

    David> are 32-bits?  How could, e.g., "ch" be distinguished from
    David> "c" "h"?  What does it mean to be a single char?  (That it

To be a `char' is to be an element of a fixed width integral C data
type, with implementation defined signedness of size not greater than
`short'.  The only reason I'm writing this wise-ass definition is that
as far as I know it's the only definition of character that has
universal acceptance (ANSI, POSIX, ISO).

A bit of "ijime" aside, this is just plain _hard_.  "What does it mean
to be a single character?" is not a question that has a single answer,
as far as I know.  "Character-ness" is a combination of a number of
properties.

I don't know how to define `character'.  None of the (few) standards I
am familiar with define that concept; they assume it as a primitive.
Some common characteristics (but all have exceptions to my knowledge):
printed representation as a single glyph drawn from a set of glyphs
(e.g., the various ways to draw an "a", or TeX's \phi and \varphi), the
encoded representation is a single constant of a data type, specific
position in a collation order, a certain collection is sufficient to
represent a given language in written form.

An `encoded-character' is a sequence of octets (bytes) defined to
represent a character by an encoding standard.

A `glyph' is a pictorial representation of a character.

A `code point' is a specific position in a character standard; it may
or may not define a useful encoding (Japanese ku-ten comes to mind) or
collation order (Unicode comes to mind).  It simply asserts the
uniqueness of the character (although Han Unification casts serious
doubt on this, and due to the method used to collect the Chinese
National Standard character sets it is certain that the complete CNS
contains duplicates).

    David> should be displayed with a single glyph?  That two separate
    David> glyphs should not be split across lines?  Or is it a char
    David> in the sense that "qu" could be a char in English, since
    David> "q" is always followed by "u"?)

As for Spanish, it is a rule of Spanish orthography and its collation
order that 'ch' is to be treated as a single character.  In Spanish,
the sequence of characters `c' `h' doesn't exist, only the character
`ch'.  What this means is that there is no circumstance in Spanish
usage in which it would be useful to treat that sequence of bytes
separately.  Thus in Spanish, the lexicographic order would be:

    Canada, Czech Republic, Chile

because 'ch' is treated as a single character of two glyphs (two
bytes) in Spanish.  Under normal circumstances, "cz" is impossible in
Spanish, but in a borrowed or nonsense word it will be treated as two
characters.

It is not the same as "qu" in English.  The order is

    Qaddafy, quartz, qwerty

because the 'q' and the 'u' are separate characters.

Furthermore, as far as I know the Spanish _do not_ represent `ch' as a
single code point; this is certainly true in ISO-8859-1.

    David> I cannot accept that Unicode is multibyte, rather than
    David> fixed-width.  I know that there are multibyte encodings,
    David> e.g., UTF-8, but a major feature of Unicode is that it's
    David> fixed-width.  Can you quote a reference?

If UCS-2 level 3 is equivalent to Unicode, then ISO-10646 makes this
quite clear.
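
A minimal sketch of the point about fixed width, assuming Python 3 and
its standard `unicodedata` module (neither appears in the original
message, and the helper name `user_characters` is hypothetical).  It
shows one accented letter written two ways: as a single precomposed
code point, and as a base letter followed by a combining mark.  Once
combining marks are allowed, the number of 16-bit units no longer
equals the number of characters a reader would count.  The grouping
rule here is a simplification, not a full text-segmentation algorithm.

```python
import unicodedata


def user_characters(text):
    """Group each code point with any combining marks that follow it.

    Simplified illustration only: a code point whose canonical combining
    class is non-zero is attached to the preceding group.
    """
    groups = []
    for cp in text:
        if groups and unicodedata.combining(cp):
            groups[-1] += cp      # combining mark: extend previous group
        else:
            groups.append(cp)     # base character: start a new group
    return groups


precomposed = "\u00e9"    # LATIN SMALL LETTER E WITH ACUTE, one code point
decomposed = "e\u0301"    # "e" followed by COMBINING ACUTE ACCENT

# Code point counts differ although both render as the same letter.
print(len(precomposed), len(decomposed))

# Sizes in octets when each code point is stored as one 16-bit unit.
print(len(precomposed.encode("utf-16-be")),
      len(decomposed.encode("utf-16-be")))

# Four perceived characters, five code points.
word = "caf" + decomposed
print(len(user_characters(word)), len(word))

# Canonical composition maps the two-code-point form to the single one.
print(unicodedata.normalize("NFC", decomposed) == precomposed)
```

A program that restricts itself to precomposed forms can treat each
16-bit unit as one character; one that accepts combining sequences has
to do some grouping of this kind before counting or splitting text.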

There is a list of "combining characters" (although the Spanish "ch"
is not one of them as I recall---can't check, my copy of 10646 was
defective and Maruzen demanded that I return that copy before they
would procure a corrected one for me; God do I hate the University
purchasing department for forcing me to deal with sloths).  It is an
option of 10646-conformant "devices" to specify a lower level of
compliance, however, and the lower level of compliance makes it fixed
width (I'm pretty sure; I forget what UTF-16 requires in this respect,
and I guess that's not part of Unicode anyway).

Natural languages are bloody complicated beasts; whatever you do,
there are going to be exceptions in practice.  The Unicode _standard_
can do one of two things: (1) make exceptions illegal or (2) provide
code extension techniques.  I'm sure the Original Intent of the
Founding Fathers was (1), which would quite possibly be the death of
Unicode rather than of exceptions, Murphy's Law being exceeded only by
the Three Laws of Thermodynamics in terms of unbreakableness.  So they
backtracked and went for (2).

For practical purposes in normal programming usage in 99% of the
world, Unicode is fixed width.

HTH

Steve

---------------------------------------------------------------
Next TLUG Nomikai: 14 January 1998 19:15 Tokyo station
Yaesu Chuo ticket gate. Or go directly to Tengu TokyoEkiMae 19:30
Chuo-ku, Kyobashi 1-1-6, EchiZenYa Bld. B1/B2 03-3275-3691
Next Saturday Meeting: 14 February 1998 12:30 Tokyo Station
Yaesu Chuo ticket gate.
---------------------------------------------------------------
a word from the sponsor:
TWICS - Japan's First Public-Access Internet System
www.twics.com  info@example.com  Tel:03-3351-5977  Fax:03-3353-6096

- References:
  - Re: tlug: A couple of questions about Unicode
    - From: kls@example.com (Ken Schwarz)
  - Re: tlug: A couple of questions about Unicode
    - From: "J. David Beutel" <jdb@example.com>