
Re: tlug: A couple of questions about Unicode



>>>>> "David" == J David Beutel <jdb@example.com> writes:

    David> Wow!  Everything I've read so far has said that Unicode is

That's because you haven't read the Unicode standard or ISO-10646, and 
probably 90% of the authors you've read haven't either, and the
remainder are keeping silent because it would just confuse you and
weaken the technical case for Unicode.

    David> fixed-width.  Where have you read that those Spanish chars

These aren't chars; "char" is an integral C datatype of
implementation-defined width and signedness.  :-)

    David> are 32-bits?  How could, e.g., "ch" be distinguished from
    David> "c" "h"?  What does it mean to be a single char?  (That it

To be a `char' is to be an element of a fixed-width integral C data
type with implementation-defined signedness and a size no greater than
that of `short'.  The only reason I'm writing this wise-ass definition
is that, as far as I know, it's the only definition of character that
has universal acceptance (ANSI, POSIX, ISO).

A bit of "ijime" aside, this is just plain _hard_.

"What does it mean to be a single character?" is not a question that
has a single answer, as far as I know.  "Character-ness" is a
combination of a number of properties.  I don't know how to define
`character'.  None of the (few) standards I am familiar with defines
that concept; they assume it as a primitive.  Some common
characteristics (all of which have exceptions, to my knowledge): a
printed representation as a single glyph drawn from a set of glyphs
(e.g., the various ways to draw an "a", or TeX's \phi and \varphi); an
encoded representation as a single constant of a data type; a
specific position in a collation order; and membership in a
collection that suffices to represent a given language in written
form.

An `encoded-character' is a sequence of octets (bytes) defined to
represent a character by an encoding standard.

A `glyph' is a pictorial representation of a character.

A `code point' is a specific position in a character standard; it may
or may not define a useful encoding (Japanese ku-ten comes to mind) or
collation order (Unicode comes to mind).  It simply asserts the
uniqueness of the character (although Han Unification casts serious
doubt on this, and due to the method used to collect the Chinese
National Standard character sets it is certain that the complete CNS
contains duplicates).
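
To make the code point / encoded-character distinction concrete, here
is a minimal Python sketch.  The sample character (n with tilde) and
the three encodings are my own choices for illustration, not anything
taken from the standards discussed above.

```python
# Illustration only: the sample character and the list of encodings
# are assumptions chosen for the example.
ch = "\u00f1"  # LATIN SMALL LETTER N WITH TILDE

# The code point: a position in the character standard.
code_point = ord(ch)

# The encoded-character: the octets each encoding standard assigns.
encoded = {name: ch.encode(name) for name in ("latin-1", "utf-8", "utf-16-be")}

print(hex(code_point))
for name, octets in encoded.items():
    print(name, octets.hex(), len(octets), "octet(s)")
```

One code point, three different octet sequences of one or two octets
each: the position in the standard and the encoding are separate
things.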

    David> should be displayed with a single glyph?  That two separate
    David> glyphs should not be split across lines?  Or is it a char
    David> in the sense that "qu" could be a char in English, since
    David> "q" is always followed by "u"?)

As for Spanish, it is a rule of Spanish orthography and its collation
order that 'ch' is to be treated as a single character.  In Spanish,
the sequence of characters `c' `h' doesn't exist, only the character
`ch'.  What this means is that there is no circumstance in Spanish
usage in which it would be useful to treat the two bytes of that
sequence separately.  Thus in Spanish, the lexicographic order would be:

		    Canada, Czech Republic, Chile

because 'ch' is treated as a single character of two glyphs (two bytes) in
Spanish.  Under normal circumstances, "cz" is impossible in Spanish,
but in a borrowed or nonsense word it will be treated as two
characters.  It is not the same as "qu" in English.  The order is

		       Qaddafy, quartz, qwerty

because the 'q' and the 'u' are separate characters.  Furthermore, as
far as I know the Spanish _do not_ represent `ch' as a single code
point; this is certainly true in ISO-8859-1.
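
The two orderings above can be reproduced with a small Python sketch.
The function spanish_key is a hypothetical helper written for this
example, not part of any library; it models only the `ch' rule
described here and ignores everything else about Spanish collation
(`ll', accents, and so on).

```python
def spanish_key(word):
    """Sort key in which 'ch' is one collation element placed after 'c'.

    Hypothetical helper: it models only the 'ch' rule from the text.
    """
    s = word.lower()
    key = []
    i = 0
    while i < len(s):
        if s.startswith("ch", i):
            # Two code points, one collation element, just after plain 'c'.
            key.append((ord("c"), 1))
            i += 2
        else:
            key.append((ord(s[i]), 0))
            i += 1
    return key


places = ["Chile", "Czech Republic", "Canada"]
spanish = sorted(places, key=spanish_key)
plain = sorted(places, key=str.lower)  # every code point on its own

# In English 'q' and 'u' are separate characters, so plain order works.
english = sorted(["qwerty", "quartz", "Qaddafy"], key=str.lower)

print(spanish)
print(plain)
print(english)
```

Note that the strings themselves still hold `c' and `h' as two
separate code points; only the sort key treats them as a unit.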

    David> I cannot accept that Unicode is multibyte, rather than
    David> fixed-width.  I know that there are multibyte encodings,
    David> e.g., UTF-8, but a major feature of Unicode is that it's
    David> fixed-width.  Can you quote a reference?

If UCS-2 level 3 is equivalent to Unicode, then ISO-10646 makes this
quite clear.  There is a list of "combining characters" (although the
Spanish "ch" is not one of them as I recall---can't check, my copy of
10646 was defective and Maruzen demanded that I return that copy
before they would procure a corrected one for me; God do I hate the
University purchasing department for forcing me to deal with sloths).
It is an option of 10646-conformant "devices" to specify a lower level
of compliance, however, and the lower level of compliance makes it
fixed width (I'm pretty sure; I forget what UTF-16 requires in
this respect, and I guess that's not part of Unicode anyway).
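
As a rough illustration of why combining characters break the
fixed-width picture, here is a sketch using Python's unicodedata
module.  Assumptions: the module tracks a much newer version of
Unicode than the one discussed here, and the e-acute sample is mine.

```python
import unicodedata

# Illustration only: e-acute is an arbitrary sample character.
composed = "\u00e9"  # precomposed form, a single code point

# Canonical decomposition: base letter 'e' followed by U+0301.
decomposed = unicodedata.normalize("NFD", composed)

# A nonzero combining class marks a combining character.
combining_flags = [unicodedata.combining(c) != 0 for c in decomposed]

# Number of 16-bit units each form occupies.
units_composed = len(composed.encode("utf-16-be")) // 2
units_decomposed = len(decomposed.encode("utf-16-be")) // 2

print(len(composed), len(decomposed))
print(combining_flags)
print(units_composed, units_decomposed)
```

The same visible character takes one 16-bit unit in precomposed form
and two when spelled with a combining accent, so "one character, one
fixed-width unit" holds only if combining sequences are excluded.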

Natural languages are bloody complicated beasts; whatever you do,
there are going to be exceptions in practice.  The Unicode _standard_
can do one of two things: (1) make exceptions illegal or (2) provide
code extension techniques.  I'm sure the Original Intent of the
Founding Fathers was (1), which would quite possibly be the death of
Unicode rather than of exceptions, Murphy's Law being exceeded only by
the Three Laws of Thermodynamics in terms of unbreakableness.  So they
backtracked and went for (2).

For practical purposes in normal programming usage in 99% of the
world, Unicode is fixed width.

HTH

Steve
