Re: [tlug] Re: Unicode (Was: apache2...)



Shimpei Yamashita <shimpei@example.com> wrote:
>> A few questions, from a complete amateur....
>> 
>> On Sat, Jul 12, 2003 at 12:45:28AM +1000,
>> Jim Breen wrote:
>> > Things don't "look" like anything in Unicode. The look comes from the
>> > font. You choose the font. You buy a Chinese-style Unicode font where 
>> > the hanzi look Chinese, or you buy a Japanese-style font. The codes
>> > stay the same.
>> 
>> Does that mean that a multilingual text document, rendered with a single
>> Unicode font, may only "look" correct in one Asian language at a time? 

Depending on the font, yes.

>> If so,
>> does it not mean that Unicode only *pretends* to be context-independent, and
>> actually depends on the user (which could be the application or the human
>> being) to provide that context because it fails to provide a context-
>> presentation mechanism internally?

Not at all. There are language codes in Unicode, and if the document has
been prepared with them, a smart application can do things like
selecting fonts according to them, or invoking spell-checkers according
to the language, or all the other language-dependent things. It's the
same with A,a,B,b, etc. Different European cultures actually have their
preferred fonts and think others look foreign, but no-one has accused
ISO-8859-* of pretense or cultural hegemony on this score.
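
As a rough illustration of what such in-band language marking can look
like, here is a minimal sketch. It assumes the "language codes" meant
above are the Plane 14 tag characters (U+E0001 LANGUAGE TAG followed by
tag copies of ASCII at U+E0020..U+E007E); the helper names
language_tag and read_language_tag are made up for this example. Later
versions of Unicode deprecated this use of tags in favour of markup
such as an HTML lang attribute, so treat it as a picture of the idea
rather than a recommendation.

```python
import unicodedata

LANGUAGE_TAG = "\U000E0001"  # U+E0001 LANGUAGE TAG


def language_tag(code: str) -> str:
    """Build the Plane 14 tag sequence for an ASCII language code such as 'ja'."""
    if not code or not all(0x20 <= ord(ch) <= 0x7E for ch in code):
        raise ValueError("language code must be printable ASCII")
    # Each tag character is the ASCII code point shifted up by 0xE0000.
    return LANGUAGE_TAG + "".join(chr(0xE0000 + ord(ch)) for ch in code)


def read_language_tag(text: str):
    """Return the language code if text starts with a tag sequence, else None."""
    if not text.startswith(LANGUAGE_TAG):
        return None
    chars = []
    for ch in text[1:]:
        cp = ord(ch)
        if 0xE0020 <= cp <= 0xE007E:
            chars.append(chr(cp - 0xE0000))
        else:
            break
    return "".join(chars)


# Tag a short piece of text (U+624B U+7D19) as Japanese.
tagged = language_tag("ja") + "\u624b\u7d19"
print(read_language_tag(tagged))
print([unicodedata.name(ch) for ch in language_tag("ja")])
```

An application that understood the tag could pick a Japanese-style font
for the text that follows it; one that did not would simply ignore the
invisible tag characters.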

>> > Be that as it may, EVERY kanji in JIS X 0208 and JIS X 0212 ended up in
>> > Unicode 1.0. What is called the "source separation rule" meant that if 
>> > a kanji/hanzi/hanja pair that would otherwise be unified occurs
>> > multiply in one of the national standards, then it appears multiply in
>> > Unicode. Thus all six versions of the "ken" kanji, which blind Freddie
>> > could tell are really the same, are dutifully replicated in Unicode,
>> > because that's the way they are in JIS X 0208.
>> 
>> That doesn't seem to solve the above problem at all, which involves
>> *different* countries using different glyphs for the "same" character.

No, I mentioned that because people still say Unicode is "missing some
kanji", and "was prepared ignoring national wishes", which is where
this thread started.

>> Jim, what I don't quite understand is this: exactly what problem is Unicode
>> meant to solve anyway? 

The key problem was the inability of the pre-Unicode codes to mix
languages in a usable way. Have you ever tried to mix Japanese with 
French or German? It was only possible before Unicode by using ISO-2022
escaping, which is a truly horrible way to handle text. In the case of
the "CJK" languages it was worse. At least with ordinary alphabetics an
"a" or a "b" tended to be the same regardless of language, but with the 
CJK languages, something like 手紙 was coded differently for 
every language. If you were mixing, say, Japanese and Korean in a
document and doing a string search you could find yourself in a tizz.
Of course font-rendering in a mixed-code system is a nightmare. Remember
that one of the groupings driving Unicode was the collection of computer
companies: Sun, IBM, Apple, Microsoft, etc. To them it was a real
problem that needed fixing.
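
The string-search problem can be seen in a small sketch like the one
below. It assumes Python's bundled codecs (euc_jp, shift_jis, euc_kr,
big5) as stand-ins for the national encodings being discussed; the
helper name legacy_bytes is invented for the example, and a codec that
cannot represent the text is reported as None rather than guessed at.

```python
TEXT = "\u624b\u7d19"  # the two-character word used as the example above

# Assumed stand-ins for the pre-Unicode national encodings.
CODECS = ("euc_jp", "shift_jis", "euc_kr", "big5")


def legacy_bytes(text, codecs=CODECS):
    """Map each codec name to the bytes it produces for text (None if unencodable)."""
    result = {}
    for name in codecs:
        try:
            result[name] = text.encode(name)
        except UnicodeEncodeError:
            result[name] = None
    return result


encoded = legacy_bytes(TEXT)
for name, raw in encoded.items():
    print(name, raw.hex(" ") if raw is not None else "(not encodable)")

# In Unicode the same text is one sequence of code points, whatever the language.
print([f"U+{ord(ch):04X}" for ch in TEXT])
```

A byte-level search for the EUC-JP form of the word will not match the
same word stored in one of the other encodings, which is the "tizz"
described above; once everything is decoded to Unicode, a single
comparison works.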

>> Given that, what rationale went into the decision to
>> combine certain glyphs between countries that caused so much grief among
>> your opponents? It's easy to dismiss Unicode opponents as nationalist
>> counter-revolutionaries, but it isn't clear to me (yet) that the Unicode camp
>> has addressed their grievance adequately.

You can only go so far addressing irrationality. With people saying that
a 十 (kanji) can't be unified with a 十 (hanzi) because
one is essentially Japanese and the other irrevocably Chinese, no
addressing is possible, short of abandoning the whole process. Imagine
if the French and the Italians demanded their own alphabets as a matter
of national pride and identity.

One argument mounted by the anti-Unicode people was that it was "unfair"
to unify hanzi/kanji/hanja when the "Latin" and "Greek" portions of
Unicode retained distinct identical characters (e.g. A.)  In fact the
"Source Separation Rule" I mentioned before, which was brought in at the
insistence of the CJK countries, requires this to be the case. JIS X
0208 has two identical letter A codings: Latin A (2341) and Greek
Α (2621), not to mention Cyrillic А (2721), which is almost the
same.
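
For anyone who wants to check those three codes, here is a minimal
sketch. It assumes the four-digit numbers are hexadecimal JIS X 0208
codes and converts them to EUC-JP (by setting the high bit of each
byte) so that Python's euc_jp codec can decode them; the function name
jis_to_unicode is made up for the example.

```python
import unicodedata


def jis_to_unicode(jis_hex: str) -> str:
    """Decode a 4-hex-digit JIS X 0208 code (e.g. '2341') via its EUC-JP form."""
    jis = bytes.fromhex(jis_hex)
    # EUC-JP represents a JIS X 0208 character as the same two bytes with bit 7 set.
    euc = bytes(b | 0x80 for b in jis)
    return euc.decode("euc_jp")


for code in ("2341", "2621", "2721"):
    ch = jis_to_unicode(code)
    print(code, f"U+{ord(ch):04X}", unicodedata.name(ch))
```

Each code comes out as a separate Unicode character (a fullwidth Latin
A, a Greek alpha and a Cyrillic A), even though the glyphs are all but
indistinguishable.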

Cheers

Jim

-- 
Jim Breen (j.breen(a)csse.monash.edu.au  http://www.csse.monash.edu.au/~jwb/)
Computer Science & Software Engineering,                Tel: +61 3 9905 3298
Monash University, VIC 3800, Australia                  Fax: +61 3 9905 5146
(Monash Provider No. 00008C)                ジム・ブリーン@モナシュ大学
