Re: tlug: A couple of questions about Unicode

To: tlug@example.com
Subject: Re: tlug: A couple of questions about Unicode
From: Jon Babcock <jon@example.com>
Date: 10 Jan 1998 03:30:23 -0700
Cc: michael@example.com
In-Reply-To: Taro Yamamoto's message of Sat, 10 Jan 1998 16:03:40 +0900
References: <Pine.LNX.3.96LJ1.1b7.980110093817.18865A-100000@example.com> <34B71D4C.1684ACAD@example.com>
Reply-To: tlug@example.com
Sender: owner-tlug@example.com

>>>>> "TY" == Taro Yamamoto <tyamamot@example.com> writes:

    TY> Craig Oda wrote:

    >> that there is a Japanese book out about how bad unicode is for
    >> the Japanese.  Evidently, it was a best seller in Japan.

First, does anyone have the title or any bibliographic info on this
book?

                           --- --- ---

I think Yamamoto-san has correctly identified two of the most
important issues regarding kanji encoding in Unicode:

1)
    TY> It means that each character defined in such standards is not
    TY> a "representation (instance)" but a "prototype" of the
    TY> character as functional and semantic information unit.
        <snip>
    TY> All such talks about "representation" model of characters,
    TY> glyphs and fonts is beyond the scope of Unicode and character
    TY> set standards (this is a very important point when one comments
    TY> on a character set standard such as Unicode).

To understand Unicode, it is essential to make this distinction. Each
kanji included in the Unified Han Repertoire is, or should be, a
prototype, an "abstract class", if you will, and not a concrete
instantiation of that class. The task of instantiation or
representation is left to the font makers. An instance of the class, a
representation of the prototype, is usually referred to as a
"glyph". This is becoming better understood, but early in the
development of Unicode, the failure to clearly understand the value of
this distinction, and to apply it, was the cause of much confusion, it
seems to me.
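
As a rough illustration of the distinction (a minimal sketch, assuming
Python 3 and its standard unicodedata module; the helper name describe
is made up for this example): what the standard encodes is an identity,
a number and a name, and nothing about shape. Which glyph appears on
screen is decided later, by whatever font is in use.

```python
import unicodedata

def describe(ch: str) -> str:
    """Return the code point and character name of ch.

    Nothing here says anything about shape: the glyph is chosen by
    whichever font eventually renders the character.
    """
    return f"U+{ord(ch):04X} {unicodedata.name(ch)}"

kanji = "\u6f22"  # the 'kan' of 'kanji'
print(describe(kanji))
```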

But, as Yamamoto says, 

    TY> Unicode, it is based on source character sets (such as JIS X
    TY> 0208 and 0212)

Unicode was based on *existing* character sets. To the examples just
mentioned, GB and Big5 can be added, and there were others.

In short, Unicode attempted the impossible. On the one hand there was
the laudable goal of compiling a list of the minimum number of kanji
prototypes, no two of which would be the same, that could be mapped to
all or nearly all of the kanji glyphs (representations of the
prototype that can be seen with the eyes and not merely conceived of
in thought) actually in use in kanji-using scripts (CJK). On the other
hand, there was the perceived necessity (politics played a role here)
of accommodating existing character sets already in use by
computers. Unicode made a gallant attempt to reconcile these opposing
forces, but the result is a compromise, albeit a rather practical one,
IMO.

So, in no small number of cases, we have *more than one Unicode
character for the same prototype*. What in reality is merely a glyph
variant, an alternate form of instantiation, is incorrectly elevated
to the status of a prototype, an abstract class. Conversely, due to
inherent limitations of the existing source character sets, Unicode
provides *no prototype at all for most of the glyphs represented in
the traditional repertories*, such as the Kangxi Dictionary, or in
more up-to-date versions of those, such as the large Morohashi kanwa
dictionary. It could be argued that Unicode should have concentrated
on developing a real unified Han character set of its own and
forgotten about accommodating the existing national sets. But the
counter-argument might then be that, had it done so, Unicode would
never have been more than an academic exercise unable to gain a
toehold in the "real world".
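
One place where the duplication can be seen directly is the CJK
Compatibility Ideographs block, where a character already present in
the unified repertoire is encoded a second time so that text can
round-trip through a source character set. This may not be exactly
the kind of case meant above (glyph variants given separate unified
code points), but it is easy to check. A minimal sketch, assuming
Python 3 and its standard unicodedata module; canonical_target is a
hypothetical helper name:

```python
import unicodedata

COMPAT = "\uf900"   # CJK COMPATIBILITY IDEOGRAPH-F900
UNIFIED = "\u8c48"  # CJK UNIFIED IDEOGRAPH-8C48

def canonical_target(ch: str) -> str:
    """Return the single character ch canonically decomposes to, else ch."""
    decomp = unicodedata.decomposition(ch)
    if not decomp or decomp.startswith("<"):
        # No decomposition at all, or only a tagged compatibility mapping.
        return ch
    parts = decomp.split()
    if len(parts) != 1:
        # A multi-character decomposition is not a simple duplicate.
        return ch
    return chr(int(parts[0], 16))

# Two distinct code points ...
print(COMPAT == UNIFIED)
# ... that the standard itself treats as the same character.
print(canonical_target(COMPAT) == UNIFIED)
print(unicodedata.normalize("NFC", COMPAT) == UNIFIED)
```

Normalization folds the compatibility code point into the unified
one, which amounts to the standard admitting that the two code points
stand for one prototype.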

Unicode appears to be trying to strike a balance between two
approaches. One is to include in its list just the minimum elements
(graphemes) of a script that are needed to write it, handing the task
of composing and rendering those graphemes to the OS or the
application. The other is to provide a certain amount of that
composition service ready-made within itself. (The big addition, in
Unicode version 2, of the Hangul composite characters stands as a good
example of the latter.) Although this sort of muddy mixture, of
compromise, does not appeal to the purists (me included), it may be
that it is the only approach that had any chance of acceptance under
current conditions. I don't know.
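
The Hangul case makes the trade-off concrete: every precomposed
syllable can be computed from its component jamo by simple
arithmetic, so the precomposed block is exactly the kind of
ready-made composition described above. A minimal sketch, assuming
Python 3; the constants are those of the Hangul syllable algorithm in
the Unicode Standard, while the function names are hypothetical:

```python
import unicodedata

# Base code points and counts from the Unicode Hangul syllable algorithm.
S_BASE, L_BASE, V_BASE, T_BASE = 0xAC00, 0x1100, 0x1161, 0x11A7
L_COUNT, V_COUNT, T_COUNT = 19, 21, 28  # T_COUNT includes "no trailing consonant"
S_COUNT = L_COUNT * V_COUNT * T_COUNT   # number of precomposed syllables

def compose_syllable(l: int, v: int, t: int = 0) -> str:
    """Build a precomposed syllable from jamo indices (t=0: no trailing)."""
    if not (0 <= l < L_COUNT and 0 <= v < V_COUNT and 0 <= t < T_COUNT):
        raise ValueError("jamo index out of range")
    return chr(S_BASE + (l * V_COUNT + v) * T_COUNT + t)

def decompose_syllable(syllable: str) -> str:
    """Split a precomposed syllable into its conjoining jamo."""
    index = ord(syllable) - S_BASE
    if not 0 <= index < S_COUNT:
        raise ValueError("not a precomposed Hangul syllable")
    l, rest = divmod(index, V_COUNT * T_COUNT)
    v, t = divmod(rest, T_COUNT)
    jamo = chr(L_BASE + l) + chr(V_BASE + v)
    if t:
        jamo += chr(T_BASE + t)
    return jamo

han = compose_syllable(18, 0, 4)  # hieuh + a + nieun
print(f"U+{ord(han):04X}")
print([f"U+{ord(j):04X}" for j in decompose_syllable(han)])
# The same split is what canonical decomposition (NFD) produces.
print(decompose_syllable(han) == unicodedata.normalize("NFD", han))
```

A minimal-grapheme encoding would have needed only the jamo; the
S_COUNT precomposed code points are the service provided "within
itself".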

Jon Babcock
jon@example.com
