
Re: [tlug] Re: Unicode



Shimpei Yamashita wrote:

> But that, and combining kanji glyphs, seem to be orthogonal problems to me.
> In different CJK nations, they don't necessarily look the same, they aren't
> read the same, and they don't even always mean the same. If all you wanted to
> do was to create a coding standard in which no two languages ever clashed with
> each other, you could have given each language's glyphs different coding
> points. So why was this not done? I'm sure there were good rationales behind
> it--coding point economy? ease of lookup?--but it doesn't lead automatically
> from Unicode's goal as you stated it.

One could obviously take different positions on this--as you make clear, it
is not a cut-and-dried point. But I would still argue (as, obviously, anyone
else who supports Unicode's efforts would) that there is clearly enough
historical, semantic, and graphical commonality to justify the attempt to
combine them into a single code set. This is something that most Chinese
character experts that I know will agree on without question. As you point
out, there *are* some characters that carry different meanings in the various
East Asian languages, but certainly not--I would argue--to an extent that
necessitates the creation of three entirely separate code areas.

The other problem here, complicated by the history of Chinese character
glyphs, is that you simply can't separate most characters as being
particularly Japanese, Chinese, Korean, etc. (except for a small number of
_kokuji_ and the equivalent sort of thing in other countries). The matter
becomes complicated in a manner that can't simply be resolved by literary
scholars or by computer programmers.

Take, for example, some of the classical texts that are considered to be
foundational for the study of East Asian culture, such as the _Analects_
(Rongo, Lunyu), Daodejing, the Buddhist canon, the Chinese histories, and so
forth. While these originated in China, they are part and parcel of
Vietnamese, Korean, and Japanese culture. Should each of the characters in
each of these texts receive a different code point? That is terribly
redundant, not to mention all the problems that would result when people try
to do searches and share data. Sure, mapping tables could be created, but
that would probably end up being more work than just trying to resolve a
relatively few sticking points in an attempted unification.
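
To make the search point concrete, here is a minimal sketch (mine, not
anything from the standard) of what unification buys you in practice:
because a unified character has a single code point, the same string matches
across texts of Japanese, Chinese, or Korean provenance, with no mapping
table in between.

    # A minimal sketch of the search/sharing point. Because Han
    # unification gives 論 a single code point (U+8AD6), a citation of
    # the Analects in a "Japanese" document and a "Chinese" document is
    # the same string, and a plain substring search just works.
    rongo = "論語"   # the Analects as cited in Japanese (Rongo)
    lunyu = "論語"   # the same text as cited in Chinese (Lunyu)

    assert rongo == lunyu        # identical code point sequences
    assert "論" in rongo         # one search covers both traditions

    # Had each language received its own code area, every such search
    # would first have to pass through a mapping table, something like
    #     equivalents = {japanese_codepoint: chinese_codepoint, ...}
    # maintained for tens of thousands of characters, times N languages.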

There also were problems regarding code point economy, partly the result of
deliberate planning, perhaps, but also of a perception of how computing was
going to evolve. With a two-byte encoding you can have only 65,536 (2^16)
code points to share among all the languages of the world. They couldn't
just assume the ability to do everything in four-byte encoding since,
although that would easily cover any character of any language, living or
dead, there would be no software that would be able to support it (and
practically speaking, there still isn't--right?). So if you dedicate an
area of, say, 30,000 code points--roughly what you would realistically need
to cover the characters of one East Asian country in any sort of
comprehensive way--then with just two languages you're already out of
space. So I guess you can say there is a problem of economy, or at least
there was.
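
To put rough numbers on it (my own back-of-the-envelope arithmetic; the
30,000 figure is the guess above, not an official quota):

    # Code-point economy in a 16-bit space.
    two_byte_space = 2 ** 16          # 65,536 code points in two bytes
    per_language = 30_000             # rough comprehensive CJK coverage
    print(two_byte_space - 2 * per_language)   # 5536 left after just
                                               # two separate languages

    # Unification instead fit the shared repertoire into one block,
    # U+4E00..U+9FFF, about 21,000 code points serving all three:
    print(0x9FFF - 0x4E00 + 1)                 # 20992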

> As an academic, I'd hope you're above hand-waving problems away as "very
> easy" when you're explaining things to an amateur. The fact that you need font
> information, as well as coding information, in order to have a completely
> accurate rendition of the intended text implies that you're trying to
> hand-wave away a fundamental problem: Unicode is less information-complete in
> representing Japanese text than ISO-2022, EUC-JP, etc. So the Unicode
> consortium made a sacrifice. 

This may be so, but I wonder how much less information-complete it really
is, and whether it may not be the case that someone is already working on
adding that information.
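
For what it's worth, the usual answer is to carry the language information
out of band, in markup rather than in the code points. A sketch (the
surrounding element names are my own invention; xml:lang itself is
standard):

    # The code points stay unified; markup such as xml:lang records
    # which national glyph conventions a renderer should pick.
    import xml.etree.ElementTree as ET

    XML_NS = "{http://www.w3.org/XML/1998/namespace}"
    doc = ET.fromstring(
        '<text>'
        '<p xml:lang="ja">直</p>'    # render with a Japanese font
        '<p xml:lang="zh">直</p>'    # render with a Chinese font
        '</text>'
    )
    for p in doc:
        # the same U+76F4 both times; only the tag differs
        print(p.get(XML_NS + "lang"), hex(ord(p.text)))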

> OK, that was the process. However, as an end-user (aka luser), I don't give a
> cow what the *process* was; I care about the *results*, because the results
> are what I will be using every day, not the process. If you want to present
> Unicode as an acceptable alternative for expressing Japanese, it needs to be
> done in ways other than "well, too bad about your problems, because you people
> were un-cooperative while we were working on it and we figured they were minor
> anyway" (sorry for paraphrasing your argument, but I can't draw any other
> conclusion from what you've said so far). Again, what I'd like to hear is how
> sacrificing expressivity in certain fringe cases made Unicode better as an
> overall product.

To take the argument the other way, I don't see where it has fallen so short
as compared to what existed previously. What can't we do now that we could
do before?

> If I hadn't known beforehand that you are a well-intentioned person, this
> sentence would have made me very angry. You know you're talking to a
> properly non-accredited individual who has no inside contacts; ergo, in lack
> of other information, the above paragraph is exactly equivalent to "plebian,
> the Unicode consortium is not interested in your needs or thoughts"! I assume
> that wasn't the precise message you wanted to convey, though. What did you
> really want to say?

I see how this comes off. It wasn't intended that way, and thanks for
leaving me a way out. My point was that not every Tom, Dick, and Harry can
just send in suggestions for additions to the character set and expect them
to be accepted. You need to be involved with a group that has some sort of
official recognition. It has to be
that way, because now that Unicode is gradually coming into acceptance,
there are new proposals coming in all the time.

That being said, if you really felt you had something to contribute, you
need not necessarily be an academic (I guess it wouldn't hurt, of
course). You can become a consortium member yourself, find out what sorts
of groups are making proposals, and try to get involved. I have been
involved as an observer in some JIS and KSC (Korean Standard) meetings on
CJK, and I'll tell you, I think that the process as it is handled by Unicode
is far more transparent than it is for the local national standards.

> > I don't say that Unicode is problem-free. But I can tell you that people
> > like myself who work with classical East Asian literary texts would still be
> > in the dark ages if Unicode had not come along. Maybe some day in the future
> > Unicode will be replaced with something better, and if so, that's fine. But
> > to have left things in the fragmentary form they were would have been
> > absurd.
> 
> I may have missed your previous posts on this before, but how exactly does
> Unicode help you? And would the matter of combining code point have any effect
> on your work? I'd be curious to know.

In my work as a student of Asian history, literature, and thought, the
documents that I deal with in my research, the papers I write, and the
dictionaries I am composing contain scripts not only from Japanese,
Chinese, and Korean (not just the CJK characters--also the kana, hangul,
etc.), but from all sorts of other languages, European and Asian, including
Sanskrit (in devanagari and similar scripts), plus all the odd diacritics
for rendering all of these languages into roman script. None of the
diacritics contained in the "Latin Extended Additional" area of Unicode
were in any of the previous national character sets.
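
To illustrate (a sketch; the sample letters are ones typical of
Indological romanization, as in "nirvāṇa" or "saṃgha"):

    # The dotted letters used to romanize Sanskrit, Pali, and Buddhist
    # terminology live in Latin Extended Additional (U+1E00-U+1EFF) and
    # were absent from JIS, KSC, and Big5.
    import unicodedata

    for ch in "ṇḍṣṃḥ":
        print(f"U+{ord(ch):04X}", unicodedata.name(ch))
    # U+1E47 LATIN SMALL LETTER N WITH DOT BELOW
    # U+1E0D LATIN SMALL LETTER D WITH DOT BELOW, and so on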

Before they came out with Unicode 1.0, if I had written a paper in JIS, I
couldn't even send it to a colleague who was using a Korean or Chinese OS
and have him/her be able to read any of the kanji--even though their systems
had all the kanji in their own code. And even for basic coverage of
Japanese or East Asian historical terms, JIS (even JIS X 0212) was just too
narrow. If I wanted to use JIS, I couldn't include hangul, and the coverage
of diacritics was very limited. The same sorts of problems existed if I
wanted to work in Big5, KSC, or whatever. Of course I could have encoded
the characters as HTML entities, but that was a huge headache and made
searching nearly impossible.
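
That incompatibility is easy to demonstrate today (a sketch using Python's
codecs; exactly which character fails first varies by codec):

    # A line mixing kanji, hangul, and a pinyin tone mark fits in UTF-8
    # but in none of the single national encodings of the day.
    line = "論語 논어 Lunyǔ"

    line.encode("utf-8")          # fine: everything has a code point

    for legacy in ("euc_jp", "euc_kr", "big5"):
        try:
            line.encode(legacy)
        except UnicodeEncodeError as e:
            # e.g. euc_jp and big5 choke on the hangul, euc_kr on the ǔ
            print(legacy, "cannot represent:", repr(e.object[e.start]))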

For someone like Jim Breen, who was focusing, especially in the beginning,
on Japanese-oriented materials, this was not such a pressing problem, I
think, but for people like myself who need to work in a variety of
languages, using Unicode 1.0 for the first time was like being in
heaven. Without it, the dictionaries I am building would be an
impossibility, not just at the level of editing and storage, but simply in
terms of receiving data from contributors. As much as I hate M$, the fact
that everyone around the world can send me data in Word files that I can
save as XML-Unicode text makes the difference between projects based on
international collaboration being doable or not.

I don't see Unicode as a perfect solution, but I do think it was a step that
needed to be taken. It should be criticized so that it can grow in the right
way, but the kind of "xenophobic" (as Jim puts it) resistance that we often
see in Japan continues to give the impression that it was arbitrarily forced
on people and that nothing can be done to improve it. That's the main point
I want to refute.

Chuck

---------------------------
Charles Muller  <acmuller@example.com>
Faculty of Humanities,  Toyo Gakuen University
Digital Dictionary of Buddhism and CJKV-English Dictionary [http://www.acmuller.net]
H-Buddhism List Editor [http://www2.h-net.msu.edu/~buddhism/]
Mobile Phone: 090-9310-1787

