Re: [tlug] Unicode

Date: Sat, 12 Jul 2003 21:46:37 +1000 (EST)
From: Jim Breen <jwb@example.com>
Subject: Re: [tlug] Unicode

I'll try and pick up several sets of comments.

simon colston <simon@example.com> wrote:

>> I have to agree with this.  If you want to create a Japanese-Chinese
>> dictionary in Unicode then there need to be separate codes for each
>> character that looks the same to a Westerner's eyes but very different to
>> the Japanese and Chinese.  

No you don't. There are ways of flagging language, either in markup
(which is how I'd prefer it), or by embedded language-codes. There is
no need to have different code-points for what are very minor glyph
differences. In any case, the characters that actually differ in glyph
are a very small proportion, and involve things like whether in
characters like 藤 the kusakanmuri covers the whole character or
is slid a fraction to the right. The differences between Courier and
Helvetica are more substantial than this.
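
To make the markup option concrete, here is a minimal sketch in Python.
It assumes HTML-style markup with a lang attribute, and the helper name
tag_language is made up for the illustration. The only point is that the
code point stays the same in both cases; the language attribute is what
lets a renderer choose a Japanese or a Chinese font for it.

```python
from html import escape


def tag_language(text: str, lang: str) -> str:
    """Wrap text in an HTML span that carries a language attribute.

    Hypothetical helper for illustration; not part of any standard API.
    """
    return f'<span lang="{escape(lang, quote=True)}">{escape(text)}</span>'


# The same character, hence the same code point, flagged two ways.
japanese = tag_language("藤", "ja")
chinese = tag_language("藤", "zh")

print(japanese)  # <span lang="ja">藤</span>
print(chinese)   # <span lang="zh">藤</span>
```

Which glyph actually appears then depends on the fonts the renderer has
for each language, which is where the choice belongs.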

>> I think a lack of sensitivity to these types of
>> problem are a bigger problem than a "nationalistic" desire to have one's
>> own language look like one's own language.

There is no "lack of sensitivity". That argument is quite bogus. The
people who claim that for Unicode are either under the misapprehension
that Unicode mandates glyphs (it doesn't), or are aware that it doesn't
and are saying it on the FUD principle.

Charles Muller <acmuller@example.com> wrote:
 
>> As far as I can tell, it is because the grievances are largely based on
>> misunderstandings of what Unicode is supposed to do. Almost all of the
>> grievances that I have heard from anti-Unicode people have been quibbles
>> about small, idiosyncratic differences in glyph representation, which can
>> very easily be handled at the level of font, and thus there is no problem
>> assigning a single code point.

This is it in a nutshell. Unfortunately the first edition of Unicode
was published using Chinese fonts for some of the more obscure
characters. That set the xenophobes running with a "foisting foreign
characters on us" argument. The JSA committee did the right thing by
putting multiple glyphs in JIS X 0221, but you still hear the "Unicode
looks Chinese" argument, which is a total furphy.

>> There are of course a very small percentage of _bimyou_ cases where
>> expert-level debate needs to take place to determine whether or not a
>> character is a variant of another (and if so, what kind of variant). But the
>> fact that more of these did not get hashed out at the early stages is again,
>> from what I understand, due more to the problems of non-cooperation rather
>> than unawareness or arbitrary forcing on the part of the Unicode consortium.
>> 
>> The other thing that I would like to stress is that from the early days up
>> to the present, the Unicode consortium has been quite open to suggestions
>> and reasonable proposals set forth by properly accredited groups and
>> individuals, and therefore the Unicode character set continues to grow and
>> be refined.

Quite. I strongly recommend that people track down a copy of the
overview of Han unification in the Unicode documents. 

Shimpei Yamashita <shimpei@example.com> wrote:
 
>> But that, and combining kanji glyphs, seem to be orthogonal problems to me.
>> In different CJK nations, they don't necessarily look the same, they aren't
>> read the same, and they don't even always mean the same. 

In which case they won't have been unified. Please read up on the
unification process. It was done very carefully. There is a "semantic
axis" which had to be satisfied as well as shape before unification took
place.

>> If all you wanted to
>> do was to create a coding standard in which no two languages ever clashed with
>> each other, you could have given each language's glyphs different coding
>> points. 

Yes, a French "A" and a German "A" and an English "A".

What do you mean by "no two languages ever clashed"? The process was
about codesets; not languages.

>> So why was this not done? I'm sure there were good rationales behind
>> it--coding point economy? ease of lookup?--but it doesn't lead automatically
>> from Unicode's goal as you stated it.

I really don't understand the point you are trying to make. Are you
saying because 手紙 means letter in Japanese and toilet paper in
Chinese they must be written with different codes? If so, "vent" in
English and "vent" in French had better be too, because they don't mean
the same thing.

>>  Unicode is less information-complete in
>> representing Japanese text than ISO-2022, EUC-JP, etc. So the Unicode
>> consortium made a sacrifice. 

What is missing from Unicode that is present in EUC-JP? (i.e. JIS X 0208
or JIS X 0212)

>> I'm not saying that this is necessarily bad; clearly some smart people thought
>> this was acceptable, or possibly good. So what was the rationale behind the
>> sacrifice? By rationale, BTW, I'm asking how that particular decision to make
>> a sacrifice made Unicode a *better* product than if you chose not to combine
>> the code points--surely *something* must have been gained in exchange for even
>> a very tiny step backwards for Japanese expression.

What sacrifice? Examples, please.

>>  Again, what I'd like to hear is how
>> sacrificing expressivity in certain fringe cases made Unicode better as an
>> overall product.

And I want to hear what these sacrifices were. AFAIK *every* kanji in
JIS X 0208, JIS X 0212 and (now) JIS X 0213 is in Unicode. Where is the
sacrifice?
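
Anyone who wants to test this on their own text can do a quick spot
check along the following lines. This is a sketch, assuming Python 3 and
its bundled euc_jp codec; the function name is made up, and a sample
string is evidence about that string only, not a proof about the whole
standard.

```python
def survives_unicode_round_trip(raw: bytes, codec: str = "euc_jp") -> bool:
    """Decode legacy bytes to Unicode, re-encode, and compare with the input.

    Hypothetical helper: True means nothing was lost on the way through
    Unicode for this particular byte string.
    """
    return raw.decode(codec).encode(codec) == raw


# Sample bytes in EUC-JP; substitute your own data here.
sample = "手紙と藤".encode("euc_jp")
print(survives_unicode_round_trip(sample))
```

If someone has EUC-JP text that does not come back byte for byte, that
would be exactly the kind of example I am asking for.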

>> I may have missed your previous posts on this before, but how exactly does
>> Unicode help you? And would the matter of combining code point have any effect
>> on your work? I'd be curious to know.

Well, Chuck can answer this, but as someone who has used Chuck's
collections of Buddhism information, I must say Unicode has helped it
immensely. I was poking around in that material before Unicode came
along and it was hack upon hack to try and get a meaningful
representation of a large set of characters that could not all be
found in any single standard, whether Chinese, Japanese or Korean. Unicode was
the answer to the maiden's prayer for that application.

simon colston <simon@example.com> wrote:

>>  If the same document contains 2 characters with the
>> same code point how do you specify that one should be displayed as a
>> Chinese character and the other as a Japanese character?  

By flagging the language or by embedding language codes. No great issue,
any more than <it>italic</it> is.
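
For the embedded route, one mechanism is the set of Unicode tag
characters: U+E0001 LANGUAGE TAG followed by tag characters that mirror
the ASCII letters of the language code at an offset of 0xE0000. Treating
that as what "embedding language codes" means is an assumption for this
sketch, the function name is made up, and the Unicode Standard itself
discourages these characters wherever markup is available.

```python
LANGUAGE_TAG = "\U000E0001"  # U+E0001 LANGUAGE TAG
TAG_OFFSET = 0xE0000         # tag characters mirror ASCII at this offset


def embed_language(text: str, lang: str) -> str:
    """Prefix text with an invisible Unicode language tag.

    Hypothetical helper for illustration only.
    """
    for ch in lang:
        if not 0x20 <= ord(ch) <= 0x7E:
            raise ValueError("language code must be printable ASCII")
    prefix = LANGUAGE_TAG + "".join(chr(TAG_OFFSET + ord(ch)) for ch in lang)
    return prefix + text


japanese = embed_language("手紙", "ja")
chinese = embed_language("手紙", "zh")

# The kanji keep the same code points; only the invisible prefix differs.
print(japanese[-2:] == chinese[-2:])
print([f"U+{ord(ch):04X}" for ch in japanese[:3]])
```

Markup does the same job more visibly, which is why I prefer it.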

>> What
>> I meant by my post is that I think it is perfectly reasonable for a
>> Japanese person to want printed Japanese to look like Japanese always has
>> done and not have to compromise[1].  

I quite agree, and Unicode by its existence does not prevent this from
happening. The font foundry companies might try and cut corners and have
glyphs drift towards each other, just as happened in Europe where things
like Gothic fonts have declined in favour of the ubiquitous "Carolingian
half-uncials", but that happened independently of code-sets.

>> And I don't think that not
>> compromising should be interpreted as some sort of nationalistic pride.

Again I agree, but in effect no compromising took place for the
Japanese. Every character in the Japanese national standards was
incorporated into Unicode. Even the bogus ones resulting from blotches,
smudges and scribbles when the first JIS standard was assembled went in.
The "compromises" such as they were only involved really obscure
characters that you'd have trouble finding in Morohashi.

>> Note:
>> [1] I know you said this is a font issue but I still don't understand how
>> you display Japanese and Chinese together and have the same code point
>> displayed differently.

See my earlier comments.

Jim

-- 
Jim Breen (j.breen(a)csse.monash.edu.au  http://www.csse.monash.edu.au/~jwb/)
Computer Science & Software Engineering,                Tel: +61 3 9905 3298
Monash University, VIC 3800, Australia                  Fax: +61 3 9905 5146
(Monash Provider No. 00008C)                ジム・ブリーン@モナシュ大学