
Re: tlug: unicode



--------------------------------------------------------
tlug note from Ken Schwarz <kls@example.com>
--------------------------------------------------------
Steve Turnbull wrote:
> It's complicated, like most of this stuff.  For one thing, there's
> ISO-10646 and there's Unicode.  Unicode is a subset of full ISO-10646
> which fits into 16bits and covers all of the world's character sets
> more or less, at the cost of getting glyphs wrong.  Ie, you probably
> are aware that Chinese Chinese characters look different from Japanese 
> Chinese characters, which are different again from Korean Chinese
> characters, even when they're the same.  Unicode coerces them all into 
> the same character set, at the cost of forcing a single font.  So
> Japanese Unicode fonts will look like Japanese even when displaying
> Chinese.  Readable (if you can read them), but not pretty (according
> to natives).

I thought that the controversy about Unicode was a much narrower
issue.  The "Han Unification" process was supposed to organize
characters on three axes:  meaning, shape, and style.  A character
code in the Unicode space is assigned only to characters of distinct
meaning or shape.  Differences of style are typeface differences.
This is analogous to the difference between Times Roman and Courier.
Both fonts contain the character "lowercase a" but the glyphs which
represent these characters are completely different.  What Unicode is
supposed to do is to give you a universal character assignment scheme, 
not a universal font.  If you want Chinese and Japanese, you still
need two fonts, but only one symbol set map.  You shouldn't need to
switch fonts if you are using one language, though.
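
A quick way to see the character/glyph split: the unified code point
is identical no matter which language the text is in; only the font
chosen to render it differs.  A sketch in Python (modern notation,
obviously; the escape below is just the code value of one commonly
cited unified character):

    # One unified Han character: the code point is the same in Chinese,
    # Japanese, and Korean text; the *glyph* depends on the font.
    han = "\u76f4"         # U+76F4, a commonly cited unification example
    print(hex(ord(han)))   # 0x76f4, whether the surrounding text is zh or ja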

It's true that Unicode omits a range of SJIS characters found only in
Microsoft Windows.  Although some of them are useful and fairly
popular (circled Roman numerals, for example), the Unicode committees
did not recognize them as legitimate characters, so they have no
Unicode mappings.  These characters are deprecated everywhere, so we
should see fewer and fewer of them as time goes on.  (I heard that
Mac users call these characters "goma dofu", since they show up as
plain filled blocks on the Mac screen.  And if you think the SJIS
pseudo-character problem is bad, try Korean: I heard there are
hundreds of codes which are really graphic characters, hearts,
diamonds, and so on.)
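
I can't list the exact characters, but the general shape of the
problem is easy to demonstrate: a character that lives in Microsoft's
SJIS extension rows but not in the plain JIS X 0208 table can't
survive outside Windows.  A sketch using modern Python codec names
(cp932 is Microsoft's SJIS variant, shift_jis the plain table):

    # CIRCLED DIGIT ONE sits in an NEC extension row: present in
    # Microsoft's SJIS variant, absent from the plain JIS X 0208 table.
    ch = "\u2460"
    print(ch.encode("cp932"))     # b'\x87\x40' -- the extension row
    try:
        ch.encode("shift_jis")    # the strict table has no such code
    except UnicodeEncodeError:
        print("no mapping in plain Shift_JIS")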

From what I've been able to gather, the controversy about Han
Unification seems to come from two places.  The more significant one
is where to draw the line between difference of structure and style.
If your name happens to use one of those characters which the Unicode
people say is the same as another character, you may be out of luck.
So, John Smyth becomes John Smith.  One explanation I read is that
SJIS and Unicode are based on different editions (1983 and 1990) of
the Japanese character standard.  But I haven't
been able to find any specific examples of this problem that I can
check out myself.

The other controversy is how the characters should be classified once
they are inside the Unicode space.  They're grouped by structure, but
since there are different ways to analyze a character, not everyone
agrees which character goes where.  Much of the Han Unification work
was done in China, and so the Chinese approach dominates.  Some
Japanese academics aren't pleased.

> To fix this, the full Universal Character Set uses 4 bytes, and allows 
> the Japanese to use JIS code and the Taiwanese GB and so on.

Where did you hear this?  I thought that Unicode is the equivalent of
a UCS-2 Level-3 ISO 10646 implementation.  UCS-2 means all data is
managed in 2-octet (16-bit) units.  UCS-4 is a 4-octet (32-bit)
encoding, but currently (as of 1994, at least) UCS-2 and UCS-4 assign
characters the same code values; UCS-4 just zero-extends them.  Is
the UCS-4 set now defined to encode JIS characters?  Can you point me
to a reference about this?
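
The zero-extension is easy to check mechanically, at least with a
modern toolkit.  A sketch using Python's UTF-16/UTF-32 codecs as
stand-ins for UCS-2/UCS-4 (they coincide for characters that fit in
16 bits):

    # The same character in 2-octet and 4-octet form: UCS-4 just
    # zero-extends the UCS-2 value, it does not re-encode anything.
    ch = "\u76f4"
    print(ch.encode("utf-16-be").hex())   # 76f4
    print(ch.encode("utf-32-be").hex())   # 000076f4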

> 16-bit Unicode requires translation tables, because (a) the various
> languages don't even agree on the "basic 1000" and (b) they don't
> order the ones they do agree on the same (eg, JIS orders level 1 kanji
> by yomi but Chinese orders all hanzi by radical and stroke count).

Yes, a translation table is required for Japanese.  Is none required
for Chinese?  Is that what people are upset about?  (It's a
legitimate beef, since the tables can consume 60 KB or more.)
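
For anyone who hasn't stared at the tables: consecutive JIS codes
(ordered by reading) land on scattered Unicode values (ordered
roughly by radical and stroke count), so the conversion is a lookup,
not arithmetic.  A sketch with a modern Python codec:

    # Three consecutive SJIS level-1 kanji map to scattered Unicode
    # code points -- hence the big lookup table.
    for b in (b"\x88\x9f", b"\x88\xa0", b"\x88\xa1"):
        print(b.hex(), "->", hex(ord(b.decode("shift_jis"))))
    # 889f -> 0x4e9c, 88a0 -> 0x5516, 88a1 -> 0x5a03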

> ISO8859-1 doesn't require translation tables, because it's mapped into 
> Unicode with the high byte = 00.  But other ISO-8859-* sets will
> require some translation to avoid redundancy.

I am aware that the 0-127 range of single-byte SJIS characters is
"JIS-Roman" (JIS X 0201), *not* ASCII, even though it looks much like
ASCII.  Specifically, 0x5C is a yen sign rather than '\', and 0x7E is
an overline rather than '~'.  The practical significance is that the
map of characters [0-127] from ASCII->Unicode->SJIS is not an
identity.  Is this what you are talking about?
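
To make the non-identity concrete, here is a hand-written two-entry
table for the JIS-Roman deviations (my own sketch, not a library
codec):

    # Decode bytes as JIS-Roman (JIS X 0201) instead of ASCII;
    # 0x5C and 0x7E are the two deviations.
    JIS_ROMAN = {0x5C: "\u00a5",   # YEN SIGN where ASCII has backslash
                 0x7E: "\u203e"}   # OVERLINE where ASCII has tilde

    def jis_roman_decode(raw: bytes) -> str:
        return "".join(JIS_ROMAN.get(b, chr(b)) for b in raw)

    print(jis_roman_decode(b"C:\\DOS~1"))   # prints C:(yen)DOS(overline)1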

> Unicode-compliant mostly means not trashing 16-bit codes.  This is
> mostly a problem for C-like string processing, I believe.

Any string processing that counts on NUL-terminated strings has to
be rewritten to work with 16-bit units.
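
The failure mode is easy to see: plain ASCII text in 16-bit form
contains a zero byte after every character, so a byte-oriented
strlen() stops immediately.  A Python sketch of what the C code
would see:

    # Big-endian 16-bit form of "Ab": every ASCII character drags a
    # zero byte along, and C's strlen() stops at the first one.
    data = "Ab".encode("utf-16-be")             # b'\x00A\x00b'
    print("byte-wise strlen:", data.index(0))   # 0, not the 4 bytes present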

You have to watch out for endian problems.  Unicode is officially
big-endian, but Windows, at least, stores Unicode values in
little-endian form internally.  Unless, of course, you're running
Java...then Unicode is big-endian, even on Intel machines.
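
Side by side (modern codec names again; the byte-order mark U+FEFF is
how a file declares which order it used):

    # Same code point, two serializations, plus the BOM that tells a
    # reader which one it is looking at.
    ch = "\u76f4"
    print(ch.encode("utf-16-be").hex())        # 76f4 (official order, Java)
    print(ch.encode("utf-16-le").hex())        # f476 (Windows-internal)
    print("\ufeff".encode("utf-16-le").hex())  # fffe flags little-endian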

Unicode is UCS-2 Level 3.  "Level 3" means that characters may be
combined without restriction, so it is wrong to assume that every
character is expressed in 16 bits.  The "ch" and "ll" characters in
Spanish, for example, are treated as single characters even though
they occupy 4 octets.  The bottom line is that a 16-bit unit is *not*
the same thing as a "Unicode" character, even though Windows NT and
Java seem to treat them that way.  I haven't been able to find much
information about this, but it sounds like a serious gotcha, since
"universal" text processing must be sensitive to it.  As far as I can
tell, though, this issue has nothing to do with Japanese or other
kanji fonts.  But the idea that Unicode means all characters (in the
semantic sense) are 16 bits is wrong.  Oh, well.
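
Combining sequences make the point without any CJK at all.  A sketch
(I've used an accented Latin letter, since I can't verify the Spanish
digraph details myself):

    # One character on screen, two 16-bit units in memory: 'e' plus
    # COMBINING ACUTE ACCENT renders as a single accented letter.
    s = "e\u0301"
    print(len(s), "code units:", s.encode("utf-16-be").hex())   # 2 00650301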

- Ken



