
Re: tlug: A couple of questions about Unicode



Here are some notes I wrote a while ago on the subject of Unicode and
conversion to other Japanese encodings.  I'd appreciate comments from
others with experience in this.  Hope it helps.

- Ken

------

Unicode is supposed to be the ultimate encoding: a single uniform
standard that handles every language in the world.  Java is helping
make this a reality.  Unicode comes close to being everything for
everyone in Japan, but it is flawed in some minor and not-so-minor
ways.

The biggest problem with Unicode is that vendors have not implemented
it the same way.  For example, the SJIS<->Unicode mapping used by
Sun's Java VM on Windows NT differs slightly from the one Windows NT
itself uses internally.  So much for universality.  Because these
differences appear not only across platforms but across different
applications running on the same platform, no single conversion can
be correct everywhere.  What we propose is a canonical converter
(correct for 99.99% of a user's needs) plus the ability to override
the canonical converter in a generic way.  We can publish override
values in FAQs or documentation and update them as new issues arise.
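To make that concrete, here is a rough sketch in Java of a canonical
converter with a generic override step.  The override table contents
are my own illustrative assumptions, and "Shift_JIS" is simply the
charset name I expect the JVM to register for SJIS:

    import java.io.UnsupportedEncodingException;
    import java.util.HashMap;
    import java.util.Map;

    public class SjisConverter {
        // Code points to patch after the JVM's own conversion runs.
        // In practice this table would be loaded from a config file
        // maintained alongside the FAQ.
        private final Map<Character, Character> overrides =
            new HashMap<Character, Character>();

        public void addOverride(char from, char to) {
            overrides.put(Character.valueOf(from), Character.valueOf(to));
        }

        public String decode(byte[] sjisBytes)
                throws UnsupportedEncodingException {
            // Canonical step: whatever mapping this JVM ships with.
            String s = new String(sjisBytes, "Shift_JIS");
            // Override step: fix up code points where this platform's
            // mapping disagrees with the one the application wants.
            StringBuilder out = new StringBuilder(s.length());
            for (int i = 0; i < s.length(); i++) {
                char c = s.charAt(i);
                Character repl = overrides.get(Character.valueOf(c));
                out.append(repl == null ? c : repl.charValue());
            }
            return out.toString();
        }
    }

For example, addOverride('\uFF5E', '\u301C') would normalize the
well-known fullwidth-tilde vs. wave-dash disagreement between the
Microsoft and JIS mapping tables.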

Among the minor issues, the most important is that Unicode omits a
range of SJIS characters found only in Microsoft Windows.  Though
these characters are useful and fairly popular (circled roman
numerals, for example), the Unicode committees did not recognize them
as legitimate characters, and so they were given no Unicode mappings.
These characters are deprecated everywhere, so we should see fewer
and fewer of them as time goes on.
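Whether such a character survives conversion depends entirely on
which mapping table the converter uses.  A quick check in Java,
assuming the JRE ships both a JIS-based "Shift_JIS" charset and a
Windows "windows-31j" one (exact results depend on the JRE's tables;
today's tables do assign Unicode code points to most of the Windows
extensions, but only the Windows variant of the converter knows
about them):

    import java.nio.charset.Charset;
    import java.nio.charset.CharsetEncoder;

    public class MappableCheck {
        public static void main(String[] args) {
            // U+2460 CIRCLED DIGIT ONE: present in the Windows SJIS
            // extensions, absent from plain JIS X 0208-based Shift JIS.
            char c = '\u2460';
            CharsetEncoder jis = Charset.forName("Shift_JIS").newEncoder();
            CharsetEncoder win = Charset.forName("windows-31j").newEncoder();
            System.out.println("Shift_JIS:   " + jis.canEncode(c)); // expect false
            System.out.println("windows-31j: " + win.canEncode(c)); // expect true
        }
    }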

The second is that there are a handful of cases where a pair of SJIS
characters maps to a single character in Unicode.  This class of
exceptions is considered extremely minor in practice; it results from
SJIS and Unicode being based on different editions (1983 and 1990) of
the Japanese character standards.  It seems to have no practical
impact on the use of Unicode in Japan and is objectionable primarily
to some academics involved in the standards committees.
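The practical symptom is that SJIS -> Unicode -> SJIS is not always an
identity on the bytes: when two byte sequences decode to the same
character, the encoder has to pick one of them on the way back.  A
sketch of a round-trip check (the byte pair below is the commonly
cited duplicate in the Windows code page, 0xED40 and 0xFA5C both
mapping to U+7E8A; I have not verified every JVM's tables, so treat
it as illustrative):

    import java.io.UnsupportedEncodingException;
    import java.util.Arrays;

    public class RoundTrip {
        // True if the bytes survive SJIS -> Unicode -> SJIS intact.
        static boolean survives(byte[] sjis, String charset)
                throws UnsupportedEncodingException {
            String unicode = new String(sjis, charset);
            byte[] back = unicode.getBytes(charset);
            return Arrays.equals(sjis, back);
        }

        public static void main(String[] args)
                throws UnsupportedEncodingException {
            byte[] a = { (byte) 0xED, (byte) 0x40 }; // NEC selection of IBM ext.
            byte[] b = { (byte) 0xFA, (byte) 0x5C }; // IBM extension
            // At most one of these can print true.
            System.out.println(survives(a, "windows-31j"));
            System.out.println(survives(b, "windows-31j"));
        }
    }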

It's also worth pointing out that many people confuse "Unicode" and
"ISO 10646".  Unicode is the equivalent of a UCS-2, Level-3
implementation of ISO 10646.  UCS-2 means all data is managed in
2-octet (16-bit) units, versus the 4-octet (32-bit) units of UCS-4.
Level-3, however, means that characters may be combined without
restriction, so it is wrong to assume that every character fits in
16 bits.  A base letter plus a combining accent, for example, is a
single character expressed in 4 octets.  Unicode is not simply a
wide-char version of 8-bit char data; it is itself a variable-width
(multibyte) encoding.  Going with Unicode to avoid the complexities
of multibyte handling is therefore misguided.
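You can see this directly in Java: a string holding a base letter
plus a combining accent is one character to the reader but two 16-bit
units to String.length():

    public class CombiningDemo {
        public static void main(String[] args) {
            // 'e' followed by U+0301 COMBINING ACUTE ACCENT: one
            // visible character, four octets of UCS-2 data.
            String s = "e\u0301";
            System.out.println(s);          // renders as an accented e
            System.out.println(s.length()); // prints 2: 16-bit units,
                                            // not characters
        }
    }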

In practice, though, it looks like "16-bit" and "Unicode" are
becoming synonymous, since both Microsoft and Java treat them that
way.

Since Unicode characters are 16-bit quantities, their byte order
depends on the platform architecture.  On little-endian systems
(Intel), the low-order byte comes first, whereas on big-endian
systems (Sparc, HP, Mips) the high-order byte comes first.  Java
Unicode is always big-endian, even on Windows machines!
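This is easy to confirm: java.io.DataOutputStream writes chars high
byte first by specification, whatever the host CPU:

    import java.io.ByteArrayOutputStream;
    import java.io.DataOutputStream;
    import java.io.IOException;

    public class ByteOrderDemo {
        public static void main(String[] args) throws IOException {
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(buf);
            out.writeChar('\u3042');  // HIRAGANA LETTER A
            byte[] b = buf.toByteArray();
            // Prints "30 42" (high byte first) even on a
            // little-endian Intel box.
            System.out.printf("%02X %02X%n", b[0], b[1]);
        }
    }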
