Re: tlug: unicode
- To: tlug@example.com
- Subject: Re: tlug: unicode
- From: "Stephen J. Turnbull" <turnbull@example.com>
- Date: Tue, 27 May 1997 15:22:11 +0900
- In-reply-to: Your message of "Mon, 26 May 1997 20:56:54 -0400." <199705270056.UAA27649@example.com>
- Reply-To: tlug@example.com
- Sender: owner-tlug
----------------------------------------------------------------
tlug note from "Stephen J. Turnbull" <turnbull@example.com>
----------------------------------------------------------------

>>>>> "Ken" == Ken Schwarz <kls@example.com> writes:

Ken> Steve Turnbull wrote:

 >> Unicode coerces them all into the same character set, at the
 >> cost of forcing a single font.  So Japanese Unicode fonts will
 >> look like Japanese even when displaying Chinese.  Readable (if
 >> you can read them), but not pretty (according to natives).

Ken> A character code in the Unicode space is assigned only to
Ken> characters of distinct meaning or shape.  Differences

And the "source separation" rule.

Ken> of style are typeface differences.

True.  It is important to users, however; otherwise Adobe wouldn't
be in business.

Ken> If you want Chinese and Japanese, you still need two fonts,
Ken> but only one symbol set map.

I don't think so.  For example, suppose I'm grepping for all the
Japanese words in a Chinese-language nihongo textbook.  Note that,
properly done, that textbook probably uses separate Chinese and
Japanese fonts for the same character, depending on which language
it occurs in.  Given the stylistic difference mentioned above, the
human eye can pick these things out immediately.  Given a 31-bit
code space, a UCS-4 grep can too.

There are other ways to do this, of course.  For example, you can
put in language tags (escape sequences).  But then Unicode, for this
purpose, looks just like ISO-2022.  Excuse me for not being
thrilled :-)

This kind of multilingual issue is not a huge deal for most people,
of course.  But go a little farther: for most people, Shift-JIS does
just fine.  I think we've just reached Jim Breen's limit of
tolerance.  No JIS X 0212. :-)

Unicode basically matters in really multilingual environments (like
language textbooks and comparative philology) and for systems
implementors.
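The difference between tagging and unification shows up directly in
the byte stream.  A minimal sketch in modern Python, whose
iso2022_jp codec stands in for a language-tagging scheme (the codec
names are Python's own, not part of either standard):

```python
# ISO-2022-JP marks the switch into JIS X 0208 with an escape
# sequence - effectively a tag saying "Japanese charset follows" -
# while a plain Unicode encoding records only the unified code
# point, with no hint of the source language.

text = "漢"  # a Han character used in both Chinese and Japanese text

jis = text.encode("iso2022_jp")
uni = text.encode("utf-16-be")

assert jis.startswith(b"\x1b$B")   # ESC $ B: charset designation ("tag")
assert b"\x1b" not in uni          # no such tag in the Unicode encoding
```

The escape sequence is exactly the kind of stateful in-band tag that
makes grepping and stream processing painful, which is the point of
the complaint above.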
Systems implementors are not going to get acceptance for their
systems unless either (a) they can impose the system on the users
(not likely - that's one reason why JIS C 6226-1978 was far less
successful as a standard, ie, basically ignored by implementors,
than JIS X 0208/0212: the users wanted _thousands_ of extra
characters), or (b) the people who will use the multilingual
features get the features they want.

Ken> From what I've been able to gather, the controversy about Han
Ken> Unification seems to come from two places.  The more
Ken> significant one is where to draw the line between difference
Ken> of structure and style.  If your name happens to use one of
Ken> those characters which the Unicode people say is the same as
Ken> another character, you may be out of luck.

I think this is probably no longer an issue, and definitely not for
Japanese, because of the "source separation rule": if two characters
are separate in a single source character set, say JIS X 0208, they
are not unified, so that round-trip conversion remains possible.

The style issue, too, is really only an issue for the academics you
mention below.  And maybe the Chinese (especially the Taiwanese),
who use many more different kanji than the Japanese do, and thus are
more likely to end up "outside" the standard.  (According to what I
heard at the M17N conference, the Chinese national standard is
likely to end up with 80,000 characters in it!  Some of them created
specially for the purpose, apparently ;-)

Ken> The other controversy is how the characters should be
Ken> classified once they are inside the Unicode space.  They're
Ken> grouped by structure, but since there are different ways to
Ken> analyze a character, not everyone agrees which character goes
Ken> where.  Much of the Han Unification work was done in China,
Ken> and so the Chinese approach dominates.  Some Japanese
Ken> academics aren't pleased.

Ken> Yes, a translation table is required for Japanese.  Is none
Ken> required for Chinese?
Ken> Is that what people are upset about?  (It's a legitimate beef,
Ken> since the tables can consume 60Kb+.)

No, the Chinese need one, too.  As I recall from Lunde's book, the
compromise is that all Unicode Han are ordered first by radical,
then by total stroke count.  This makes the Chinese unhappy, because
at least some of their national standards do it the opposite way.
The Japanese are also unhappy, since JIS Level 1 is ordered by yomi,
not by structure at all.  The Japanese can't complain too much,
though, since the order of precedence for Han unification is "common
tradition" (I guess in practice that mostly means Chinese), then
Japanese, Chinese, Korean (according to Lunde's book).  This means
that nobody is going to get their way; somebody else's characters
are going to keep cropping up in your ordering.

 >> To fix this, the full Universal Character Set uses 4 bytes, and
 >> allows the Japanese to use JIS code and the Taiwanese GB and so
 >> on.

Ken> Where did you hear this?  I thought that Unicode is the
Ken> equivalent of a UCS-2 Level-3 ISO 10646 implementation.
Ken> UCS-2 means all data is managed in 2-octet or 16-bit words.
Ken> UCS-4 is a 4-octet (32-bit) encoding, but currently (as of
Ken> 1994, at least) UCS-2 and UCS-4 encode characters the same
Ken> way.  Is the UCS-4 set now defined to encode JIS characters?
Ken> Can you point me to a reference about this?

Sorry, no reference.  I heard it at the M17N conference in Tsukuba
at the end of March, in Kenichi Handa's talk on merging Mule with
GNU Emacs.  The proceedings remain unpublished.

UCS-4 does _not_ encode characters the same way, since it is a
31-bit encoding.  On the other hand, as far as I know, all that is
so far implemented and approved for UCS-4 is the Basic Multilingual
Plane, which is the trivial remapping of UCS-2 into the UCS-4 code
space.  "Allows" means "allows"; I should have been more careful not
to imply that it was actually implemented.
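Both points - the non-trivial translation table and the "trivial
remapping" of UCS-2 into UCS-4 - can be seen in a few lines of
modern Python, its codecs standing in for the conversion tables:

```python
# Translation tables: Shift-JIS code 0x889F (the first JIS Level 1
# kanji, 亜) bears no arithmetic relation to its Unicode code point
# U+4E9C, so conversion needs a real table - carried by the codec
# here - but it is lossless, thanks to the source separation rule.
ch = b"\x88\x9f".decode("shift_jis")
assert ord(ch) == 0x4E9C
assert ch.encode("shift_jis") == b"\x88\x9f"    # round trip intact

# UCS-2 vs UCS-4: for a BMP character, the 4-octet form is just the
# 2-octet form zero-extended - the "trivial remapping" of UCS-2
# into the UCS-4 code space.
ucs2 = ch.encode("utf-16-be")   # 2 octets for a BMP character
ucs4 = ch.encode("utf-32-be")   # 4 octets
assert len(ucs2) == 2 and len(ucs4) == 4
assert ucs4 == b"\x00\x00" + ucs2   # high half is all zeros
```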
The point about UCS-4 is that the code space is so big (2^15 times
as big as UCS-2) that, for example, Mule could take some of the
UCS-4 code space and implement exactly the extract-Japanese-from-
mostly-Chinese regular expression parser I mentioned above - _which
it has already implemented_ using Mule's internal CCS system - and,
at the one-time cost of translating from CCS to UCS-4, be ISO-10646
compliant once the Mule internal character set was registered as
part of ISO-10646.  If I understood Handa correctly, this is what
GNU Mule is heading toward.  Since the whole point of the technique
would be baka-hodo inclusiveness ("A foolish consistency is the
hobgoblin of 16-bit minds" - Emerson), you'd only have to do it
once.  Once standardized, with an appropriate procedure for
extension to new languages, everybody could use Mule codes for this
kind of purpose.

 >> ISO8859-1 doesn't require translation tables, because it's
 >> mapped into Unicode with the high byte = 00.  But other
 >> ISO-8859-* sets will require some translation to avoid
 >> redundancy.

Ken> The map of characters [0-127] from ASCII -> Unicode -> SJIS is
Ken> not an identity.  Is this what you are talking about?

Nope, I'm talking about Hebrew.  Translation is still trivial as far
as I know (is it ASCII?  yes -> print it; no -> += Hebrew_offset and
print it), but it needs to be done for backward compatibility.  Or
Vietnamese, which uses 223 (or so) of the 256 possible 8-bit code
points - tight, when you consider that 33 of them are control
characters (0-31 and 127).  That is no longer a simple "map to a new
GR table" translation, as it is for the ISO-8859 family.

Ken> Unicode is UCS-2 Level-3.  "Level 3" means that characters
Ken> may be combined without restriction, so it is wrong to assume
Ken> that all characters are expressed in 16 bits.  The "ch" and
Ken> "ll" characters in Spanish, for example, are considered
Ken> single characters of 4 octets.

You're kidding.  That's absolutely farcical, if true.
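For concreteness, the Hebrew translation sketched above (ASCII
passes through; everything else gets a fixed offset) can be written
out.  A minimal sketch in modern Python, with the offset derived
from the ISO-8859-8 and Unicode layouts and checked against Python's
own codec:

```python
# ISO-8859-8 puts the Hebrew letters alef..tav at 0xE0-0xFA, and
# Unicode puts them at U+05D0-U+05EA, so the non-ASCII branch of
# the translation is a single fixed offset.

HEBREW_OFFSET = 0x05D0 - 0xE0   # = 0x04F0

def iso8859_8_to_unicode(data: bytes) -> str:
    out = []
    for b in data:
        if b < 0x80:                 # ASCII: pass it through
            out.append(chr(b))
        elif 0xE0 <= b <= 0xFA:      # Hebrew letter block: add offset
            out.append(chr(b + HEBREW_OFFSET))
        else:
            raise ValueError(f"byte {b:#x} needs the full table")
    return "".join(out)

sample = b"shalom: \xf9\xec\xe5\xed"
assert iso8859_8_to_unicode(sample) == sample.decode("iso-8859-8")
```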
They should be considered (and represented as) single 2-octet
characters, displayed at double width.  For goodness' sake, that's
basically what the thousands of precomposed Hangul glyphs are!
Can't they spare 2 or 3 code points for the Spanish?  The point is
that display routines have to be complicated with all sorts of
special processing anyway - think about hyphenation, proportional
spacing, sizes, faces, colors, etc.  This example is trivial with
proportional fonts, and even assuming a monospace font you only need
a peephole filter to catch the few characters that require two
glyphs for printing.  _Text in RAM_ should be uniform, to facilitate
grepping and stream processing in general.

>>>>> "Jim" == Jim Breen <jwb@example.com> writes:

Jim> The *real* reason for translation tables is the lack of
Jim> Unicode-ordered fonts.  Eventually you'll get a set of fonts
Jim> for the ~21,000 kanji/hanzi/..., with probably
Jim> Japanese/Chinese/Korean/Vietnamese flavours of glyph designs.

E ... ven ... tually, yes.  In the short term, unless automatic
design improves vastly over "Multiple Master" technology (ie, by
building up glyphs from radicals and other components, instead of
interpolating between two hand-crafted fonts), I don't think so.
That would require Morisawa to design lots of pretty glyphs for
characters most Japanese don't know exist.  No demand, no supply.

At least for Postscript-compatible font encodings (does Level 2
require CID/CMap fonts?), what is most likely, IMHO, is that the
existing font-substitution capability will be refined so that a font
becomes a set of glyph libraries, each library internally indexed by
arbitrary character IDs (CIDs - which will most likely, for
historical reasons, be ordered according to the national standards,
eg, kuten), plus a generalized character map (CMap - currently I
don't think the standard permits what I'm about to propose) which
maps a given character encoding (eg, Unicode, SJIS, EUC) to (library
ID, CID) pairs.  These are then looked up as at present.

This would have an XFontList-like user interface, so that Japanese
users of Unicode would use generic "Unicode -> JIS 208" and
"Unicode -> JIS 212" CMaps, backed up by a secondary "Unicode ->
Big 5" CMap, finally backed up by a default "Unicode -> Unicode"
CMap, corresponding to fonts like Utsukushii-JISX0208,
Mama-JISX0212, Mou-ii-BIG5, and KimochiWarui-UCS2.  Since these
tables are generic, they don't need to be recreated for every new
font.  This provides backward compatibility with old fonts and
programs at the cost of supplying CMaps (the lookup itself would be
built into the font engine).  It would also let you, say, buy
hand-tuned JIS X 0208 fonts at fixed sizes and use a scalable JIS X
0212 font for desktop publishing.

This also constitutes a partial solution to the "gaiji problem", as
well as a complete solution to the "printing from Netscape problem"
(I did the math wrong in my message about Netscape; Linux Netscape
outputs !@#$% Shift-JIS after all!!).  All you need is the SJIS ->
JIS CMap if you have the standard Wadalab JIS-encoded fonts.  The
"gaiji problem" also requires some way to arrange for on-demand
distribution of gaiji glyph libraries, but the encoding problem is
reduced to a CMap to a nearly universal standard, plus a secondary
CMap to the gaiji library.
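The CMap-chain lookup described above amounts to only a few lines of
logic.  A toy sketch in modern Python, with the library IDs and code
values invented for illustration (the Big 5 value in particular is
made up):

```python
# A chain of generic CMaps, each mapping a character code to a
# (glyph library, CID) pair, tried in order until one succeeds -
# the XFontList-like fallback described above.

def lookup(code, cmap_chain):
    """Return the first (library_id, cid) any CMap in the chain offers."""
    for cmap in cmap_chain:
        hit = cmap.get(code)
        if hit is not None:
            return hit
    return ("default", code)   # the "Unicode -> Unicode" catch-all

# Hypothetical CMap fragments: the JIS X 0208 map knows U+4E9C
# (亜, kuten 16-01 = JIS 0x3021); Big 5 backs it up.
unicode_to_jis208 = {0x4E9C: ("Utsukushii-JISX0208", 0x3021)}
unicode_to_big5   = {0x4E9C: ("Mou-ii-BIG5", 0xA8C8)}  # value invented

chain = [unicode_to_jis208, unicode_to_big5]
assert lookup(0x4E9C, chain) == ("Utsukushii-JISX0208", 0x3021)
assert lookup(0x0041, chain) == ("default", 0x0041)  # falls through
```

Because the chain is data, the same generic tables serve every font,
which is the whole economy of the proposal.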
I know in principle how to implement this as a Type 0 composite font
with Type 4 descendants, even if the current definition of CMap
doesn't implement the library ID.  Unfortunately, I'm not a
Postscript programmer, so somebody else will surely do it first. :-)

--
Stephen J. Turnbull
Institute of Policy and Planning Sciences        Yaseppochi-Gumi
University of Tsukuba
http://turnbull.sk.tsukuba.ac.jp/
Tel: +81 (298) 53-5091; Fax: 55-3849
turnbull@example.com
- References:
- Re: tlug: unicode
- From: Ken Schwarz <kls@example.com>