Re: tlug: unicode
- To: tlug@example.com
- Subject: Re: tlug: unicode
- From: "Stephen J. Turnbull" <turnbull@example.com>
- Date: Tue, 27 May 1997 15:22:11 +0900
- In-reply-to: Your message of "Mon, 26 May 1997 20:56:54 -0400." <199705270056.UAA27649@example.com>
- Reply-To: tlug@example.com
- Sender: owner-tlug
----------------------------------------------------------------
tlug note from "Stephen J. Turnbull" <turnbull@example.com>
----------------------------------------------------------------

>>>>> "Ken" == Ken Schwarz <kls@example.com> writes:

Ken> Steve Turnbull wrote:

 >> Unicode coerces them all into the same character set, at the
 >> cost of forcing a single font.  So Japanese Unicode fonts will
 >> look like Japanese even when displaying Chinese.  Readable (if
 >> you can read them), but not pretty (according to natives).

Ken> A character code in the Unicode space is assigned only to
Ken> characters of distinct meaning or shape.  Differences

And the "source separation" rule.

Ken> of style are typeface differences.

True.  It is important to users, however; otherwise Adobe wouldn't
be in business.

Ken> If you want Chinese and Japanese, you still need two fonts,
Ken> but only one symbol set map.

I don't think so.  For example, suppose I'm grepping for all the
Japanese words in a Chinese-language nihongo textbook.  Note that,
properly done, that textbook probably uses separate Chinese and
Japanese fonts for the same character, depending on which language
it occurs in.  Given the stylistic difference mentioned above, the
human eye can pick these things out immediately.  Given a 31-bit
code space, a UCS-4 grep can too.

There are other ways to do this, of course.  For example, you can
put in language tags (escape sequences).  But then Unicode, for this
purpose, looks just like ISO-2022.  Excuse me for not being
thrilled :-)

This kind of multilingual issue is not a huge deal for most people,
of course.  But go a little farther: for most people, Shift-JIS does
just fine.  I think we've just reached Jim Breen's limit of
tolerance.  No JIS X 0212. :-)

Unicode basically matters in really multilingual environments (like
language textbooks and comparative philology) and for systems
implementors.
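The difference between tagging and unification shows up directly in
the byte stream.  A minimal sketch in modern Python, whose
iso2022_jp codec stands in for a language-tagging scheme (the codec
names are Python's own, not part of either standard):

```python
# ISO-2022-JP marks the switch into JIS X 0208 with an escape
# sequence - effectively a tag saying "Japanese charset follows" -
# while a plain Unicode encoding records only the unified code
# point, with no hint of the source language.

text = "漢"  # a Han character used in both Chinese and Japanese text

jis = text.encode("iso2022_jp")
uni = text.encode("utf-16-be")

assert jis.startswith(b"\x1b$B")   # ESC $ B: charset designation ("tag")
assert b"\x1b" not in uni          # no such tag in the Unicode encoding
```

The escape sequence is exactly the kind of stateful in-band tag that
makes grepping and stream processing painful, which is the point of
the complaint above.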
Systems implementors are not going to get acceptance for their
systems unless either (a) they can impose the system on the users
(not likely - that's one reason why JIS C 6226-1978 was far less
successful as a standard, ie, basically ignored by implementors,
than JIS X 0208/0212: the users wanted _thousands_ of extra
characters), or (b) the people who will use the multilingual
features get the features they want.

Ken> From what I've been able to gather, the controversy about Han
Ken> Unification seems to come from two places.  The more
Ken> significant one is where to draw the line between difference
Ken> of structure and style.  If your name happens to use one of
Ken> those characters which the Unicode people say is the same as
Ken> another character, you may be out of luck.

I think this is probably no longer an issue, and definitely not for
Japanese, because of the "source separation rule": if two characters
are separate in a single source character set, say JIS X 0208, they
are not unified, so that round-trip conversion remains possible.

The style issue, too, is really only an issue for the academics you
mention below.  And maybe the Chinese (especially the Taiwanese),
who use many more different kanji than the Japanese do, and thus are
more likely to end up "outside" the standard.  (According to what I
heard at the M17N conference, the Chinese national standard is
likely to end up with 80,000 characters in it!  Some of them created
specially for the purpose, apparently ;-)

Ken> The other controversy is how the characters should be
Ken> classified once they are inside the Unicode space.  They're
Ken> grouped by structure, but since there are different ways to
Ken> analyze a character, not everyone agrees which character goes
Ken> where.  Much of the Han Unification work was done in China,
Ken> and so the Chinese approach dominates.  Some Japanese
Ken> academics aren't pleased.

Ken> Yes, a translation table is required for Japanese.  Is none
Ken> required for Chinese?
Ken> Is that what people are upset about?  (It's a legitimate beef,
Ken> since the tables can consume 60Kb+.)

No, the Chinese need one, too.  As I recall from Lunde's book, the
compromise is that all Unicode Han are ordered first by radical,
then by total stroke count.  This makes the Chinese unhappy, because
at least some of their national standards do it the opposite way.
The Japanese are also unhappy, since JIS Level 1 is ordered by yomi,
not by structure at all.  The Japanese can't complain too much,
though, since the order of precedence for Han unification is "common
tradition" (I guess in practice that mostly means Chinese), then
Japanese, Chinese, Korean (according to Lunde's book).  This means
that nobody is going to get their way; somebody else's characters
are going to keep cropping up in your ordering.

 >> To fix this, the full Universal Character Set uses 4 bytes, and
 >> allows the Japanese to use JIS code and the Taiwanese GB and so
 >> on.

Ken> Where did you hear this?  I thought that Unicode is the
Ken> equivalent of a UCS-2 Level-3 ISO 10646 implementation.
Ken> UCS-2 means all data is managed in 2-octet or 16-bit words.
Ken> UCS-4 is a 4-octet (32-bit) encoding, but currently (as of
Ken> 1994, at least) UCS-2 and UCS-4 encode characters the same
Ken> way.  Is the UCS-4 set now defined to encode JIS characters?
Ken> Can you point me to a reference about this?

Sorry, no reference.  I heard it at the M17N conference in Tsukuba
at the end of March, in Kenichi Handa's talk on merging Mule with
GNU Emacs.  The proceedings remain unpublished.

UCS-4 does _not_ encode characters the same way, since it is a
31-bit encoding.  On the other hand, as far as I know, all that is
so far implemented and approved for UCS-4 is the Basic Multilingual
Plane, which is the trivial remapping of UCS-2 into the UCS-4 code
space.  "Allows" means "allows"; I should have been more careful not
to imply that it was actually implemented.
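Both points - the non-trivial translation table and the "trivial
remapping" of UCS-2 into UCS-4 - can be seen in a few lines of
modern Python, its codecs standing in for the conversion tables:

```python
# Translation tables: Shift-JIS code 0x889F (the first JIS Level 1
# kanji, 亜) bears no arithmetic relation to its Unicode code point
# U+4E9C, so conversion needs a real table - carried by the codec
# here - but it is lossless, thanks to the source separation rule.
ch = b"\x88\x9f".decode("shift_jis")
assert ord(ch) == 0x4E9C
assert ch.encode("shift_jis") == b"\x88\x9f"    # round trip intact

# UCS-2 vs UCS-4: for a BMP character, the 4-octet form is just the
# 2-octet form zero-extended - the "trivial remapping" of UCS-2
# into the UCS-4 code space.
ucs2 = ch.encode("utf-16-be")   # 2 octets for a BMP character
ucs4 = ch.encode("utf-32-be")   # 4 octets
assert len(ucs2) == 2 and len(ucs4) == 4
assert ucs4 == b"\x00\x00" + ucs2   # high half is all zeros
```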
The point about UCS-4 is that the code space is so big (2^15 times
as big as UCS-2) that, for example, Mule could take some of the
UCS-4 code space and implement exactly the extract-Japanese-from-
mostly-Chinese regular expression parser I mentioned above - _which
it has already implemented_ using Mule's internal CCS system - and,
at the one-time cost of translating from CCS to UCS-4, be ISO-10646
compliant once the Mule internal character set was registered as
part of ISO-10646.  If I understood Handa correctly, this is what
GNU Mule is heading toward.  Since the whole point of the technique
would be baka-hodo inclusiveness ("A foolish consistency is the
hobgoblin of 16-bit minds" - Emerson), you'd only have to do it
once.  Once standardized, with an appropriate procedure for
extension to new languages, everybody could use Mule codes for this
kind of purpose.

 >> ISO8859-1 doesn't require translation tables, because it's
 >> mapped into Unicode with the high byte = 00.  But other
 >> ISO-8859-* sets will require some translation to avoid
 >> redundancy.

Ken> The map of characters [0-127] from ASCII -> Unicode -> SJIS is
Ken> not an identity.  Is this what you are talking about?

Nope, I'm talking about Hebrew.  Translation is still trivial as far
as I know (is it ASCII?  yes -> print it; no -> += Hebrew_offset and
print it), but it needs to be done for backward compatibility.  Or
Vietnamese, which uses 223 (or so) of the 256 possible 8-bit code
points - tight, when you consider that 33 of them are control
characters (0-31 and 127).  That is no longer a simple "map to a new
GR table" translation, as it is for the ISO-8859 family.

Ken> Unicode is UCS-2 Level-3.  "Level 3" means that characters
Ken> may be combined without restriction, so it is wrong to assume
Ken> that all characters are expressed in 16 bits.  The "ch" and
Ken> "ll" characters in Spanish, for example, are considered
Ken> single characters of 4 octets.

You're kidding.  That's absolutely farcical, if true.
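For concreteness, the Hebrew translation sketched above (ASCII
passes through; everything else gets a fixed offset) can be written
out.  A minimal sketch in modern Python, with the offset derived
from the ISO-8859-8 and Unicode layouts and checked against Python's
own codec:

```python
# ISO-8859-8 puts the Hebrew letters alef..tav at 0xE0-0xFA, and
# Unicode puts them at U+05D0-U+05EA, so the non-ASCII branch of
# the translation is a single fixed offset.

HEBREW_OFFSET = 0x05D0 - 0xE0   # = 0x04F0

def iso8859_8_to_unicode(data: bytes) -> str:
    out = []
    for b in data:
        if b < 0x80:                 # ASCII: pass it through
            out.append(chr(b))
        elif 0xE0 <= b <= 0xFA:      # Hebrew letter block: add offset
            out.append(chr(b + HEBREW_OFFSET))
        else:
            raise ValueError(f"byte {b:#x} needs the full table")
    return "".join(out)

sample = b"shalom: \xf9\xec\xe5\xed"
assert iso8859_8_to_unicode(sample) == sample.decode("iso-8859-8")
```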
They should be considered (and represented as) single 2-octet
characters, displayed at double width.  For goodness' sake, that's
basically what the thousands of precomposed Hangul glyphs are!
Can't they spare 2 or 3 code points for the Spanish?  The point is
that display routines have to be complicated with all sorts of
special processing anyway - think about hyphenation, proportional
spacing, sizes, faces, colors, etc.  This example is trivial with
proportional fonts, and even assuming a monospace font you only need
a peephole filter to catch the few characters that require two
glyphs for printing.  _Text in RAM_ should be uniform, to facilitate
grepping and stream processing in general.

>>>>> "Jim" == Jim Breen <jwb@example.com> writes:

Jim> The *real* reason for translation tables is the lack of
Jim> Unicode-ordered fonts.  Eventually you'll get a set of fonts
Jim> for the ~21,000 kanji/hanzi/..., with probably
Jim> Japanese/Chinese/Korean/Vietnamese flavours of glyph designs.

E ... ven ... tually, yes.  In the short term, unless automatic
design improves vastly over "Multiple Master" technology (ie, by
building up glyphs from radicals and other components, instead of
interpolating between two hand-crafted fonts), I don't think so.
That would require Morisawa to design lots of pretty glyphs for
characters most Japanese don't know exist.  No demand, no supply.

At least for Postscript-compatible font encodings (does Level 2
require CID/CMap fonts?), what is most likely, IMHO, is that the
existing font-substitution capability will be refined so that a font
becomes a set of glyph libraries, each library internally indexed by
arbitrary character IDs (CIDs - which will most likely, for
historical reasons, be ordered according to the national standards,
eg, kuten), plus a generalized character map (CMap - currently I
don't think the standard permits what I'm about to propose) which
maps a given character encoding (eg, Unicode, SJIS, EUC) to (library
ID, CID) pairs.  These are then looked up as at present.

This would have an XFontList-like user interface, so that Japanese
users of Unicode would use generic "Unicode -> JIS 208" and
"Unicode -> JIS 212" CMaps, backed up by a secondary "Unicode ->
Big 5" CMap, finally backed up by a default "Unicode -> Unicode"
CMap, corresponding to fonts like Utsukushii-JISX0208,
Mama-JISX0212, Mou-ii-BIG5, and KimochiWarui-UCS2.  Since these
tables are generic, they don't need to be recreated for every new
font.  This provides backward compatibility with old fonts and
programs at the cost of supplying CMaps (the lookup itself would be
built into the font engine).  It would also let you, say, buy
hand-tuned JIS X 0208 fonts at fixed sizes and use a scalable JIS X
0212 font for desktop publishing.

This also constitutes a partial solution to the "gaiji problem", as
well as a complete solution to the "printing from Netscape problem"
(I did the math wrong in my message about Netscape; Linux Netscape
outputs !@#$% Shift-JIS after all!!).  All you need is the SJIS ->
JIS CMap if you have the standard Wadalab JIS-encoded fonts.  The
"gaiji problem" also requires some way to arrange for on-demand
distribution of gaiji glyph libraries, but the encoding problem is
reduced to a CMap to a nearly universal standard, plus a secondary
CMap to the gaiji library.
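The CMap-chain lookup described above amounts to only a few lines of
logic.  A toy sketch in modern Python, with the library IDs and code
values invented for illustration (the Big 5 value in particular is
made up):

```python
# A chain of generic CMaps, each mapping a character code to a
# (glyph library, CID) pair, tried in order until one succeeds -
# the XFontList-like fallback described above.

def lookup(code, cmap_chain):
    """Return the first (library_id, cid) any CMap in the chain offers."""
    for cmap in cmap_chain:
        hit = cmap.get(code)
        if hit is not None:
            return hit
    return ("default", code)   # the "Unicode -> Unicode" catch-all

# Hypothetical CMap fragments: the JIS X 0208 map knows U+4E9C
# (亜, kuten 16-01 = JIS 0x3021); Big 5 backs it up.
unicode_to_jis208 = {0x4E9C: ("Utsukushii-JISX0208", 0x3021)}
unicode_to_big5   = {0x4E9C: ("Mou-ii-BIG5", 0xA8C8)}  # value invented

chain = [unicode_to_jis208, unicode_to_big5]
assert lookup(0x4E9C, chain) == ("Utsukushii-JISX0208", 0x3021)
assert lookup(0x0041, chain) == ("default", 0x0041)  # falls through
```

Because the chain is data, the same generic tables serve every font,
which is the whole economy of the proposal.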
I know in principle how to implement this as a Type 0 composite font
with Type 4 descendants, even if the current definition of CMap
doesn't implement the library ID.  Unfortunately, I'm not a
Postscript programmer, so somebody else will surely do it first. :-)

--
Stephen J. Turnbull
Institute of Policy and Planning Sciences        Yaseppochi-Gumi
University of Tsukuba
http://turnbull.sk.tsukuba.ac.jp/
Tel: +81 (298) 53-5091; Fax: 55-3849
turnbull@example.com
- References:
- Re: tlug: unicode
- From: Ken Schwarz <kls@example.com>