- To: tlug@example.com
- Subject: Re: tlug: unicode
- From: Ken Schwarz <kls@example.com>
- Date: Mon, 26 May 1997 20:56:54 -0400 (EDT)
- Cc: tlug@example.com
- In-Reply-To: <m0wVxVe-00005eC@example.com>
- Reply-To: tlug@example.com
- Sender: owner-tlug
-------------------------------------------------------- tlug note from Ken Schwarz <kls@example.com> --------------------------------------------------------

Steve Turnbull wrote:

> It's complicated, like most of this stuff. For one thing, there's
> ISO-10646 and there's Unicode. Unicode is a subset of full ISO-10646
> which fits into 16bits and covers all of the world's character sets
> more or less, at the cost of getting glyphs wrong. Ie, you probably
> are aware that Chinese Chinese characters look different from Japanese
> Chinese characters, which are different again from Korean Chinese
> characters, even when they're the same. Unicode coerces them all into
> the same character set, at the cost of forcing a single font. So
> Japanese Unicode fonts will look like Japanese even when displaying
> Chinese. Readable (if you can read them), but not pretty (according
> to natives).

I thought that the controversy about Unicode was a much narrower issue.
The "Han Unification" process was supposed to organize characters on
three axes: meaning, shape, and style. A character code in the Unicode
space is assigned only to characters of distinct meaning or shape;
differences of style are typeface differences. This is analogous to the
difference between Times Roman and Courier: both fonts contain the
character "lowercase a", but the glyphs which represent that character
are completely different.

What Unicode is supposed to give you is a universal character
assignment scheme, not a universal font. If you want Chinese and
Japanese, you still need two fonts, but only one symbol set map. You
shouldn't need to switch fonts if you are using one language, though.
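Here's a rough sketch of what I mean (untested, and the font file names
are invented): the renderer picks the font from the language of the
text run, not from the character code, so the same code point can come
out in Japanese or Chinese style:

    #include <stdio.h>
    #include <string.h>

    typedef unsigned short ucs2_t;   /* one 16-bit Unicode code unit */

    /* One symbol set map, but still one font per language. */
    static const char *font_for_language(const char *lang)
    {
        if (strcmp(lang, "ja") == 0) return "mincho.ttf";  /* invented name */
        if (strcmp(lang, "zh") == 0) return "song.ttf";    /* invented name */
        return "default.ttf";
    }

    int main(void)
    {
        /* U+9AA8 is a unified character whose preferred glyph differs
           between Japanese and Chinese typography. */
        ucs2_t bone = 0x9AA8;

        printf("U+%04X in Japanese text -> %s\n", bone, font_for_language("ja"));
        printf("U+%04X in Chinese text  -> %s\n", bone, font_for_language("zh"));
        return 0;
    }

The point is just that the language tag travels with the text while the
character code stays the same.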
It's true that Unicode omits a range of SJIS characters found only in
Microsoft Windows. Some of them are useful and fairly popular (the
circled roman numerals, for example), but the Unicode committees did
not recognize them as legitimate characters, and so they have no
Unicode mappings. These characters are deprecated everywhere, so we
should see less and less of them as time goes on. (I heard that Mac
users refer to these characters as "goma dofu" since they are just
filled blocks on the Mac screen. If you think that the SJIS
"pseudo-character" problem is bad, try Korean: I heard that there are
hundreds of codes which are really graphic characters (hearts,
diamonds, etc.).)

From what I've been able to gather, the controversy about Han
Unification seems to come from two places. The more significant one is
where to draw the line between a difference of structure and a
difference of style. If your name happens to use one of those
characters which the Unicode people say is the same as another
character, you may be out of luck: John Smyth becomes John Smith. One
explanation I read was that this is a result of different editions
(1983 and 1990) of the Japanese character standards being used as the
basis of SJIS and Unicode, but I haven't been able to find any specific
examples of this problem that I can check out myself.

The other controversy is how the characters should be classified once
they are inside the Unicode space. They're grouped by structure, but
since there are different ways to analyze a character, not everyone
agrees which character goes where. Much of the Han Unification work was
done in China, and so the Chinese approach dominates. Some Japanese
academics aren't pleased.

> To fix this, the full Universal Character Set uses 4 bytes, and allows
> the Japanese to use JIS code and the Taiwanese GB and so on.

Where did you hear this? I thought that Unicode is the equivalent of a
UCS-2 Level-3 ISO 10646 implementation. UCS-2 means all data is managed
in 2-octet (16-bit) words; UCS-4 is a 4-octet (32-bit) encoding, but
currently (as of 1994, at least) UCS-2 and UCS-4 encode characters the
same way. Is the UCS-4 set now defined to encode JIS characters? Can
you point me to a reference about this?

> 16-bit Unicode requires translation tables, because (a) the various
> languages don't even agree on the "basic 1000" and (b) they don't
> order the ones they do agree on the same (eg, JIS orders level 1 kanji
> by yomi but Chinese orders all hanzi by radical and stroke count).

Yes, a translation table is required for Japanese. Is none required for
Chinese? Is that what people are upset about? (It's a legitimate beef,
since the tables can consume 60Kb+.)

> ISO8859-1 doesn't require translation tables, because it's mapped into
> Unicode with the high byte = 00. But other ISO-8859-* sets will
> require some translation to avoid redundancy.

I am aware that the 0-127 range of single-byte SJIS characters is
"JIS-Roman", *not* ASCII, even though it looks much like ASCII.
Specifically, the characters {'\', '~', '|'} are different. The
practical significance is that the map of characters [0-127] from
ASCII->Unicode->SJIS is not an identity. Is this what you are talking
about? (A sketch of this round-trip problem appears below.)

> Unicode-compliant mostly means not trashing 16-bit codes. This is
> mostly a problem for C-like string processing, I believe.

Any string processing that counts on NULL-terminated strings has to be
re-written to work with 16-bit units. You also have to watch out for
endian problems: Unicode is officially big-endian, but Windows, at
least, stores Unicode values in little-endian form internally. Unless,
of course, you're running Java... then Unicode is big-endian, even on
Intel machines.

Unicode is UCS-2 Level-3. "Level 3" means that characters may be
combined without restriction, so it is wrong to assume that all
characters are expressed in 16 bits. The "ch" and "ll" characters in
Spanish, for example, are considered single characters of 4 octets. The
bottom line is that a 16-bit unit is *not* the same as a "Unicode"
character, even though Windows NT and Java seem to treat them that way.
I haven't been able to get much information about this, but it sounds
like a serious gotcha, since "universal" text processing must be
sensitive to it. As far as I can tell, though, this issue has nothing
to do with Japanese or other kanji fonts. But the idea that Unicode
means all characters (in the semantic sense) are 16 bits is wrong.
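On the JIS-Roman point, here is an untested sketch of why the round
trip is not an identity. I'm only sure about 0x5C and 0x7E, so that's
all this toy table handles:

    #include <stdio.h>

    typedef unsigned short ucs2_t;

    /* ASCII maps straight into Unicode: high byte = 00. */
    static ucs2_t ascii_to_ucs2(unsigned char c)
    {
        return c;
    }

    /* JIS-Roman mostly matches ASCII, but not everywhere. */
    static ucs2_t jisroman_to_ucs2(unsigned char c)
    {
        if (c == 0x5C) return 0x00A5;  /* YEN SIGN, not backslash */
        if (c == 0x7E) return 0x203E;  /* OVERLINE, not tilde */
        return c;
    }

    int main(void)
    {
        unsigned char c = 0x5C;

        printf("0x%02X as ASCII     -> U+%04X\n", c, ascii_to_ucs2(c));
        printf("0x%02X as JIS-Roman -> U+%04X\n", c, jisroman_to_ucs2(c));
        return 0;
    }

So the same byte lands on two different Unicode characters depending on
which table you believe, and mapping back does not return you to where
you started.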
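And on the string-processing point, a rough, untested sketch of the two
adjustments mentioned above: length counted in 16-bit units up to a
16-bit zero, and byte order decided by the BOM (U+FEFF) at the front of
a byte stream:

    #include <stddef.h>
    #include <stdio.h>

    typedef unsigned short ucs2_t;

    /* strlen(), but counting 16-bit units up to a 16-bit zero. */
    static size_t ucs2len(const ucs2_t *s)
    {
        size_t n = 0;
        while (s[n] != 0)
            n++;
        return n;
    }

    /* Assemble one 16-bit unit from two bytes, either byte order. */
    static ucs2_t read_unit(const unsigned char *p, int big_endian)
    {
        return big_endian ? (p[0] << 8) | p[1]
                          : (p[1] << 8) | p[0];
    }

    int main(void)
    {
        /* A big-endian byte stream: BOM, then "AB". */
        unsigned char bytes[] = { 0xFE, 0xFF, 0x00, 'A', 0x00, 'B' };
        int big_endian = (bytes[0] == 0xFE && bytes[1] == 0xFF);
        ucs2_t hi[] = { 'h', 'i', 0 };

        printf("stream is %s-endian\n", big_endian ? "big" : "little");
        printf("first unit: U+%04X\n", read_unit(bytes + 2, big_endian));
        printf("ucs2len(\"hi\") = %lu\n", (unsigned long) ucs2len(hi));
        return 0;
    }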
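Finally, on the Level-3 point, a toy (untested) example of why counting
16-bit units is not the same as counting characters once combining
marks are allowed. The combining test below covers only the combining
diacritical marks block, which is enough for the example:

    #include <stdio.h>

    typedef unsigned short ucs2_t;

    /* Crude test: only the combining diacritical marks block. */
    static int is_combining(ucs2_t u)
    {
        return u >= 0x0300 && u <= 0x036F;
    }

    int main(void)
    {
        /* 'e' + COMBINING ACUTE ACCENT: two units, one visible character */
        ucs2_t s[] = { 'e', 0x0301, 0 };
        int i, units = 0, chars = 0;

        for (i = 0; s[i] != 0; i++) {
            units++;
            if (!is_combining(s[i]))
                chars++;
        }
        printf("units = %d, characters = %d\n", units, chars);
        return 0;
    }

Two units, one character; any code that equates the two will miscount.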
Oh, well.

- Ken

- Follow-Ups:
  - Re: tlug: unicode
    - From: "Stephen J. Turnbull" <turnbull@example.com>
- References:
  - Re: tlug: unicode
    - From: "Stephen J. Turnbull" <turnbull@example.com>