
Re: [tlug] "UTF-8 & ISO-2022-JP"



Another bit of text from the (external) general discussion (I won't make a habit of doing this, but I think it's relevant in this case). - Lyle


[LHS] One non-technical observation. I've been asking my Japanese friends about their experiences with mutated e-mail, and nearly all of them say that they still have trouble with it from time to time - although they say they're having fewer problems now than they were before. [LHS]

There are other reasons why unencoded mail breaks. SMTP doesn't guarantee that white space is preserved: usually it survives, but extra space characters can be added or removed in transit. That's why Microsoft jumped on the HTML bandwagon: HTML is oblivious to added or removed white space and newlines, so formatting isn't destroyed. Before HTML, text formatting in e-mail was a headache.

A big problem with Asian encodings is that the multibyte (i.e. pre-Unicode) encodings use shift (escape) sequences, and that's a brittle design, because a shift applies to all subsequent characters until it is undone. If the part of the document containing a shift is lost or garbled, then every subsequent character loses its context and may be mis-decoded.
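
To make that concrete, here is a minimal Python sketch (my own illustration, not from the thread) that encodes mixed English/Japanese text as ISO-2022-JP and then simulates a gateway dropping the shift sequence:

    text = "Hello \u3053\u3093\u306b\u3061\u306f"   # "Hello " + konnichiwa
    raw = text.encode("iso2022_jp")
    print(raw)  # ESC $ B shifts into JIS X 0208; ESC ( B shifts back to ASCII

    # Simulate a gateway mangling the stream: drop the first shift sequence.
    damaged = raw.replace(b"\x1b$B", b"", 1)
    print(damaged.decode("iso2022_jp", errors="replace"))
    # Everything after the lost shift now decodes as ASCII punctuation junk.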

This gets particularly difficult when Japanese text is mixed with, say, European text. The document might start in English and switch to Japanese at some point, and software might not be careful about keeping the text as-is; in English, adding an extra space isn't the end of the world, but in a shifted encoding it can destroy everything that follows. In this sort of situation, software will often misconstrue Japanese as some extended Latin character set.
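
For illustration (again my sketch, not from the thread), here is what that misreading looks like: Shift_JIS bytes decoded as the Windows Latin character set produce the familiar accented-Latin garbage:

    data = "\u65e5\u672c\u8a9e".encode("shift_jis")   # 日本語 ("Japanese")
    print(data.decode("cp1252"))   # -> accented-Latin junk like “ú–{Œê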

[LHS] The fact that there are so bloody many different "standards" for Japanese text is really horrible though! [LHS]

Indeed, but it's essentially a problem with the number of ideograms in the language. The character sets were designed for different size/vocabulary trade-offs, because in the early days it seemed wasteful to reserve character codes for very rarely used symbols. So the popular character sets are limited to save space, and that in turn means there needs to be another, more complete character set for specialized applications. Worse, even when you restrict attention to the most popular characters, two bytes is not enough to represent all the major Chinese-derived writing systems _simultaneously_. So China, Korea, etc. designed their own flavours, which are very close but not identical. The Cyrillic-based character sets show the same proliferation of flavours. Of course, "close but not identical" means that automatic detection of the character set is hard, because a typical byte sequence is perfectly valid in all of the flavours.
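
A quick way to see why detection is guesswork (a minimal sketch using Python's standard codecs, my addition): the same bytes often decode without error under several of the flavours, each time yielding different text:

    data = "\u65e5\u672c\u8a9e".encode("euc_jp")   # 日本語 encoded as EUC-JP
    for enc in ("euc_jp", "euc_kr", "gbk"):
        try:
            print(f"{enc:8s} -> {data.decode(enc)}")
        except UnicodeDecodeError:
            print(f"{enc:8s} -> (invalid)")
    # Only one of the successful decodings is what the author meant,
    # and nothing in the bytes themselves says which.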

It was all supposed to be fixed with Unicode, but Unicode has problems of its own. Full Chinese has a _lot_ of characters, and then the professional organizations decided they wanted space for their own symbols etc. So UTF-8 as originally designed used up to six bytes for a single character (it has since been capped at four), and with that many symbols the full range is not well supported. At least Unicode doesn't use shift sequences, but the wide forms (UTF-16/UTF-32) use a lot of space and don't play well with the C language, because most characters contain NUL bytes; UTF-8 avoids NULs, but naive byte-oriented code can still split a character in half. So Unicode is often mangled unless it's used with specifically Unicode-aware software.
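
To illustrate the NUL-byte point (my sketch): the wide encodings pad ASCII characters with zero bytes, and a zero byte terminates a C string early, while UTF-8 never produces one:

    s = "A\u3042"   # 'A' followed by hiragana 'a'
    for enc in ("utf-8", "utf-16-le", "utf-32-le"):
        print(f"{enc:10s} {s.encode(enc).hex(' ')}")
    # utf-8      41 e3 81 82                <- no zero bytes anywhere
    # utf-16-le  41 00 42 30                <- the 'A' already contains a NUL
    # utf-32-le  41 00 00 00 42 30 00 00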

Then there was Microsoft, who decided to do their own thing.



