
Re: [tlug] "UTF-8 & ISO-2022-JP"



Another bit of text from the (external) general discussion (I won't make a habit of doing this, but I think it's relevant in this case). - Lyle


[LHS] One non-technical observation. I've been asking my Japanese friends about their experiences with mutated e-mail, and nearly all of them say that they still have trouble with it from time to time - although they say they're having fewer problems now than they were before. [LHS]

There are other reasons why unencoded mail breaks. SMTP doesn't guarantee that white space is preserved: usually it survives, but extra space characters can be added or removed in transit. That's why Microsoft jumped on the HTML bandwagon: HTML is oblivious to added or removed white space and newlines, so formatting isn't destroyed. Before HTML, text formatting in e-mail was a headache.

A big problem with Asian encodings is that the multibyte (i.e. pre-Unicode) encodings use shift (escape) sequences, and that's a brittle design, because a shift applies to all subsequent characters until it is undone. If the part of the document containing a shift is lost or garbled, then every subsequent character loses its context and may be mis-decoded.
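
To make that concrete, here is a minimal Python sketch (my own illustration, not from the thread) that encodes mixed English/Japanese text as ISO-2022-JP and then simulates a gateway dropping the shift sequence:

    text = "Hello \u3053\u3093\u306b\u3061\u306f"   # "Hello " + konnichiwa
    raw = text.encode("iso2022_jp")
    print(raw)  # ESC $ B shifts into JIS X 0208; ESC ( B shifts back to ASCII

    # Simulate a gateway mangling the stream: drop the first shift sequence.
    damaged = raw.replace(b"\x1b$B", b"", 1)
    print(damaged.decode("iso2022_jp", errors="replace"))
    # Everything after the lost shift now decodes as ASCII punctuation junk.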

This gets particularly difficult when Japanese text is mixed with, say, European text. The document might start in English and switch to Japanese at some point, and software might not be careful about keeping the text as-is; in English, adding an extra space isn't the end of the world, but in a shifted encoding it can destroy everything that follows. In this sort of situation, software will often misconstrue Japanese as some extended Latin character set.
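
For illustration (again my sketch, not from the thread), here is what that misreading looks like: Shift_JIS bytes decoded as the Windows Latin character set produce the familiar accented-Latin garbage:

    data = "\u65e5\u672c\u8a9e".encode("shift_jis")   # 日本語 ("Japanese")
    print(data.decode("cp1252"))   # -> accented-Latin junk like “ú–{Œê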

[LHS] The fact that there are so bloody many different "standards" for Japanese text is really horrible though! [LHS]

Indeed, but it's essentially a problem with the number of ideograms in the language. The character sets were designed for different size/vocabulary trade-offs, because in the early days it seemed wasteful to reserve character codes for very rarely used symbols. So the popular character sets are limited to save space, and that in turn means there needs to be another, more complete character set for specialized applications. Worse, even when you restrict attention to the most popular characters, two bytes is not enough to represent all the major Chinese-derived writing systems _simultaneously_. So China, Korea, etc. designed their own flavours, which are very close but not identical. The Cyrillic-based character sets show the same proliferation of flavours. Of course, "close but not identical" means that automatic detection of the character set is hard, because a typical byte sequence is perfectly valid in all of the flavours.
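
A quick way to see why detection is guesswork (a minimal sketch using Python's standard codecs, my addition): the same bytes often decode without error under several of the flavours, each time yielding different text:

    data = "\u65e5\u672c\u8a9e".encode("euc_jp")   # 日本語 encoded as EUC-JP
    for enc in ("euc_jp", "euc_kr", "gbk"):
        try:
            print(f"{enc:8s} -> {data.decode(enc)}")
        except UnicodeDecodeError:
            print(f"{enc:8s} -> (invalid)")
    # Only one of the successful decodings is what the author meant,
    # and nothing in the bytes themselves says which.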

It was all supposed to be fixed with Unicode, but Unicode has problems of its own. Full Chinese has a _lot_ of characters, and then the professional organizations decided they wanted space for their own symbols etc. So UTF-8 as originally designed used up to six bytes for a single character (it has since been capped at four), and with that many symbols the full range is not well supported. At least Unicode doesn't use shift sequences, but the wide forms (UTF-16/UTF-32) use a lot of space and don't play well with the C language, because most characters contain NUL bytes; UTF-8 avoids NULs, but naive byte-oriented code can still split a character in half. So Unicode is often mangled unless it's used with specifically Unicode-aware software.
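
To illustrate the NUL-byte point (my sketch): the wide encodings pad ASCII characters with zero bytes, and a zero byte terminates a C string early, while UTF-8 never produces one:

    s = "A\u3042"   # 'A' followed by hiragana 'a'
    for enc in ("utf-8", "utf-16-le", "utf-32-le"):
        print(f"{enc:10s} {s.encode(enc).hex(' ')}")
    # utf-8      41 e3 81 82                <- no zero bytes anywhere
    # utf-16-le  41 00 42 30                <- the 'A' already contains a NUL
    # utf-32-le  41 00 00 00 42 30 00 00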

Then there was Microsoft, who decided to do their own thing.



