Re: tlug: Re: pine, mutt, Chinese, Japanese

To: tlug@example.com
Subject: Re: tlug: Re: pine, mutt, Chinese, Japanese
From: "Stephen J. Turnbull" <turnbull@example.com>
Date: Wed, 4 Aug 1999 14:26:13 +0900 (JST)
Content-Transfer-Encoding: 7bit
Content-Type: text/plain; charset=us-ascii
In-Reply-To: <Pine.LNX.3.95.990725134553.263B-100000@example.com>
References: <Pine.LNX.4.05.9907250938270.4720-100000@example.com> <Pine.LNX.3.95.990725134553.263B-100000@example.com>
Reply-To: tlug@example.com
Sender: owner-tlug@example.com

>>>>> "jdb" == J David Beutel <jdb@example.com> writes:

    jdb> On Sun, 25 Jul 1999, Tony Laszlo wrote:
    >> Setting Unicode aside for the moment, is there any _single_
    >> Japanese encoding that has been suggested to take the place of
    >> euc, jis and sjis?  Seems like a needless hassle having to
    >> convert between two or three and make sure that software can
    >> display all of the three.

Talk to Microsoft and Apple about SJIS.  Don't bet on getting a
sensible reply; Microsoft uses Unicode internally but doesn't ship any
software that handles that format (well, Word 2000 is supposed to),
and makes SJIS the default instead.

JIS is used almost exclusively in messaging applications (mail and
netnews), in a rather usable variant of ISO-2022.  Due to rules in
RFC 822, it is unlikely that 7-bit encodings in mail headers will go
away soon.  So you'll keep seeing `=?iso-2022-jp?B?...' in raw mail
headers for a while.  Then you'll start seeing `=?utf-7?B?' or so....

EUC-JP is a rather efficient and simple encoding for Japanese only, so
it makes sense to use it for file systems.  (Although if you compress
the files, the advantage over ISO-2022-JP for most files will almost
completely go away.  Almost all the kanji-in/kanji-out escape
sequences sit next to a newline, so the compressor will treat them as
part of the newline sequence, and everything else is a 1-1 map.)
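
To make the "1-1 map" and the header business concrete, here is a
minimal Python sketch.  It assumes Python's standard `iso2022_jp' and
`euc_jp' codecs and the standard `email.header' module; the sample
string is mine, written with escapes so the code stays 7-bit.

```python
from email.header import Header, decode_header

# "nihongo" (the word for the Japanese language), as three kanji.
text = "\u65e5\u672c\u8a9e"

jis = text.encode("iso2022_jp")   # 7-bit, wrapped in escape sequences
euc = text.encode("euc_jp")       # 8-bit, no escape sequences

KANJI_IN = b"\x1b$B"    # ESC $ B: switch to JIS X 0208
KANJI_OUT = b"\x1b(B"   # ESC ( B: switch back to ASCII

# Strip kanji-in/kanji-out to get the bare two-byte JIS codes.
payload = jis[len(KANJI_IN):-len(KANJI_OUT)]

# EUC-JP is the same code with the high bit set on every byte.
euc_from_jis = bytes(b | 0x80 for b in payload)

# The same text as an encoded-word for a raw mail header.
encoded_header = Header(text, "iso-2022-jp").encode()
raw, charset = decode_header(encoded_header)[0]

print(len(jis), len(euc))          # 12 6
print(euc_from_jis == euc)         # True
print(encoded_header)
print(raw.decode(charset) == text) # True
```

For a three-character string the escapes double the size, but in
running text they only show up where the script changes, which is the
point about compression above.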

The 7/8-bit (JIS/EUC) thing affects Chinese and Korean, too, I believe.

    jdb> Unicode is exactly it.  Why set it aside?  I doubt there is
    jdb> any other.

Well, no, Unicode is not exactly it.  UCS (ISO-10646) is.  Unicode is
just a 99.44% accurate approximation.  ;-)

Unicode is going to require a certain amount of new infrastructure.
The problem is that Unicode code order does not preserve collating
orders and the like for anything except American English (and maybe
British English).  So sorts are going to have to be table-driven.
This is actually a good thing; JIS order isn't really all that
interesting.  Being table-driven would make it very easy to specify a
sort like "kyouiku kanji by year first, then jouyou kanji, then other
Japanese kanji, then non-Japanese kanji, then other characters" just
by writing appropriate tables.  (Not to mention "unifying" zenkaku and
hankaku romaji, etc.)
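
A rough Python sketch of such a table-driven sort follows.  The RANK
table is a toy of my own with three kanji in it, and the grade numbers
are placeholders, not checked against the official kyouiku list; the
zenkaku/hankaku unification leans on NFKC normalization from the
standard `unicodedata' module, which is one way to do it, not the only
one.

```python
from functools import cmp_to_key
import unicodedata

# Hypothetical table: lower rank sorts first (think "school year").
RANK = {
    "\u5c71": 1,  # "yama" (mountain)
    "\u5ddd": 1,  # "kawa" (river)
    "\u6d77": 2,  # "umi" (sea)
}
OTHER = 9  # every character that is not in the table

def unify(s):
    # NFKC folds zenkaku (full-width) romaji onto plain ASCII letters.
    return unicodedata.normalize("NFKC", s)

def char_key(ch):
    # Table rank first, code order only as a tie-breaker.
    return (RANK.get(ch, OTHER), ch)

def compare(a, b):
    # Looks every character up in the table on every comparison.
    ka = [char_key(c) for c in unify(a)]
    kb = [char_key(c) for c in unify(b)]
    return (ka > kb) - (ka < kb)

# "abc", umi, zenkaku "abb", yama
words = ["abc", "\u6d77", "\uff41\uff42\uff42", "\u5c71"]

by_code = sorted(words)                            # raw code order
by_table = sorted(words, key=cmp_to_key(compare))  # table order
```

In code order the hankaku "abc" comes first and the zenkaku "abb"
last; with the table the kanji come first, by rank, and the two romaji
strings sort next to each other as if they were written the same way.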

But looking characters up in a table on every comparison is very
inefficient.  So a good general-purpose UCS text sorter is going to
need to preprocess the text to be sorted so that characters are in
collation order, not in UCS order.  That's going to take a while to
shake out; there will be lots of reimplementations due to NIH-itis,
most of them buggy, and many developers will be too lazy to bother,
etc.
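
The preprocessing idea, as a minimal Python sketch: translate each
string once into a key whose ordinary comparison order is the
collation order, then sort on the keys.  COLLATION_ORDER is a made-up
six-character table, purely for illustration.

```python
# Hypothetical collation order: three kanji first, then three letters.
COLLATION_ORDER = "\u5c71\u5ddd\u6d77abc"   # yama, kawa, umi, a, b, c

RANK = {ch: i for i, ch in enumerate(COLLATION_ORDER)}
UNKNOWN = len(COLLATION_ORDER)  # anything not in the table sorts last

def collation_key(s):
    # One table lookup per character, done once per string.
    return tuple(RANK.get(ch, UNKNOWN) for ch in s)

words = ["cab", "\u6d77", "abc", "\u5ddd", "\u5c71"]

# Decorate with the precomputed key, sort, then undecorate.
decorated = [(collation_key(w), w) for w in words]
decorated.sort()
result = [w for _, w in decorated]
```

Python's `sorted(words, key=collation_key)' does the same
decorate-sort-undecorate dance internally, so the explicit version is
only there to show where the preprocessing happens.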

And there are gonna be lots of gotchas.  For example, what does
`[a-z]' mean in a regexp?  Well, presumably it changes according to
the language; normally I can't see it including `1', but surely in
es_ES locales it will include enye.  But everybody has their own
favorite flavor of regexp; I bet hardly any language like Perl uses
the standard C library versions, and so on.  More
reimplementations....
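
Python's `re' module is one such homegrown regexp engine, and it makes
a handy illustration: there a range like `[a-z]' is a plain range of
code values, whatever the locale says.  A small sketch (the variable
names are mine):

```python
import re

lower = re.compile(r"[a-z]")
enye = "\u00f1"   # n with tilde

matches_a = lower.fullmatch("a") is not None
matches_digit = lower.fullmatch("1") is not None
matches_enye = lower.fullmatch(enye) is not None

# If you want enye, you have to say so explicitly.
spanish_lower = re.compile("[a-z\u00f1]")
matches_enye_explicit = spanish_lower.fullmatch(enye) is not None

print(matches_a, matches_digit, matches_enye, matches_enye_explicit)
```

So in this flavor `[a-z]' matches `a', but neither `1' nor enye; what
a locale-aware implementation would do is exactly the open question.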

As for "no other", the answer is (according to rumor), unfortunately,
"not yet".  Evidently JIS is working on a unification of JIS X 0208,
JIS X 0212, and JIS X 0213.  Presumably it's mostly going to be a
regularization and slight tweaking of the familiar sets, but no,
they're not planning on going to UCS in any form as a Japanese
national standard any time soon.  Only the US can really do this,
since all pure ASCII documents are already encoded in UTF-8 ;-)
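
The ASCII joke is easy to check.  A minimal Python sketch, assuming the
standard `ascii', `utf-8' and `euc_jp' codecs; the sample strings are
arbitrary:

```python
ascii_text = "Free software rules."

# Pure ASCII text is byte-for-byte identical when encoded as UTF-8.
ascii_bytes = ascii_text.encode("ascii")
same = ascii_bytes == ascii_text.encode("utf-8")
decoded = ascii_bytes.decode("utf-8")

# Japanese text gets no such free ride: EUC-JP and UTF-8 differ.
japanese = "\u65e5\u672c\u8a9e"   # "nihongo"
euc = japanese.encode("euc_jp")
utf8 = japanese.encode("utf-8")
differs = euc != utf8

print(same, decoded == ascii_text)   # True True
print(len(euc), len(utf8), differs)  # 6 9 True
```

Every existing EUC-JP, SJIS or ISO-2022-JP file would have to be
converted, while an ASCII-only archive needs no conversion at all.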

-- 
University of Tsukuba                Tennodai 1-1-1 Tsukuba 305-8573 JAPAN
Institute of Policy and Planning Sciences       Tel/fax: +81 (298) 53-5091
__________________________________________________________________________
__________________________________________________________________________
What are those two straight lines for?  "Free software rules."
-------------------------------------------------------------------
Next Technical Meeting: August 14 (Sat), 13:00  place: Temple Univ.
*** Special guest: Marc Christensen (Salt Lake Linux Users Group)
Next Nomikai: September 20 (Fri), 19:30 Tengu TokyoEkiMae 03-3275-3691
-------------------------------------------------------------------
more info: http://www.tlug.gr.jp        Sponsor: Global Online Japan
