Mailing List Archive


Re: tlug: Need info. about Japanese and Linux



>>>>> "Fredric" == Fredric Fredricson <Fredric.Fredriksson@example.com> writes:

    Fredric> The customer does not know anything about this. He buys a
    Fredric> machine and does not care about these details.

Bingo!  ++Steve.  :-)

    >> There are three codes in Japanese characters. They are JIS,
    >> SJIS and EUC. To convert codes there's a converter called
    >> 'nkf'.

    Fredric> Are these codes 8-bit? My concern is if I can fit it
    Fredric> inside our current system of language-specific text
    Fredric> files.

Well, strictly speaking they are 16-bit.  Someone who has finished
first grade in Japan already has a repertoire of at least 225
characters, and most have about 350.  You can't fit that into 8 bits,
let alone an educated adult's repertoire of about 10,000.

Technically speaking, the most common "native" encoding for Japanese
is "Packed EUC", which is an ISO-2022 conformant 8-bit code with JIS X
0201 Roman alphabet (for your purposes, ASCII +/- 2 or 3 characters)
invoked to GL/G0 and JIS X 0208 invoked to GR/G1.  Normally it uses no
shift sequences, although auxiliary character sets can be designated
to G2 and G3 and reached with single shifts.  It's unlikely you would
need those extra character sets unless you are doing entry of personal
and place names.
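
As a rough illustration of the GL/GR split described above, here is a
minimal sketch using the "euc_jp" codec that ships with Python.  The
sample string is made up for the example, and it assumes Python's codec
is a fair stand-in for what is called Packed EUC here.

```python
# Sketch: look at the byte layout produced by Python's built-in euc_jp codec.
text = "Linux 日本語"               # hypothetical sample: ASCII plus three kanji
encoded = text.encode("euc_jp")

ascii_part = encoded[:6]            # "Linux " stays one byte per character (GL)
kanji_part = encoded[6:]            # each kanji becomes two bytes (GR)

print(" ".join(f"{b:02x}" for b in encoded))
print("ASCII bytes below 0x80:", all(b < 0x80 for b in ascii_part))
print("kanji bytes in 0xA1-0xFE:", all(0xA1 <= b <= 0xFE for b in kanji_part))
```

Because every byte of a two-byte character has the high bit set, a
byte-oriented tool that only looks for ASCII delimiters should not be
confused by the kanji.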

Commonly used in messaging applications like mail and netnews is
ISO-2022-JP, which is an ISO-2022 conformant 7-bit code, using shift
sequences (ESC "$B" to designate and shift JIS X 0208 into G0/GL, and
ESC "(B" to designate and shift ASCII into G0/GL).  This has some
other restrictions which are unimportant for your immediate purpose of
determining compatibility (e.g., G0 is initialized to ASCII, each line
of the data stream must end in ASCII (before the newline), etc.).
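
The escape sequences mentioned above can be seen directly with Python's
"iso2022_jp" codec.  This is only a sketch with a made-up sample string,
not a full check of ISO-2022-JP conformance.

```python
# Sketch: show the designation sequences in an ISO-2022-JP byte stream.
TO_JISX0208 = b"\x1b$B"             # ESC "$B": switch G0 to JIS X 0208
TO_ASCII = b"\x1b(B"                # ESC "(B": switch G0 back to ASCII

text = "日本語 Linux"               # hypothetical sample
encoded = text.encode("iso2022_jp")

print(encoded)
print("7-bit clean:", all(b < 0x80 for b in encoded))
print("starts in JIS X 0208:", encoded.startswith(TO_JISX0208))
print("returns to ASCII:", TO_ASCII in encoded)
```

Every byte stays below 0x80, which is why this form survives 7-bit mail
and news transports.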

Commonly used on MS Windows and Macintosh is "Shift JIS."  Often the
"f" is omitted, to indicate that this code is an 8-bit code that
doesn't comply with anything except Microsoft's whims and will pollute 
any data channel that transmits it.  You have to accept it in general
applications (there are too many MS systems out there), but you should 
never produce it or store it internally.  (MS systems can all handle
both Packed EUC and ISO-2022-JP now, so interchange is not a concern.)
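
One concrete way Shift JIS can upset byte-oriented tools is that the
second byte of a character may fall in the ASCII range.  The sketch
below shows that for one kanji and then normalizes the input to EUC-JP
for storage.  The helper name to_internal is hypothetical, and the
choice of EUC-JP as the internal form is an assumption taken from the
advice above.

```python
# Sketch: accept Shift JIS on input, but store EUC-JP internally.
def to_internal(data: bytes, source_encoding: str = "shift_jis") -> bytes:
    """Decode incoming bytes and re-encode them as EUC-JP (hypothetical helper)."""
    return data.decode(source_encoding).encode("euc_jp")

incoming = "表".encode("shift_jis")   # second byte is 0x5C, the ASCII backslash
has_backslash_byte = 0x5C in incoming

stored = to_internal(incoming)

print("Shift JIS bytes:", incoming.hex())
print("contains 0x5C:", has_backslash_byte)
print("EUC-JP bytes:", stored.hex())
```

After conversion no byte of the kanji lies in the ASCII range, so code
that scans for backslashes or other ASCII delimiters no longer trips
over it.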

Effectively never used is Unicode.  Unicode conforms to ISO 10646, of
course (and adds many further restrictions), but suffers from issues
of user preference (many Japanese personal and place names cannot be
encoded in Unicode) and programming awkwardness (the collating order
of the Japanese national standard JIS X 0208 differs from that of
Unicode).  I doubt that you would have a problem dealing with the
programming issue, since it's already present when using ISO-8859-1,
although you might have to construct or at least improve the necessary
POSIX locale(s).  I haven't looked carefully in some months, but last I
looked the Japanese locales were pretty weakly implemented in glibc,
and certainly few Japanese programs use the POSIX locale model.
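
The ordering difference between JIS X 0208 and Unicode can be seen with
just two kanji.  This sketch assumes that sorting by EUC-JP bytes is an
acceptable proxy for JIS X 0208 code order; it is not a real
locale-aware collation.

```python
# Sketch: the same two kanji sort differently by Unicode code point
# and by JIS X 0208 code (approximated here by their EUC-JP bytes).
kanji = ["一", "亜"]

by_unicode = sorted(kanji)                                   # code point order
by_jis = sorted(kanji, key=lambda ch: ch.encode("euc_jp"))   # JIS X 0208 order

print("Unicode order:", by_unicode)
print("JIS X 0208 order:", by_jis)
```

Any code that relies on raw code order for sorting would therefore give
different results depending on which encoding it works in.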

HTH.

-- 
University of Tsukuba                Tennodai 1-1-1 Tsukuba 305-8573 JAPAN
Institute of Policy and Planning Sciences       Tel/fax: +81 (298) 53-5091
__________________________________________________________________________
__________________________________________________________________________
What are those two straight lines for?  "Free software rules."
----------------------------------------------------------------
Next Nomikai: 20 November, 19:30   Tengu TokyoEkiMae 03-3275-3691
Next Technical Meeting: 12 December, 12:30 HSBC Securities Office
----------------------------------------------------------------
more info: http://tlug.linux.or.jp Sponsors: PHT, HSBC Securities

