Mailing List Archive


Re: [tlug] I hate encodings!



>>>>> "Attila" == Attila Kinali <attila@example.com> writes:

    Attila> On Tue, 29 Aug 2006 12:08:01 +0900
    Attila> "Jeff Madsen" <jeff@example.com> wrote:

    >> Hope that question made sense - you can probably detect my
    >> confusion already!

    Attila> As far as i know there is no such documentation.

Of course there is.  Some of the worst introductory material on
encodings etc. was written by yours truly ;-), so there's an existence
proof for you.  I know there's better material by now, but I don't
know where offhand.

_Linux Journal_, "Alphabet Soup", ca. Mar-Apr 1999 IIRC.
_Professional Linux Programming_, Ch. 28, "Internationalization", Wrox
    Press, 2000.
_Linux Nihongo Kankyo_, O'Reilly Japan, 1999 or so (with Craig Oda,
    Hiroo Yamagata, and Rob Bickel).

There's some stuff at debian.org.

Ken Lunde's "Understanding Japanese Information Processing" (O'Reilly,
often referenced as "UJIP") is excellent but low-level (doesn't
discuss the web at all), now superseded by his "Chinese, Japanese,
Korean, and Vietnamese Information Processing" (also O'Reilly, often
referenced as "CJKV"), which I haven't actually read.  I think they're
both out of print now in English.

For web stuff, you want to find out about content negotiation in
HTTP.  You will need to read the MIME RFCs (2045--2049), some of which
are only really relevant to mail, but I forget which you can omit
offhand.  Apache's documentation on its mechanisms is good but assumes
you know a lot in advance.
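
To make the negotiation idea concrete, here is a minimal sketch in
Python of how a server might pick a charset from a client's
Accept-Charset header.  The function names (parse_accept,
choose_charset) and the sample header are made up for illustration;
this is not how Apache does it internally, just the general q-value
logic, and it skips many corner cases.

```python
def parse_accept(header):
    """Parse an Accept-* header value into (token, q) pairs, best first."""
    items = []
    for part in header.split(","):
        part = part.strip()
        if not part:
            continue
        token, _, params = part.partition(";")
        q = 1.0  # a missing q parameter means full preference
        for param in params.split(";"):
            name, _, value = param.partition("=")
            if name.strip().lower() == "q":
                try:
                    q = float(value)
                except ValueError:
                    q = 0.0  # treat a malformed q as "not acceptable"
        items.append((token.strip().lower(), q))
    # sorted() is stable, so entries with equal q keep header order
    return sorted(items, key=lambda item: item[1], reverse=True)


def choose_charset(header, available):
    """Return the most preferred charset in `available`, or None."""
    prefs = parse_accept(header)
    rejected = {token for token, q in prefs if q <= 0}
    by_lower = {name.lower(): name for name in available}
    for token, q in prefs:
        if q <= 0:
            continue
        if token in by_lower:
            return by_lower[token]
        if token == "*":
            # wildcard: anything the client did not explicitly reject
            for name in available:
                if name.lower() not in rejected:
                    return name
    return None


if __name__ == "__main__":
    header = "shift_jis;q=0.5, utf-8, euc-jp;q=0.8"  # hypothetical request
    print(choose_charset(header, ["EUC-JP", "Shift_JIS"]))
```

With that sample header and a server that can only produce EUC-JP or
Shift_JIS, the sketch picks EUC-JP, since the client rates it above
Shift_JIS and UTF-8 is not on offer.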

The Unicode Consortium site has some good but really technical stuff;
Technical Report #17 (the character encoding model) is worth skimming
to get an idea of the issues.

Dealing with Japanese is a pain in the butt for two reasons.  (1) The
Japanese have 5 major encodings in common use (JIS/ISO-2022-JP,
EUC-JP, Shift-JIS, Unicode UTF-8, and romaji, e.g., in domain names),
and each has many minor variations.  (2) Japanese developers mostly
don't care about anything else yet, so a lot of Japanese sites (even
today) assume that the language is Japanese or US English.  Under that
assumption the various encodings are easy to tell apart automatically,
which means the sites often don't implement charset negotiation.
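
As a rough illustration of both points, the Python sketch below
encodes one short string in the four byte-level encodings and then
guesses the encoding back by trial decoding.  The sample text, the
candidate order, and the helper names are my own assumptions; the
guesser only works because it assumes the input is Japanese or ASCII,
and real detectors need to be much more careful (some byte strings are
valid in more than one of these encodings).

```python
TEXT = "日本語"  # sample text, chosen only for illustration

# Python codec names for the encodings discussed above
CODECS = ("iso2022_jp", "euc_jp", "shift_jis", "utf-8")


def show_encodings(text=TEXT):
    """Return {codec name: encoded bytes} for the same text."""
    return {name: text.encode(name) for name in CODECS}


def decodes_as(data, name):
    """True if `data` is a valid byte sequence in codec `name`."""
    try:
        data.decode(name)
    except UnicodeDecodeError:
        return False
    return True


def guess_encoding(data):
    """Heuristic guess, assuming the text is Japanese or plain ASCII."""
    if all(byte < 0x80 for byte in data):
        # 7-bit data: ISO-2022-JP if it uses escape sequences, else ASCII
        if b"\x1b" in data and decodes_as(data, "iso2022_jp"):
            return "iso2022_jp"
        return "ascii"
    # Strictest encoding first; the first one that decodes cleanly wins
    for name in ("utf-8", "euc_jp", "shift_jis"):
        if decodes_as(data, name):
            return name
    return None


if __name__ == "__main__":
    for name, data in show_encodings().items():
        print(name, data.hex(), guess_encoding(data))
```

Once Chinese or Korean encodings are possible candidates too, this
kind of trial decoding stops being reliable, which is where explicit
charset labels and negotiation come in.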

Attila's general advice is excellent, so I won't repeat or comment
here.  I will add that if you're going to put everything into UTF-8 as
he suggests (though you may run into opposition from Japanese
colleagues), you should have a language tag on the content.  Someday
you will need to mix Chinese or Korean with Japanese content, and then
you'll be glad you did.
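
The reason is that Unicode gives many Han characters a single code
point shared across Chinese, Japanese, and Korean, so the UTF-8 bytes
alone don't say which language (or which glyph style) was intended.
A minimal sketch in Python, assuming the content ends up in HTML and
using the standard lang attribute; the helper name tag_fragment and
the sample data are hypothetical.

```python
from html import escape


def tag_fragment(text, lang):
    """Wrap text in a span carrying a language tag such as "ja" or "zh-Hans"."""
    return '<span lang="{}">{}</span>'.format(
        escape(lang, quote=True), escape(text)
    )


# The same character stored for a Japanese and a Chinese context.
fragments = [("ja", "直"), ("zh-Hans", "直")]

# The encoded bytes are identical, so only the tag tells them apart.
same_bytes = fragments[0][1].encode("utf-8") == fragments[1][1].encode("utf-8")

html = "".join(tag_fragment(text, lang) for lang, text in fragments)

if __name__ == "__main__":
    print(same_bytes)
    print(html)
```

Keeping the tag next to the text from the start (for example as a
column beside each text field) is much cheaper than trying to
reconstruct the language later.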

-- 
School of Systems and Information Engineering http://turnbull.sk.tsukuba.ac.jp
University of Tsukuba                    Tennodai 1-1-1 Tsukuba 305-8573 JAPAN
               Ask not how you can "do" free software business;
              ask what your business can "do for" free software.
