Mailing List Archive



Re: [tlug] JIS X 0212? Any example "mixed charset" pages?



"Michael(tm) Smith" <smith@example.com> wrote:
>> 
>> Does anybody have examples of non-UTF-8 web pages that mix
>> Japanese characters with European accented characters? If so, what
>> encoding do they use? 

I have a sample page with some JIS X 0212 kanji and accented characters
at: http://www.csse.monash.edu.au/~jwb/wip.html

>> My (very limited) understanding of Japanese
>> encodings leads me to believe that the way they are likely to be
>> encoded (if there are actually any of them in the wild) is in
>> EUC-JP, and that they would need to assume JIS X 0212 support in
>> whatever browser is used to view the pages.

Not so. Practically no-one on the planet uses JIS X 0212 characters
in EUC-encapsulated text in WWW pages. The reason is 
<drumroll>IE doesn't support the full EUC-JP</drumroll>.

>> The set of characters that JIS X 0212 adds support for are shown
>> here:
>> 
>>   http://kanji.zinbun.kyoto-u.ac.jp/~yasuoka/CJK/jisx0212-1990.gif
>> 
>> Most of them are additional kanji, but it includes characters for
>> European languages also ("a ring", "e acute", etc.).

Yup. I have an u-umlaut, etc. on that sample page.

>> But I've heard that Internet Explorer does not support JIS X 0212,
>> so it would seem unlikely that anybody would actually create EUC-JP
>> pages that rely on JIS X 0212 support. 

Spot on.

>> Yet, given that relatively
>> few Japanese sites seem to use UTF-8, and that JIS X 0212 is not
>> well supported, I'm left wondering how instances of these kinds of
>> "mixed charset" pages are actually encoded in the real world.

They either use UTF-8 or encode the diacritics as HTML character
entities such as &ocirc;

In my WWWJDIC server, all the data files are in EUC. If you set your
dialogue (by cookie) to run in EUC or Shift_JIS (EUC is the default), the
output routines substitute HTML entities for the diacritics, etc., and 16x16
images for kanji and kana (yes, JIS X 0212 has a few extra kana). If you set
it to UTF-8, the raw codes go out. (This really only applies to the few
dictionary entries with JIS X 0212 kanji, the German file and the Buddhism
file.)
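
For anyone wanting to try the same idea, here is a rough Python sketch of
that kind of output routine. It is not the WWWJDIC code; the function names
(in_jisx0208, entity, render) are made up for illustration. It assumes that
"survives encoding to Shift_JIS" is a good enough test for "is not a
JIS X 0212 character", and it emits an entity for every such character,
kanji included, rather than a 16x16 image.

```python
from html.entities import codepoint2name


def in_jisx0208(ch):
    """Assumption: a character that Python's shift_jis codec can encode
    is safe to send raw in a legacy Japanese charset (no JIS X 0212)."""
    try:
        ch.encode("shift_jis")
        return True
    except UnicodeEncodeError:
        return False


def entity(ch):
    """Named HTML entity if one exists (e.g. &ocirc;), else numeric."""
    name = codepoint2name.get(ord(ch))
    return f"&{name};" if name else f"&#{ord(ch)};"


def render(euc_bytes, charset):
    """Convert EUC-JP data (possibly with JIS X 0212) for the given charset."""
    text = euc_bytes.decode("euc_jp")
    if charset.lower().replace("_", "-") in ("utf-8", "utf8"):
        # UTF-8 dialogue: the raw characters go out unchanged.
        return text.encode("utf-8")
    # Legacy dialogue: replace anything outside the safe set with an entity.
    safe = "".join(ch if in_jisx0208(ch) else entity(ch) for ch in text)
    return safe.encode(charset)


if __name__ == "__main__":
    data = "Zürich 東京".encode("euc_jp")  # the u-umlaut is a JIS X 0212 character
    print(render(data, "shift_jis"))
    print(render(data, "euc_jp"))
    print(render(data, "utf-8"))
```

The fallback to a numeric reference covers JIS X 0212 kanji, which have no
named entity; whether a given browser has a glyph for them is a separate
question.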

That's how I do it anyway. I started off putting out full
EUC-encapsulated JIS X 0212, but realised most of my users couldn't see it.

Cheers

Jim

-- 
Jim Breen                                http://www.csse.monash.edu.au/~jwb/
Clayton School of Information Technology,               Tel: +61 3 9905 9554
Monash University, VIC 3800, Australia                  Fax: +61 3 9905 5146
(Monash Provider No. 00008C)                ジム・ブリーン@モナシュ大学

