Mailing List Archive


Re: SJIS & HTML - potential trouble?



>>>>> "Jim" == Jim Breen <jwb@example.com> writes:

    Jim> Why on earth would SJIS be dear to anyone's heart??

:-)  Bill Gates must love the wonderful joke he played on thousands of 
hapless nihongo programmers.

    Jim> Well I can quite imagine some Americo-centric programmer
    Jim> stumbling on codes > 128. OTOH, do they really write parsers
    Jim> that could not handle the ISO-8859-1 codes which are very
    Jim> widely used in Europe?

Be a little fair; almost nobody writes code that isn't 8-bit clean
anymore; the big problem was that "8-bit-dirty" was embedded in lots
and lots of libc.a's.  Oriental languages which are inherently 2-byte
*must* by the RFC mix with single byte ISO646 ("bare ASCII", you might 
say), and that is surely hairy.
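To make "8-bit-dirty" concrete, here is a minimal sketch in Python (not from the original thread) of what happens when some layer clears the high bit of every byte in EUC-JP text.  The sample string and the variable names are assumptions chosen for illustration only.

```python
# Sketch: the effect of a non-8-bit-clean routine on EUC-JP text.
# The sample string is an arbitrary illustration, not from the thread.
text = "日本語"
euc = text.encode("euc_jp")              # two bytes per kanji, high bit set on each

# What an "8-bit-dirty" routine does: mask every byte down to 7 bits.
stripped = bytes(b & 0x7F for b in euc)

# The result is still the same length, but every byte now looks like
# ordinary ASCII, so nothing downstream can tell it was ever Japanese.
looks_like_ascii = stripped.decode("ascii")

print(euc.hex(" "))
print(looks_like_ascii)
```

The damage is silent: no error is raised, the kanji simply turn into ASCII punctuation and letters.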

    Jim> Seriously, though, people writing parsers, etc, should be
    Jim> producing code which is: (a) configurable for a series of
    Jim> multi-byte codes with the MSB set and not set (b) able to
    Jim> handle the UTF codings of Unicode/ISO10646

You don't ask for much, do you?  I've looked at the source for Mule,
and it's hairy; no, let me say it's positively furry.

Let's at least say that Netscape 2.0 international beta regularly
choked on documents including JIS and EUC codes, both in auto-code mode
and in assume-JIS mode.  To its credit, it always (in my experience)
retrained to the correct mode after a couple of bytes, but I lost a
few paragraph markers (I forget what "<p" is in escapeless JIS) and
gained many extras that way.
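The lost paragraph markers can be illustrated with a short Python sketch.  It assumes that "escapeless JIS" means 7-bit JIS X 0208 byte pairs with the ISO-2022-JP escape sequences missing; the escapes are added back here only so that Python's `iso2022_jp` codec will interpret the pair as a kanji.  The names are hypothetical.

```python
# Sketch: the same two bytes read as ASCII and as a 7-bit JIS kanji.
# Assumption: "escapeless JIS" = JIS X 0208 byte pairs without the
# ISO-2022-JP escapes that normally announce them.
TAG_START = b"<p"                        # 0x3C 0x70, the start of an HTML <p> tag

as_ascii = TAG_START.decode("ascii")

# Wrap the identical bytes in the "switch to JIS X 0208" and
# "switch back to ASCII" escapes and decode again.
as_jis = (b"\x1b$B" + TAG_START + b"\x1b(B").decode("iso2022_jp")

print(repr(as_ascii), "vs", repr(as_jis))
```

A browser that guesses the wrong mode for even two bytes can therefore swallow a tag as a kanji, or invent a tag out of half of one.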

I'm not sure what exactly is legal in HTML, but I suspect you need to
read RFC-MIME as well as RFC-HTML.  I wouldn't be surprised if a
strict reading of the RFCs led to the conclusion that each passage in
an oriental language needs to be embedded in a separate part of a MIME
multi-part document.

What really needs to be done is a solid GPL-(or freer)-license lexing
library which does all the above and also is extensible for national
standards which are old and incompatible with the Unicode standard.
This is not a project I'm willing to attempt at present, though.
Presumably the Mule internal routines could be adapted, or jcode.c
converted into a library (although the latter is just as Japan-centric 
as 7-bit ASCII is Americo-centric).

>>>>> "Jim" == Jim Breen <jwb@example.com> writes:

    Jim> "Proper" handling of SJIS (an oxymoron if there ever was one)
    Jim> involves a lot of checking for valid/invalid sequences, as
    Jim> you have to cater for the unspeakable hankaku katakana as

"Unspeakable?"  Is this a technical linguistics term?  :-)

    Jim> well. Trying to scan backwards, e.g. in a WP program, through
    Jim> some raw SJIS sends you grey. Usually developers do something
    Jim> like holding everything as 16-bit codes internally.

Mule uses 32-bit codes, mostly!!
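Why backward scanning is painful can be shown in a few lines of Python.  This is only a sketch: the lead-byte ranges below (0x81-0x9F and 0xE0-0xFC, which include vendor extension rows) are an assumption, there is no validation of trail bytes, and the function names are hypothetical.  It has nothing to do with how Mule or jcode.c actually work.

```python
# Sketch: finding character boundaries in raw Shift-JIS.
# Assumed lead-byte ranges: 0x81-0x9F and 0xE0-0xFC.  Bytes 0xA1-0xDF are
# single-byte hankaku katakana; anything else is treated as one byte.

def sjis_is_lead(b):
    """True if b can begin a two-byte Shift-JIS character."""
    return 0x81 <= b <= 0x9F or 0xE0 <= b <= 0xFC

def sjis_char_starts(data):
    """Scan forward from offset 0 and return the offset of every character."""
    starts = []
    i = 0
    while i < len(data):
        starts.append(i)
        # A lead byte consumes the following byte as well (if there is one).
        i += 2 if sjis_is_lead(data[i]) and i + 1 < len(data) else 1
    return starts

def sjis_prev_start(data, pos):
    """Offset of the character just before pos, found by rescanning forward."""
    prev = 0
    for s in sjis_char_starts(data):
        if s >= pos:
            break
        prev = s
    return prev

# In both strings the last byte is 0x5C preceded by 0x83, yet it is a trail
# byte in the first and a stand-alone backslash in the second.
print(sjis_char_starts(b"\x83\x5c"))
print(sjis_char_starts(b"\x83\x83\x5c"))
```

Because a trail byte can itself fall in the lead-byte range, looking at the one or two preceding bytes is not enough; the sketch sidesteps this by always rescanning from a known boundary, which is exactly the cost that pushes developers toward fixed-width internal codes.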

    Jim> (No excuse for bad parsing, though.)

I've tried to write lex code to reproduce Ken Lunde's jcode.c; it's
not easy unless you're looking at Ken's source.

The author of xjdic should know this, though :-)

Let's face it, the Japanese language is fundamentally just an attempt
to postpone the day when Turing's Test is passed.  What you're saying
is that all programmers need to learn Japanese *and* Japanese
character codesets?

Realistically, we're going to have to accept that Japanese is not
going to be well-handled by most software for some time to come, until 
most authors are using Unicode-emitting tools.  Right.  You really
expect that on your Sharp Internet-capable wapuro?  Maybe, but...

-- 
                            Stephen J. Turnbull
Institute of Policy and Planning Sciences                    Yaseppochi-Gumi
University of Tsukuba                      http://turnbull.sk.tsukuba.ac.jp/
Tennodai 1-1-1, Tsukuba, 305 JAPAN                 turnbull@example.com
-----------------------------------------------------------------
a word from the sponsor will appear below
-----------------------------------------------------------------
The TLUG mailing list is proudly sponsored by TWICS - Japan's First
Public-Access Internet System.  Now offering 20,000 yen/year flat
rate Internet access with no time charges.  Full line of corporate
Internet and intranet products are available.   info@example.com
Tel: 03-3351-5977   Fax: 03-3353-6096

