Mailing List Archive


Re: SJIS & HTML - potential trouble?



On Nov 20,  9:35am, Stephen J. Turnbull wrote:
} Subject: Re: SJIS & HTML - potential trouble?

ST>> Be a little fair; almost nobody writes code that isn't 8-bit clean
ST>> anymore; the big problem was that "8-bit-dirty" was embedded in lots
ST>> and lots of libc.a's.  Oriental languages which are inherently 2-byte

Don't be so sure about 8-bit clean.

ST>> *must* by the RFC mix with single byte ISO646 ("bare ASCII", you might 
ST>> say), and that is surely hairy.

And it is even worse when you are not talking about ISO646 but
ISO646+Latin-x. Ask a Scandinavian trying to run Japanese applications on
their localized versions of Windoze about colliding character sets.

ST>>  I've looked at the source for Mule,
ST>> and it's hairy; no, let me say it's positively furry.

I've heard it is heavy. I have poked around in kterm and jstevie, and
that's quite enough.

ST>> Let's at least say that Netscape 2.0 international beta regularly
ST>> choked in documents including JIS and EUC codes both in auto-code mode
ST>> and in assume JIS mode.  To its credit, it always (in my experience)
ST>> retrained to the correct mode after a couple of bytes, but I lost a
ST>> few paragraph markers (I forget what "<p" is in escapeless JIS) and
ST>> gained many extras that way.

Auto-detection is always risky. Ken Lunde does it fairly well in jconv,
and tells you if it can't cope. What I have done in xjdic & JREADER is
pretty fragile, as I do it afresh for each line. For JIS X 0212 in EUC-JP
no-one can cope, because it looks like SJIS. I have to force EUC from the
command line.

ST>> I'm not sure what exactly is legal in HTML, but I suspect you need to
ST>> read RFC-MIME as well as RFC-HTML.  I wouldn't be surprised if a
ST>> strict reading of the RFCs led to the conclusion that each passage in
ST>> an oriental language needs to be embedded in a separate part of a MIME
ST>> multi-part document.

Well, HTML standards are a mess, thanks to Netscape. HTML V2.0 said just
ISO646, in effect (actually it is a DTD within SGML). Most browsers extend
it to Latin-1. HTML V3.0 died, and V3.2 seems trapped in a welter of
proprietary extensions. Is it an RFC at all? I think it's from the W3C, not
the IETF.

ST>> I've tried to write lex code to reproduce Ken Lunde's jcode.c; it's
ST>> not easy unless you're looking at Ken's source.
ST>> 
ST>> The author of xjdic should know this, though :-)

And how. I got enthusiastic some years ago, and wrote a state-driven
detector which could reliably tell SJIS, EUC and UTF-8 apart. "Normal"
techniques fail because there is so much overlap, so I did it by
elimination. I can't imagine trying it in lex.
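
For anyone who wants to see the idea of detection by elimination in code, here is a minimal Python sketch. It is not the detector described above, just an illustration under stated assumptions: every name in it (IDLE, step_sjis, step_euc, step_utf8, detect_by_elimination) is made up for this example, the byte ranges are the commonly documented ones for each encoding, and the UTF-8 check is simplified (it does not reject overlong forms or surrogates). Each encoding gets a tiny state machine, every byte is fed to all surviving machines, and a candidate is dropped the moment it sees a byte it cannot accept.

```python
# State is (trail bytes still expected, lowest valid trail, highest valid trail).
IDLE = (0, 0, 0)

def step_sjis(state, b):
    """Advance the hypothetical SJIS checker by one byte; None means 'eliminated'."""
    if state[0]:
        # SJIS trail byte: 0x40-0x7E or 0x80-0xFC.
        return IDLE if (0x40 <= b <= 0xFC and b != 0x7F) else None
    if b <= 0x7F or 0xA1 <= b <= 0xDF:        # ASCII or half-width katakana
        return IDLE
    if 0x81 <= b <= 0x9F or 0xE0 <= b <= 0xFC:  # lead byte of a 2-byte character
        return (1, 0x40, 0xFC)
    return None

def step_euc(state, b):
    """Advance the hypothetical EUC-JP checker by one byte."""
    remaining, lo, hi = state
    if remaining:
        return (remaining - 1, lo, hi) if lo <= b <= hi else None
    if b <= 0x7F:
        return IDLE
    if b == 0x8E:                  # single shift 2: half-width katakana
        return (1, 0xA1, 0xDF)
    if b == 0x8F:                  # single shift 3: JIS X 0212, two more bytes
        return (2, 0xA1, 0xFE)
    if 0xA1 <= b <= 0xFE:          # JIS X 0208, one more byte
        return (1, 0xA1, 0xFE)
    return None

def step_utf8(state, b):
    """Advance the simplified UTF-8 checker by one byte."""
    remaining, lo, hi = state
    if remaining:
        return (remaining - 1, lo, hi) if lo <= b <= hi else None
    if b <= 0x7F:
        return IDLE
    if 0xC2 <= b <= 0xDF:
        return (1, 0x80, 0xBF)
    if 0xE0 <= b <= 0xEF:
        return (2, 0x80, 0xBF)
    if 0xF0 <= b <= 0xF4:
        return (3, 0x80, 0xBF)
    return None

STEPPERS = {"SJIS": step_sjis, "EUC": step_euc, "UTF-8": step_utf8}

def detect_by_elimination(data):
    """Return the sorted names of all encodings the bytes are still consistent with."""
    states = {name: IDLE for name in STEPPERS}
    for b in data:
        for name in list(states):
            nxt = STEPPERS[name](states[name], b)
            if nxt is None:
                del states[name]   # this candidate cannot explain the input
            else:
                states[name] = nxt
    # A candidate left in the middle of a character at end of input is also out.
    return sorted(name for name, st in states.items() if st[0] == 0)
```

Note that the function returns every survivor rather than a single answer, so pure ASCII comes back as consistent with all three, and a three-byte sequence of the shape EUC uses for JIS X 0212, such as 8F B0 A1, survives as both EUC and SJIS. A real detector would need more input, or a hint from the user, to break such ties.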

ST>> Let's face it, the Japanese language is fundamentally just an attempt
ST>> to postpone the day when Turing's Test is passed.  What you're saying
ST>> is that all programmers need to learn Japanese *and* Japanese
ST>> character codesets?

Well, codesets anyway. When Hongbo Ni sent me the pre-Alpha version of
NJSTAR I looked at it and asked, among other things "what codes do you
support: EUC, SJIS, JIS?" and "how do you handle the inflections of verbs
and adjectives?". To both he replied "What are they?" He learned a lot
very quickly.

Back to work

Jim
jwb@example.com

-----------------------------------------------------------------
a word from the sponsor will appear below
-----------------------------------------------------------------
The TLUG mailing list is proudly sponsored by TWICS - Japan's First
Public-Access Internet System.  Now offering 20,000 yen/year flat
rate Internet access with no time charges.  Full line of corporate
Internet and intranet products are available.   info@example.com
Tel: 03-3351-5977   Fax: 03-3353-6096

