Re: SJIS & HTML - potential trouble?
- To: tlug@example.com
- Subject: Re: SJIS & HTML - potential trouble?
- From: jwb@example.com (Jim Breen)
- Date: Wed, 20 Nov 1996 12:34:03 -0500
- In-Reply-To: turnbull@example.com (Stephen J. Turnbull) "Re: SJIS & HTML - potential trouble?" (Nov 20, 9:35am)
- Reply-To: tlug@example.com
- Sender: owner-tlug
On Nov 20, 9:35am, Stephen J. Turnbull wrote:
} Subject: Re: SJIS & HTML - potential trouble?

ST>> Be a little fair; almost nobody writes code that isn't 8-bit clean
ST>> anymore; the big problem was that "8-bit-dirty" was embedded in lots
ST>> and lots of libc.a's. Oriental languages which are inherently 2-byte

Don't be so sure about 8-bit clean.

ST>> *must* by the RFC mix with single byte ISO646 ("bare ASCII", you might
ST>> say), and that is surely hairy.

And it is even worse when you are not talking about ISO646 but
ISO646+Latin-x. Ask a Scandinavian trying to run Japanese applications
on their localized versions of Windoze about colliding character sets.

ST>> I've looked at the source for Mule,
ST>> and it's hairy; no, let me say it's positively furry.

I've heard it is heavy. I have poked around in kterm and jstevie, and
that's quite enough.

ST>> Let's at least say that Netscape 2.0 international beta regularly
ST>> choked in documents including JIS and EUC codes both in auto-code mode
ST>> and in assume JIS mode. To its credit, it always (in my experience)
ST>> retrained to the correct mode after a couple of bytes, but I lost a
ST>> few paragraph markers (I forget what "<p" is in escapeless JIS) and
ST>> gained many extras that way.

Auto-detection is always risky. Ken Lunde does it fairly well in jconv,
and tells you if it can't cope. What I have done in xjdic & JREADER is
pretty fragile, as I do it afresh for each line. For JIS212 in EUC-J
no-one can cope, because it looks like SJIS. I have to force EUC from
the command line.

ST>> I'm not sure what exactly is legal in HTML, but I suspect you need to
ST>> read RFC-MIME as well as RFC-HTML. I wouldn't be surprised if a
ST>> strict reading of the RFCs led to the conclusion that each passage in
ST>> an oriental language needs to be embedded in a separate part of a MIME
ST>> multi-part document.

Well, HTML standards are a mess, thanks to Netscape. HTML V2.0 said just
ISO646, in effect (actually it is a DTD within SGML). Most browsers
extend it to Latin-1. HTML V3.0 died, and V3.2 seems trapped in a welter
of proprietary extensions. Is it an RFC at all? I think it's from the
WWWC, not the IETF.

ST>> I've tried to write lex code to reproduce Ken Lunde's jcode.c; it's
ST>> not easy unless you're looking at Ken's source.
ST>>
ST>> The author of xjdic should know this, though :-)

And how. I got enthusiastic some years ago, and wrote a state-driven
detector which could reliably tell SJIS, EUC and UTF-8 apart. "Normal"
techniques fail because there is so much overlap, so I did it by
elimination. I can't imagine trying it in lex.

ST>> Let's face it, the Japanese language is fundamentally just an attempt
ST>> to postpone the day when Turing's Test is passed. What you're saying
ST>> is that all programmers need to learn Japanese *and* Japanese
ST>> character codesets?

Well, codesets anyway. When Hongbo Ni sent me the pre-Alpha version of
NJSTAR I looked at it and asked, among other things, "what codes do you
support: EUC, SJIS, JIS?" and "how do you handle the inflections of
verbs and adjectives?". To both he replied "What are they?" He learned
a lot very quickly.

Back to work

Jim
jwb@example.com
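The "detect by elimination" idea mentioned in the message can be illustrated with a small sketch. This is not the code from xjdic, JREADER or jconv, none of which is shown in this thread. The function names (valid_utf8, valid_euc_jp, valid_sjis, candidates) are hypothetical, and the byte ranges are a simplified, structural reading of each encoding: they check only that lead and trail bytes fall in plausible ranges, not that every code point is actually assigned. The approach is to run one small validator per encoding over the same bytes and drop every encoding whose rules are violated.

```python
from typing import List


def valid_utf8(data: bytes) -> bool:
    """Structural UTF-8 check: each lead byte is followed by the right
    number of continuation bytes (0x80-0xBF)."""
    need = 0  # continuation bytes still expected
    for b in data:
        if need:
            if not 0x80 <= b <= 0xBF:
                return False
            need -= 1
        elif b < 0x80:
            continue
        elif 0xC2 <= b <= 0xDF:
            need = 1
        elif 0xE0 <= b <= 0xEF:
            need = 2
        elif 0xF0 <= b <= 0xF4:
            need = 3
        else:
            return False
    return need == 0


def valid_euc_jp(data: bytes) -> bool:
    """Structural EUC-JP check: ASCII, 0x8E + half-width kana,
    0x8F + two bytes (JIS X 0212), or two bytes in 0xA1-0xFE."""
    i, n = 0, len(data)
    while i < n:
        b = data[i]
        if b < 0x80:
            i += 1
        elif b == 0x8E:
            if i + 1 >= n or not 0xA1 <= data[i + 1] <= 0xDF:
                return False
            i += 2
        elif b == 0x8F:
            if i + 2 >= n or not all(0xA1 <= x <= 0xFE for x in data[i + 1:i + 3]):
                return False
            i += 3
        elif 0xA1 <= b <= 0xFE:
            if i + 1 >= n or not 0xA1 <= data[i + 1] <= 0xFE:
                return False
            i += 2
        else:
            return False
    return True


def valid_sjis(data: bytes) -> bool:
    """Structural Shift-JIS check: single bytes below 0x80 or in
    0xA1-0xDF (half-width kana), else a lead byte plus a trail byte.
    The lead range up to 0xFC is an assumption that admits vendor
    extensions."""
    i, n = 0, len(data)
    while i < n:
        b = data[i]
        if b < 0x80 or 0xA1 <= b <= 0xDF:
            i += 1
        elif 0x81 <= b <= 0x9F or 0xE0 <= b <= 0xFC:
            if i + 1 >= n:
                return False
            t = data[i + 1]
            if not (0x40 <= t <= 0x7E or 0x80 <= t <= 0xFC):
                return False
            i += 2
        else:
            return False
    return True


def candidates(data: bytes) -> List[str]:
    """Return the encodings that were NOT eliminated for these bytes."""
    checks = [("utf-8", valid_utf8), ("euc-jp", valid_euc_jp), ("shift_jis", valid_sjis)]
    return [name for name, ok in checks if ok(data)]


if __name__ == "__main__":
    for enc in ("utf-8", "euc_jp", "shift_jis"):
        print(enc, candidates("日本語".encode(enc)))
    # A shorter EUC-JP sample can survive the Shift-JIS check as well.
    print("euc_jp (short)", candidates("日本".encode("euc_jp")))
```

Elimination narrows the field but does not always leave a single answer: plain ASCII passes all three checks, and short EUC-JP samples can also look like structurally valid Shift-JIS. That is the overlap the message complains about, and it is one reason a detector that works line by line is fragile. A caller that gets more than one candidate back needs either more input or an explicit override, such as a command-line option.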
- Follow-Ups:
- Re: SJIS & HTML - potential trouble?
- From: turnbull@example.com (Stephen J. Turnbull)