Mailing List Archive

Re: SJIS & HTML - potential trouble?



The following is munged into order of importance; I think it's still
readable.

>>>>> "Jim" == Jim Breen <jwb@example.com> writes:

    Jim> And how. I got enthusiastic some years ago, and wrote a
    Jim> state-driven detector which could reliably tell SJIS, EUC and
    Jim> UTF-8 apart. "Normal" techniques fail because there is so
    Jim> much overlap, so I did it by elimination. I can't imagine
    Jim> trying it in lex.

Is this available publicly?

You're right: you wouldn't want to do it in lex alone; you'd need a
yacc layer as well.  (The lex part would be useful for creating
character classes; yacc is much more convenient for explicitly
tracking states, although not well-designed for this particular
application.)
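
For concreteness, here is a rough sketch of what detection "by
elimination" with explicit state tracking might look like.  It is not
Jim's detector, which is not shown in this thread; the function names
and the simplified byte ranges are assumptions made for illustration
(for instance, the UTF-8 check does not reject every overlong form).
Each candidate encoding keeps a queue of the trailing bytes it still
expects, and a candidate is dropped the moment a byte does not fit.

```python
# Allowed ranges for trailing bytes (simplified; assumptions for illustration).
CONT = ((0x80, 0xBF),)                     # UTF-8 continuation byte
EUC_TRAIL = ((0xA1, 0xFE),)                # EUC-JP second/third byte
SJIS_TRAIL = ((0x40, 0x7E), (0x80, 0xFC))  # Shift_JIS second byte

# Each "lead" function takes the first byte of a character and returns the
# ranges allowed for each following byte, or None if the byte cannot start
# a character in that encoding.
def utf8_lead(b):
    if b < 0x80:
        return ()
    if 0xC2 <= b <= 0xDF:
        return (CONT,)
    if 0xE0 <= b <= 0xEF:
        return (CONT, CONT)
    if 0xF0 <= b <= 0xF4:
        return (CONT, CONT, CONT)
    return None

def euc_jp_lead(b):
    if b < 0x80:
        return ()
    if b == 0x8E:                          # half-width katakana prefix
        return (((0xA1, 0xDF),),)
    if b == 0x8F:                          # three-byte (JIS X 0212) prefix
        return (EUC_TRAIL, EUC_TRAIL)
    if 0xA1 <= b <= 0xFE:
        return (EUC_TRAIL,)
    return None

def sjis_lead(b):
    if b < 0x80 or 0xA1 <= b <= 0xDF:      # ASCII or half-width katakana
        return ()
    if 0x81 <= b <= 0x9F or 0xE0 <= b <= 0xFC:
        return (SJIS_TRAIL,)
    return None

LEADS = {"utf-8": utf8_lead, "euc-jp": euc_jp_lead, "shift_jis": sjis_lead}

def surviving_encodings(data):
    """Return the candidate encodings that the byte string does not rule out."""
    pending = {name: [] for name in LEADS}  # trailing bytes still expected
    for b in data:
        for name in list(pending):
            queue = pending[name]
            if queue:
                # Mid-character: the byte must fit the next expected range.
                allowed = queue.pop(0)
                ok = any(lo <= b <= hi for lo, hi in allowed)
            else:
                # Start of a character: ask the encoding what must follow.
                expected = LEADS[name](b)
                ok = expected is not None
                if ok:
                    queue.extend(expected)
            if not ok:
                del pending[name]           # eliminated
    # A candidate left in mid-character at end of input is also ruled out.
    return sorted(name for name, queue in pending.items() if not queue)
```

Pure ASCII input leaves all three candidates standing, and short runs
of half-width katakana can leave more than one survivor, so a real
detector would presumably need a tie-breaking step on top of this.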

Once you got into trying to tease apart different languages
automatically, you'd need to use semantic content, I guess.  Ah,
AI....

[ The following is just chat.... ]

    ST> What you're saying is that all programmers need to learn
    ST> Japanese *and* Japanese character codesets?

    Jim> Well, codesets anyway. When Hongbo Ni sent me the pre-Alpha
    Jim> version of NJSTAR I looked at it and asked, among other

But this is specifically internationalized.  True, programmers of
internationalized software need to learn it.  But can we really expect
RMS and Larry Wall to put that much effort into Emacs and Perl?  Mule
is a wonderful piece of code, although the comments left a lot to be
desired by GNU standards in 1994; JPerl (of the same vintage) was a
serious crock.  The point is that neither RMS nor Wall should take
credit or blame for the Japanizations.

I guess I'm going to have to look into localization standards more
carefully.

    Jim> Well HTML standards are a mess, thanks to Netscape. HTML V2.0
    Jim> said just ISO646, in effect (actually it is a DTD within
    Jim> SGML.) Most browsers extend it to the Latin-1. HTML V3.0
    Jim> died, and V3.2 seems trapped in a welter of proprietary
    Jim> extensions. Is it an RFC at all? I think it's from the WWWC,
    Jim> not the IETF.

There were HTML RFCs, but they may have expired before becoming
standards.  The HTTP stuff is all RFCs (written by W3C staffers, of
course).

    Jim> Don't be so sure about 8-bit clean.

Well, most Unix text filters work pretty well with Japanese; less, for 
example, works fine for me in a kterm.

But I take your point.  And of course tools using heuristics (like
glimpse or even grep) will be very language dependent, and not 8-bit
clean.  I had forgotten about those.
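
One well-known way a tool can pass bytes through untouched and still
misread Japanese text: in Shift_JIS the second byte of a character may
fall in the ASCII range.  This is offered as an illustration, not
necessarily the case Jim had in mind, and the two helper names are
made up for the example.

```python
# The kanji below encodes in Shift_JIS as 0x95 0x5C, and 0x5C on its own
# is the ASCII backslash.
raw = "表".encode("shift_jis")

def naive_has_backslash(data):
    # Byte-oriented check: knows nothing about two-byte characters.
    return b"\\" in data

def aware_has_backslash(data):
    # Decode first, then look for the character itself.
    return "\\" in data.decode("shift_jis")

print(naive_has_backslash(raw))   # the byte-level check reports a backslash
print(aware_has_backslash(raw))   # the decoded text contains none
```

A filter that treats backslash as an escape character would have the
same trouble with this text even though it never strips the high bit.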

    Jim> And it is even worse when you are not talking about ISO646
    Jim> but ISO646+Latin-x. Ask a Scandinavian trying to run Japanese
    Jim> applications on their localized versions of Windoze about
    Jim> colliding character sets.

Bjarne Stroustrup had a snippet of what C looks like in Danish in one
of the C++ books.  Pretty funny if you don't have to deal with it.
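
For anyone who hasn't seen the effect, here is a small sketch.  It is
not Stroustrup's snippet; it assumes only the six code positions that
the Danish/Norwegian national variants of ISO 646 give to the letters
Æ, Ø, Å and their lowercase forms, and the function name is invented
for the example.

```python
# Code positions that hold [ \ ] { | } in ASCII hold these letters in the
# Danish/Norwegian ISO 646 variants (other differing positions are ignored
# here; that simplification is an assumption of this sketch).
DANISH_646 = str.maketrans("[\\]{|}", "ÆØÅæøå")

def as_seen_on_danish_terminal(source):
    """Show ASCII source text as a Danish ISO 646 display would render it."""
    return source.translate(DANISH_646)

c_line = "int main(int argc, char *argv[]) { return a[i] | b; }"
print(as_seen_on_danish_terminal(c_line))
# prints: int main(int argc, char *argvÆÅ) æ return aÆiÅ ø b; å
```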

Steve

-- 
                            Stephen J. Turnbull
Institute of Policy and Planning Sciences                    Yaseppochi-Gumi
University of Tsukuba                      http://turnbull.sk.tsukuba.ac.jp/
Tennodai 1-1-1, Tsukuba, 305 JAPAN                 turnbull@example.com