Mailing List Archive

Re: SJIS & HTML - potential trouble?



>Has anyone seen anything written about problems of SJIS text confusing
>HTML parsers?  I haven't had time to think it through, but it seems
>likely that second bytes of SJIS could confuse naive HTML parsers.
>I'd expect that EUC is less troublesome, but worth checking as well.
>
It is fairly safe. HTML text has to be in the ISO Latin-1 character set, and
the first byte of an SJIS double-byte character falls in its upper half
(128 and above), well away from the ASCII markup characters.
The second byte is in the range 64..126 or 128..252, and there are no HTML
special meaning characters (<, >, ", &) in either range.
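
A quick way to check the claim above is to intersect the SJIS second-byte
ranges with the byte values of the HTML special characters. This is only a
sketch: the ranges 0x40-0x7E and 0x80-0xFC are the commonly documented
Shift JIS trail-byte ranges, and the names TRAIL_BYTES, HTML_SPECIAL and
trail_byte_collisions are made up for illustration.

```python
# Shift JIS second (trail) byte ranges, assumed here: 0x40-0x7E and 0x80-0xFC.
TRAIL_BYTES = set(range(0x40, 0x7F)) | set(range(0x80, 0xFD))

# Byte values of the characters that have special meaning in HTML markup.
HTML_SPECIAL = {ord(c) for c in '<>"&'}


def trail_byte_collisions(special=HTML_SPECIAL):
    """Return the special byte values that can also appear as a trail byte."""
    return sorted(TRAIL_BYTES & special)


collisions = trail_byte_collisions()
print(collisions)  # [] because <, >, " and & are all below 0x40
```

An empty list means a parser scanning byte by byte will never mistake a
second byte for one of these four characters. It says nothing about other
characters in the range, which may matter to non-HTML tools.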

Any character above 127, as well as the special meaning characters
(<, >, ", &, etc.), has to be encoded when used in URLs (which includes
parameters being passed to a CGI program). The encoding is % followed by the
two-digit hex value of the byte.
The browser will do this for you, so the only thing you have to do is decode
in your CGI program (e.g. the 0x90ab SJIS character would become %90%AB - 6
characters in your input stream). The standard Perl and C CGI libraries
normally decode these hex values for you, so there should be no problems.
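
The round trip described above can be sketched as follows. The post talks
about Perl and C; Python's standard urllib.parse is used here purely as an
illustration, and the variable names are made up. The example works on raw
bytes and does not try to interpret 0x90ab as a particular character.

```python
from urllib.parse import quote_from_bytes, unquote_to_bytes

# The two bytes of the SJIS character 0x90ab from the example above.
raw = bytes([0x90, 0xAB])

# What the browser sends: each byte becomes % plus two hex digits.
encoded = quote_from_bytes(raw)
print(encoded, len(encoded))  # %90%AB 6

# What the CGI program has to do: turn the escapes back into bytes.
decoded = unquote_to_bytes(encoded)
print(decoded == raw)  # True
```

After decoding, the program holds the original SJIS bytes and can treat
them as such; the escaping only exists on the wire.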

Darren


