
Mailing List Archive
Re: [tlug] Re: Security question with grep/e...
- Date: Sat, 27 Mar 2004 09:59:50 +1100 (EST)
- From: Jim Breen <Jim.Breen@example.com>
- Subject: Re: [tlug] Re: Security question with grep/e...
"Stephen J. Turnbull" <stephen@example.com> wrote:
>>
>> Jim> At some time in the distant future I may get the whole
>> Jim> shebang migrated to UTF8 and I'll see if I can get wide-char
>> Jim> grepping set up then. Maybe POSIX will be doing multilingual.
>>
>> For your purpose, this should work fine. According to Uli Drepper
>> (glibc maintainer), the only real issue in doing byte-by-byte regexp
>> searches with UTF-8 is efficiency.
My problem at the moment on that score is that I have a heap of mirrors,
and lowest-common-denominator rules. Working MB regexes are relatively
new in glibc, and most mirror sites use old versions (as does the Monash
server). The only way I can do them reliably across all mirrors
is to run my own source out. I'd rather wait.
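
For reference, the byte-by-byte approach Stephen describes can be sketched
roughly as below. This is only an illustration in Python, not the code running
on the server: the function name bytewise_search and the sample strings are
made up for the example. It encodes both the pattern and the text with the
same codec and searches the raw bytes, so no multibyte-aware regex library is
needed.

```python
import re


def bytewise_search(pattern: str, text: str, encoding: str = "utf-8") -> bool:
    """Search the encoded bytes of `text` for the encoded bytes of `pattern`.

    Hypothetical helper: pattern and text are encoded with the same codec,
    and the pattern is escaped so its bytes are matched literally.
    """
    raw_pattern = re.escape(pattern.encode(encoding))
    return re.search(raw_pattern, text.encode(encoding)) is not None


# Sample data invented for the illustration.
sample = "日本語の辞書を引く"
found_utf8 = bytewise_search("辞書", sample)            # UTF-8 bytes
found_euc = bytewise_search("辞書", sample, "euc_jp")   # EUC-JP bytes
missing = bytewise_search("辞典", sample)               # not in the text
```

Nothing here depends on the C library's locale support, which is the appeal
when the mirrors run a mix of old glibc versions.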
>> Same for EUC-JP, of course, main
>> problem is ensuring that you get the right flavor of bytes stuffed
>> into the regexp. People using 7-bit JIS, Shift-JIS, or a Unicode
>> variant will not get sane output searching an EUC-JP text.
Internally I use EUC-JP, both 2-byte and the 3-byte JIS X 0212 variety
(I'm probably the only person in the galaxy doing the latter). The
server pages dish out `charset="euc-jp"' (and the server does the
matching MIME header) by default. You can set the code to SJIS, UTF8 or
ISO-2022-JP via a cookie, and the server does code-conversions on the
I/O boundary (the AIX mirror at UofVirginia dies on UTF8 as its iconv
tables aren't up to snuff).
The point of all this is that at present I don't have to handle
anything apart from EUC-JP internally.
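
A rough sketch of that arrangement, again in Python and purely illustrative:
the cookie labels in CLIENT_CODECS, the function names, and the sample entries
are assumptions for the example, not the server's actual code. Input is
converted to EUC-JP on the way in, the search runs on EUC-JP bytes only, and
results are converted to the client's charset on the way out.

```python
import re
from typing import List

INTERNAL = "euc_jp"

# Hypothetical cookie values mapped to Python codec names.
CLIENT_CODECS = {
    "EUC": "euc_jp",
    "SJIS": "shift_jis",
    "UTF8": "utf-8",
    "JIS": "iso2022_jp",
}


def to_internal(raw: bytes, client_code: str) -> bytes:
    """Convert bytes arriving from the client into EUC-JP."""
    codec = CLIENT_CODECS.get(client_code, INTERNAL)
    return raw.decode(codec).encode(INTERNAL)


def from_internal(raw: bytes, client_code: str) -> bytes:
    """Convert EUC-JP bytes into the client's chosen charset."""
    codec = CLIENT_CODECS.get(client_code, INTERNAL)
    return raw.decode(INTERNAL).encode(codec)


def lookup(query: bytes, client_code: str, lines: List[bytes]) -> List[bytes]:
    """Search EUC-JP lines for the query; convert only at the boundary."""
    key = re.escape(to_internal(query, client_code))
    hits = [line for line in lines if re.search(key, line)]
    return [from_internal(hit, client_code) for hit in hits]


# Sample entries invented for the illustration, stored as EUC-JP.
entries = [
    "辞書 [じしょ] dictionary".encode(INTERNAL),
    "日本 [にほん] Japan".encode(INTERNAL),
]
sjis_hits = lookup("辞書".encode("shift_jis"), "SJIS", entries)
```

If the query were searched without the conversion step, its Shift-JIS bytes
would simply never match the EUC-JP data, which is Stephen's point about
getting the right flavor of bytes into the regexp.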
>> But you might be surprised---modern HTTP 1.1 with charset negotiation
>> between server and browser might get the right answer most of the time.
I'm not convinced that HTTP-level charset negotiation has much to offer
in the case of a slab of CGI code. It's all very well when a server
has a battery of pages available in various languages, and the browser
rocks up and says: "talk to me in Greek". In the case of Japanese
it's a simpler matter - which Japanese-capable codeset/encapsulation
shall we use? Since all browsers (theoretically) support all of them,
"server rules" is a reasonable working position.
(This got broken by the bloody i-mode option, as NTT/DoCoMo decided to
lock their HTML subset into Shit_JIS. I extended the break to include the
other charsets because [a] it was easy to do some more, and [b] if you use
the French and German files, the umlauts, etc. look better in Unicode
fonts than in the horrible zenkaku JISASCII ones.)
Cheers
Jim
--
Jim Breen http://www.csse.monash.edu.au/~jwb/
Computer Science & Software Engineering, Tel: +61 3 9905 9554
Monash University, VIC 3800, Australia Fax: +61 3 9905 5146
(Monash Provider No. 00008C) ジム・ブリーン@モナシュ大学