
Re: [tlug] Japanese regex question



> We had a post on this kind of issue (ambiguous matches in UTF-8) a
> couple months back, too.   It's worth trying to remember this one.

I don't understand how it's possible to have an ambiguous UTF-8 match. It's
always clear which byte of a UTF-8 string is the first of each character.
It's a stateless encoding. The original poster was using iso-2022-jp where
you have to keep track of the states - what was the most recent escape
sequence? He could just convert to a stateless encoding like UTF-8 or even
EUC-JP using iconv (in C) or the Text::Iconv module in Perl, and he wouldn't
have the problem any more.
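
A minimal sketch of the difference, in Python purely for illustration (the sample string and variable names are made up here, and Python's built-in codecs stand in for iconv). It shows a byte-level search producing a false hit inside ISO-2022-JP data, and that the same cannot happen for an ASCII pattern in UTF-8:

```python
# Illustrative sample text; any Japanese string would do.
text = "日本語"

jis = text.encode("iso2022_jp")   # stateful: escape sequences switch character sets
utf8 = text.encode("utf-8")       # stateless: every byte identifies its own role

# Inside the JIS X 0208 section, kanji are stored as pairs of 7-bit bytes,
# so they look like ASCII unless the escape state is tracked.
stray = jis[3:4]                  # first byte after the 3-byte escape sequence
false_hit = stray in jis and stray.decode("ascii") not in text

# In UTF-8 every byte of a non-ASCII character has its high bit set,
# so an ASCII pattern can never match inside one.
utf8_safe = all(b >= 0x80 for b in utf8)

# The conversion itself is a decode followed by an encode.
converted = jis.decode("iso2022_jp").encode("utf-8")

print(false_hit, utf8_safe, converted == utf8)
```

Once the data is converted, a plain byte or character search no longer needs to know what escape sequence came before.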

http://www.cl.cam.ac.uk/~mgk25/unicode.html#utf-8

should explain why it's virtually impossible to miss where a Unicode
character starts.
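
As a rough sketch of that property (Python again, for illustration only; the function name char_starts is hypothetical): in valid UTF-8, continuation bytes always have the bit pattern 10xxxxxx, so any other byte is the start of a character, and this can be decided one byte at a time with no state.

```python
def char_starts(data):
    """Return the offsets of bytes that begin a character in valid UTF-8 data."""
    # Continuation bytes look like 10xxxxxx; every other byte starts a character.
    return [i for i, b in enumerate(data) if (b & 0xC0) != 0x80]

# One-, two- and three-byte characters in a row.
sample = "aé日".encode("utf-8")

print(char_starts(sample))  # [0, 1, 3]
```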

> This is all so stupid.  XEmacs has been doing this (badly) for almost
> a decade, Mule for another 3 or 4 years longer than that.  Why Perl
> and Python failed to seize the opportunity to do it right when they
> added Unicode support I'll never know.

Mule didn't do Unicode for a long time; ten years ago there was certainly
no support for it at all. I was corresponding with the author of Mule,
Mr Handa, about eight years ago, and he had no support at all for Unicode
then. He was using his own internal coding called "emacs-mule" to get all
the different Chinese, Japanese and Korean character sets working. It still
is not working perfectly, unfortunately. On the other hand, Perl 5.8's
Unicode support is done exactly right as far as I can see. Put "use utf8"
in a script (strictly, it declares that the script itself is written in
UTF-8) and regular expressions, strings, etc. all work on whole characters,
as if one Unicode character were exactly equivalent to one byte. The
developers are going to make even "use utf8" unnecessary, it's said. I've
no idea about Python. The
only tricky thing with Perl is input and output: it's necessary to specify
that an input or output file is UTF-8 using binmode (with the ":utf8"
layer), otherwise it gets treated as bytes. However, if you think about it,
that is necessary, since there is no guarantee that every file read or
written is in UTF-8, and forcing everything to be read as UTF-8 would cause
errors on binary files that are not UTF-8.
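
The same bytes-versus-characters distinction can be sketched outside Perl. The following is an analogy in Python, not Perl's binmode itself; the file name and sample text are made up. The file is read once as raw bytes and once with an explicit UTF-8 decoding step, and then arbitrary binary data is forced through a UTF-8 decoder to show why that cannot be the default.

```python
import os
import tempfile

text = "日本語"  # illustrative sample: three characters, nine bytes in UTF-8

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.txt")  # hypothetical file name
    with open(path, "w", encoding="utf-8") as f:
        f.write(text)

    # No encoding given: the data comes back as raw bytes.
    with open(path, "rb") as f:
        raw = f.read()

    # Encoding declared: the data comes back as characters.
    with open(path, "r", encoding="utf-8") as f:
        decoded = f.read()

print(len(raw), len(decoded))

# Forcing UTF-8 onto arbitrary binary data fails: 0xFF never occurs in UTF-8.
try:
    b"\xff\xfe\x00".decode("utf-8")
    forced_ok = True
except UnicodeDecodeError:
    forced_ok = False

print(forced_ok)
```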

B. Bullock.


		