Mailing List Archive



Re: [tlug] Japanese regex question



>>>>> "Tod" == Tod McQuillin <devin@example.com> writes:

    Tod> Yeah but the regex engine doesn't know it's not ascii.

Urk.  "Unidentified unibyte ASCII-superset", if you please!

    Tod> Unless you use unicode, it will interpret the strings as
    Tod> strings of 8-bit bytes, not as non-ascii multibyte
    Tod> characters.

Nice call!  For those of you who haven't thought carefully about it
yet, those matching 4/6 and 5/7 first-nibble pairs in the ambiguous
match positions are a dead giveaway.

We had a post on this kind of issue (ambiguous matches in UTF-8) a
couple of months back, too.  It's worth trying to remember this one.

    Tod> Probably the only proper way to do this is to convert
    Tod> everything to unicode first.
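
To make the point above concrete, here is a minimal sketch in Python 3.
The sample text "タイに" and the EUC-JP encoding are assumptions chosen
for illustration; they are not taken from the original question.  The
text contains no "い", yet a byte-level search finds one, because the
second byte of "イ" followed by the first byte of "に" happens to spell
the two bytes of "い".  Decoding to Unicode first makes the false match
disappear.

```python
import re

# Assumed sample data (not from the thread): "タイに" contains no "い".
text = "タイに"
needle = "い"

# Byte view: what a regex engine sees when it treats EUC-JP text as
# a plain string of 8-bit bytes.
text_b = text.encode("euc_jp")
needle_b = needle.encode("euc_jp")

# Search on raw bytes.  The match straddles two characters: it starts
# at the second byte of "イ" and ends at the first byte of "に".
byte_match = re.search(re.escape(needle_b), text_b)
byte_offset = byte_match.start() if byte_match else None

# Character view: decode to Unicode first, then search on str.
char_match = re.search(re.escape(needle), text_b.decode("euc_jp"))

print("text bytes:  ", text_b.hex())
print("needle bytes:", needle_b.hex())
print("byte-level match at offset:", byte_offset)
print("character-level match:", char_match)
```

The byte-level hit lands at an odd offset, in the middle of a two-byte
character, which is a quick way to spot this kind of false positive.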

This is all so stupid.  XEmacs has been doing this (badly) for almost
a decade, Mule for another 3 or 4 years longer than that.  Why Perl
and Python failed to seize the opportunity to do it right when they
added Unicode support I'll never know.

-- 
School of Systems and Information Engineering http://turnbull.sk.tsukuba.ac.jp
University of Tsukuba                    Tennodai 1-1-1 Tsukuba 305-8573 JAPAN
               Ask not how you can "do" free software business;
              ask what your business can "do for" free software.

