Mailing List Archive

Re: [tlug] Japanese regex question

>>>>> "Tod" == Tod McQuillin writes:

    Tod> Yeah but the regex engine doesn't know it's not ascii.

Urk.  "Unidentified unibyte ASCII-superset", if you please!

    Tod> Unless you use unicode, it will interpret the strings as
    Tod> strings of 8-bit bytes, not as non-ascii multibyte
    Tod> characters.

Nice call!  For those of you who haven't thought carefully about it
yet, those matching 4/6 and 5/7 first-nibble pairs in the ambiguous
match positions are a dead giveaway.
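
To make that concrete, here is a minimal Python sketch of one way such
pairs can arise.  It assumes Shift_JIS data and a case-insensitive
byte match, neither of which is confirmed by the thread as archived
here, and the sample characters are chosen purely for illustration.
In Shift_JIS the trail byte of a two-byte character can land in the
ASCII letter range, so case folding treats 0x41 and 0x61 as the same
letter.

```python
import re

ENCODING = "shift_jis"  # assumption: the data is in Shift_JIS

needle = "ア"    # katakana A,  bytes 0x83 0x41 (trail byte is ASCII 'A')
haystack = "ヂ"  # katakana DI, bytes 0x83 0x61 (trail byte is ASCII 'a')

raw_needle = needle.encode(ENCODING)
raw_haystack = haystack.encode(ENCODING)

# Byte-level, case-insensitive search: the engine folds 0x41 and 0x61
# together because it only sees ASCII letters, not kana.
byte_match = re.search(re.escape(raw_needle), raw_haystack, re.IGNORECASE)

# Character-level search on decoded strings: the two kana are distinct
# characters with no case relationship, so nothing matches.
char_match = re.search(re.escape(needle), haystack, re.IGNORECASE)

print(raw_needle.hex(), raw_haystack.hex())
print("byte-level match:", byte_match is not None)
print("char-level match:", char_match is not None)
```

The byte-level search reports a hit on a string that does not contain
the character at all, which is exactly the kind of false positive the
nibble pairs point to.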

We had a post on this kind of issue (ambiguous matches in UTF-8) a
couple of months back, too.  It's worth trying to remember this one.
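
For the UTF-8 case, one way ambiguity can creep in (a sketch, not a
reconstruction of that earlier post) is a character class in a byte
pattern: the engine sees a set of individual bytes, not a set of
characters.  The sample strings below are assumptions for
illustration.

```python
import re

wanted = "あい"  # the characters we mean to match (U+3042, U+3044)
text = "え"      # U+3048, which is NOT in the wanted set

raw_text = text.encode("utf-8")  # bytes e3 81 88

# Byte pattern: the class now contains the bytes e3 81 82 e3 81 84,
# so it matches any single one of those bytes.
raw_class = b"[" + wanted.encode("utf-8") + b"]"
byte_hit = re.search(raw_class, raw_text)

# Decode first, then match: the class contains two characters.
char_hit = re.search("[" + wanted + "]", raw_text.decode("utf-8"))

print("byte-level hit:", byte_hit.span() if byte_hit else None)
print("char-level hit:", char_hit)
```

The byte pattern hits the shared lead byte 0xE3 of a character that is
not in the class, while the decoded match correctly finds nothing.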

    Tod> Probably the only proper way to do this is to convert
    Tod> everything to unicode first.

This is all so stupid.  XEmacs has been doing this (badly) for almost
a decade, and Mule for 3 or 4 years longer than that.  Why Perl and
Python failed to seize the opportunity to do it right when they added
Unicode support, I'll never know.

School of Systems and Information Engineering
University of Tsukuba                    Tennodai 1-1-1 Tsukuba 305-8573 JAPAN
               Ask not how you can "do" free software business;
              ask what your business can "do for" free software.
