
Re: [tlug] Japanese regex question
>>>>> "Tod" == Tod McQuillin <devin@example.com> writes:
Tod> Yeah but the regex engine doesn't know it's not ascii.
Urk. "Unidentified unibyte ASCII-superset", if you please!
Tod> Unless you use unicode, it will interpret the strings as
Tod> strings of 8-bit bytes, not as non-ascii multibyte
Tod> characters.
Nice call! For those of you who haven't thought carefully about it
yet, those matching 4/6 and 5/7 first-nibble pairs in the ambiguous
match positions are a dead giveaway.
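
For anyone who wants to see the effect rather than take it on faith, here is a minimal sketch in Python. It assumes EUC-JP as the encoding, and the sample strings are mine, not from the original question. The needle does not occur in the text as a character, but its two bytes do occur straddling a character boundary, so the byte-oriented search reports a hit.

```python
import re

# Assumed sample data, chosen for illustration only.
text = "イう"    # katakana I followed by hiragana U
needle = "い"    # hiragana I, which does not appear in text

raw_text = text.encode("euc_jp")      # bytes a5 a4 a4 a6
raw_needle = needle.encode("euc_jp")  # bytes a4 a4

# Byte-oriented search: the engine sees only a run of 8-bit bytes, so it
# matches the second byte of the first character plus the first byte of
# the second character.
byte_match = re.search(re.escape(raw_needle), raw_text)

# Character-oriented search on the decoded strings finds nothing.
char_match = re.search(re.escape(needle), text)

false_hit_offset = byte_match.start() if byte_match else None
```

Here false_hit_offset is an odd byte offset, which cannot be the start of a character in a string made entirely of two-byte characters. That is the tell.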
We had a post on this kind of issue (ambiguous matches in UTF-8) a
couple months back, too. It's worth trying to remember this one.
Tod> Probably the only proper way to do this is to convert
Tod> everything to unicode first.
This is all so stupid. XEmacs has been doing this (badly) for almost
a decade, Mule for another 3 or 4 years longer than that. Why Perl
and Python failed to seize the opportunity to do it right when they
added Unicode support I'll never know.
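
Tod's "convert everything first" approach might look something like the following in Python. This is a sketch under stated assumptions, not a recommendation of any particular API: search_decoded is a hypothetical helper name, the pattern is treated as a literal string rather than a regex, and the caller is assumed to know the encoding of both inputs.

```python
import re

def search_decoded(pattern: bytes, data: bytes, encoding: str):
    """Hypothetical helper: decode both inputs, then match on characters.

    The pattern is escaped, so it is matched literally. Offsets in the
    returned match object count characters, not bytes.
    """
    return re.search(re.escape(pattern.decode(encoding)),
                     data.decode(encoding))

# Assumed sample data in EUC-JP: katakana I followed by hiragana U.
data = "イう".encode("euc_jp")

# Hiragana I shares bytes with the data but is not in it as a character.
miss = search_decoded("い".encode("euc_jp"), data, "euc_jp")

# Hiragana U really is there, as the second character.
hit = search_decoded("う".encode("euc_jp"), data, "euc_jp")
hit_offset = hit.start() if hit else None
```

Note that the offset that comes back is a character index. If you need to go back to the original byte buffer you have to re-encode the prefix and take its length, which is part of why doing this by hand is such a nuisance.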
--
School of Systems and Information Engineering http://turnbull.sk.tsukuba.ac.jp
University of Tsukuba Tennodai 1-1-1 Tsukuba 305-8573 JAPAN
Ask not how you can "do" free software business;
ask what your business can "do for" free software.