Mailing List Archive



Re: [tlug] Japanese regex question



Stephen J. Turnbull wrote:
>>>>>> "Tod" == Tod McQuillin <devin@example.com> writes:
> 
>     Tod> Yeah but the regex engine doesn't know it's not ascii.
> 
> Urk.  "Unidentified unibyte ASCII-superset", if you please!
> 
>     Tod> Unless you use unicode, it will interpret the strings as
>     Tod> strings of 8-bit bytes, not as non-ascii multibyte
>     Tod> characters.
> 
> Nice call!  For those of you who haven't thought carefully about it
> yet, those matching 4/6 and 5/7 first-nibble pairs in the ambiguous
> match positions are a dead giveaway.
> 
> We had a post on this kind of issue (ambiguous matches in UTF-8) a
> couple months back, too.   It's worth trying to remember this one.
> 
>     Tod> Probably the only proper way to do this is to convert
>     Tod> everything to unicode first.
> 
> This is all so stupid.  XEmacs has been doing this (badly) for almost
> a decade, Mule for another 3 or 4 years longer than that.  Why Perl
> and Python failed to seize the opportunity to do it right when they
> added Unicode support I'll never know.
> 

Sorry to jump in so late...

Could you please describe what Python is doing wrong regarding
Unicode?
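
For context, here is a minimal sketch of the byte-level pitfall Tod describes in the quoted text. It is only an illustration under my own assumptions: I picked Shift_JIS as the encoding and a case-insensitive search as the trigger (neither is stated in the thread), and the helper names bytes_search and text_search are hypothetical. In Shift_JIS the second byte of a character can fall in the ASCII letter range, so an engine that only sees 8-bit bytes may case-fold it and confuse two different characters.

```python
import re

# Assumption: the data is Shift_JIS. Any legacy multibyte encoding whose
# trail bytes overlap the ASCII letters would show the same effect.
ENCODING = "shift_jis"

needle = "\u30a2"    # KATAKANA LETTER A,  Shift_JIS bytes 0x83 0x41 ("A")
haystack = "\u30c2"  # KATAKANA LETTER DI, Shift_JIS bytes 0x83 0x61 ("a")


def bytes_search(needle, haystack, encoding=ENCODING):
    """Case-insensitive search on the encoded bytes (what a byte-only engine sees)."""
    pattern = re.compile(re.escape(needle.encode(encoding)), re.IGNORECASE)
    return pattern.search(haystack.encode(encoding)) is not None


def text_search(needle, haystack):
    """The same search on decoded text, i.e. after converting to Unicode first."""
    pattern = re.compile(re.escape(needle), re.IGNORECASE)
    return pattern.search(haystack) is not None


byte_hit = bytes_search(needle, haystack)  # trail bytes 0x41/0x61 fold together
text_hit = text_search(needle, haystack)   # katakana has no case, so no match

print("byte-level match:", byte_hit)
print("text-level match:", text_hit)
```

The byte-level search reports a match between two different katakana, while the search on decoded strings does not, which is the argument for converting everything to Unicode before matching.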

thanks,
gabor

-- 
Flexibility is overrated
Constraints are liberating
-- David Heinemeier Hansson, Secrets behind Ruby on Rails

