
Re: [tlug] Japanese regex question
- Date: Sun, 01 Jan 2006 23:30:30 +0100
- From: Gábor Farkas <gabor@example.com>
- Subject: Re: [tlug] Japanese regex question
- References: <200508241701.55144.jq@example.com> <20050825183913.O88704@example.com> <200508251253.47083.jq@example.com> <20050826113217.J88704@example.com> <87zmr2me23.fsf@example.com>
- User-agent: Thunderbird 1.5 (Macintosh/20051201)

Stephen J. Turnbull wrote:
>>>>>> "Tod" == Tod McQuillin <devin@example.com> writes:
> 
>     Tod> Yeah but the regex engine doesn't know it's not ascii.
> 
> Urk.  "Unidentified unibyte ASCII-superset", if you please!
> 
>     Tod> Unless you use unicode, it will interpret the strings as
>     Tod> strings of 8-bit bytes, not as non-ascii multibyte
>     Tod> characters.
> 
> Nice call!  For those of you who haven't thought carefully about it
> yet, those matching 4/6 and 5/7 first-nibble pairs in the ambiguous
> match positions are a dead giveaway.
> 
> We had a post on this kind of issue (ambiguous matches in UTF-8) a
> couple months back, too.   It's worth trying to remember this one.
> 
>     Tod> Probably the only proper way to do this is to convert
>     Tod> everything to unicode first.
> 
> This is all so stupid.  XEmacs has been doing this (badly) for almost
> a decade, Mule for another 3 or 4 years longer than that.  Why Perl
> and Python failed to seize the opportunity to do it right when they
> added Unicode support I'll never know.
> 
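For reference, the byte-versus-character ambiguity described in the quoted text can be reproduced with a minimal sketch. It assumes Python 3 (where `bytes` and `str` are separate types, which postdates this thread) and UTF-8 input; the variable names are only illustrative. A character class built from the bytes of "あ" is really a set of three single bytes, so it also matches the lead byte of "い".

```python
import re

pattern_char = "あ"  # U+3042, UTF-8 bytes E3 81 82
target = "い"        # U+3044, UTF-8 bytes E3 81 84

# Byte-oriented matching: the class [\xe3\x81\x82] means "any one of these
# three bytes", so the shared lead byte E3 produces a false match.
byte_pattern = re.compile(b"[" + pattern_char.encode("utf-8") + b"]")
byte_match = byte_pattern.search(target.encode("utf-8"))

# Character-oriented matching: decode first, then the class holds one
# character and the two kana are correctly treated as different.
text_pattern = re.compile("[" + pattern_char + "]")
text_match = text_pattern.search(target)

print("bytes regex matched:", byte_match is not None)
print("text regex matched:", text_match is not None)
```

The byte pattern reports a match on the single byte E3, while the decoded pattern finds nothing, which is the sense in which converting everything to Unicode first avoids the problem.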
Sorry to jump in so late...
Could you please describe what Python is doing wrong regarding
Unicode?
Thanks,
gabor
-- 
Flexibility is overrated
Constraints are liberating
-- David Heinemeier Hansson, Secrets behind Ruby on Rails