
Mailing List Archive
Re: [tlug] Japanese regex question
- Date: Sun, 01 Jan 2006 23:30:30 +0100
- From: Gábor Farkas <gabor@example.com>
- Subject: Re: [tlug] Japanese regex question
- References: <200508241701.55144.jq@example.com> <20050825183913.O88704@example.com> <200508251253.47083.jq@example.com> <20050826113217.J88704@example.com> <87zmr2me23.fsf@example.com>
- User-agent: Thunderbird 1.5 (Macintosh/20051201)
Stephen J. Turnbull wrote:
>>>>>> "Tod" == Tod McQuillin <devin@example.com> writes:
>
> Tod> Yeah but the regex engine doesn't know it's not ascii.
>
> Urk. "Unidentified unibyte ASCII-superset", if you please!
>
> Tod> Unless you use unicode, it will interpret the strings as
> Tod> strings of 8-bit bytes, not as non-ascii multibyte
> Tod> characters.
>
> Nice call! For those of you who haven't thought carefully about it
> yet, those matching 4/6 and 5/7 first-nibble pairs in the ambiguous
> match positions are a dead giveaway.
>
> We had a post on this kind of issue (ambiguous matches in UTF-8) a
> couple months back, too. It's worth trying to remember this one.
>
> Tod> Probably the only proper way to do this is to convert
> Tod> everything to unicode first.
>
> This is all so stupid. XEmacs has been doing this (badly) for almost
> a decade, Mule for another 3 or 4 years longer than that. Why Perl
> and Python failed to seize the opportunity to do it right when they
> added Unicode support I'll never know.
>
sorry to jump in so late...
could you please describe to me what Python is doing wrong regarding
Unicode?
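
For reference, the bytes-versus-characters problem Tod describes above can
be sketched roughly as follows. This is only an illustration, not code from
the thread: it assumes UTF-8 data and present-day Python 3 syntax, and the
sample characters are picked arbitrarily.

```python
import re

# Illustrative data only (not from the thread).
text = "い"        # U+3044, UTF-8 bytes E3 81 84
pattern = "[あ]"   # U+3042, UTF-8 bytes E3 81 82

# Byte-level matching: once encoded, the character class no longer means
# "the character あ" but "any one of the bytes E3, 81 or 82".
byte_match = re.search(pattern.encode("utf-8"), text.encode("utf-8"))

# Character-level matching: the class contains the single character あ.
char_match = re.search(pattern, text)

print(byte_match is not None)  # True: the lead byte E3 of い is in the class
print(char_match is not None)  # False: い is not あ
```

The byte-level search reports a match even though the text does not
contain the character in the class, which is the kind of ambiguous match
mentioned in the quoted messages. Matching on decoded strings avoids it.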
thanks,
gabor
--
Flexibility is overrated
Constraints are liberating
-- David Heinemeier Hansson, Secrets behind Ruby on Rails