


Re: [tlug] Japanese regex question



On Thu, 25 Aug 2005, Jonathan Byrne wrote:

> On Thursday 25 August 2005 02:40, Tod McQuillin wrote:
>
>> Just a guess -- have you given the 'i' flag (case insensitivity)
>> somehow?
>
> Actually, now that you mention it, yes.  [...]
>
> Not sure if this is our problem, b/c there was no ASCII involved in the
> strings that were matched, but I'll look into it.

Yeah, but the regex engine doesn't know it's not ASCII.  Unless you tell
it the data is Unicode, it will treat the strings as sequences of 8-bit
bytes, not as multibyte non-ASCII characters.

Which means that if the encoded bytes happen to include values that are
upper- or lowercase letters when read bytewise as ASCII, the 'i' flag
will fold their case, and you lose: two different characters can end up
matching each other.

Even though, as you say, there was no ASCII text in your strings, there
were in fact bytes equal to ASCII 'j' and 'J' in there, because the
encoding dictated it.
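
To make that concrete, here is a minimal Python sketch.  The thread
doesn't say which encoding or language was in use, so Shift_JIS and the
two katakana below are assumptions picked only for illustration.  In
Shift_JIS, ア is the bytes 0x83 0x41 and ヂ is 0x83 0x61, so their second
bytes are ASCII 'A' and 'a'.

```python
import re

# Assumption: the data is Shift_JIS; the thread does not name the encoding.
a = "ア".encode("shift_jis")    # b'\x83A': second byte is ASCII 'A'
di = "ヂ".encode("shift_jis")   # b'\x83a': second byte is ASCII 'a'

# Bytewise matching: the engine sees two bytes, not one character.
exact = re.search(re.escape(a), di)                   # no match
folded = re.search(re.escape(a), di, re.IGNORECASE)   # 'A' folds to 'a'

print(a, di)
print("exact match: ", exact is not None)
print("folded match:", folded is not None)
```

Without the flag the two characters stay distinct; with it, ア matches ヂ
purely because of how their bytes happen to look as ASCII.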

Probably the only proper way to do this is to convert everything to
Unicode first, so that the regex engine matches characters, not bytes.
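
A rough sketch of that conversion in Python, assuming the input arrives
as Shift_JIS bytes (a hypothetical choice; substitute whatever encoding
the data really uses).  The bytes are decoded to Unicode strings before
matching, so case folding applies to whole characters rather than to
stray bytes.

```python
import re

# Assumption: raw input is Shift_JIS bytes; the real encoding may differ.
raw_pattern = b"\x83A"   # ア in Shift_JIS
raw_text = b"\x83a"      # ヂ in Shift_JIS

# Decode first, then match on Unicode strings.
pattern = raw_pattern.decode("shift_jis")
text = raw_text.decode("shift_jis")

# On str patterns, IGNORECASE folds characters, so ア does not match ヂ.
false_hit = re.search(re.escape(pattern), text, re.IGNORECASE)

# A genuine occurrence of ア is still found.
true_hit = re.search(re.escape(pattern), "アイウ", re.IGNORECASE)

print(false_hit is None, true_hit is not None)
```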
-- 
Tod

