
Re: [tlug] unicode and Perl- how to pass command line unicode arguments
- Date: Wed, 15 Feb 2006 18:11:17 +0900
- From: David Riggs <dariggs@example.com>
- Subject: Re: [tlug] unicode and Perl- how to pass command line unicode arguments
Neil Bortnak said, about the perl invocation argument -C:
>You missed A. IMHO, you should just use -C127 (enables all of the above)
>in a kanji/unicode heavy program because it simply makes everything
>unicode aware (except for unicode in the script, for which you still
>need the utf pragma) and that will cut down on accidental encoding
>problems.
Yes, thanks, I did miss that, and -CSioA works well.
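For anyone who cannot add -CA to the invocation, here is a minimal sketch of decoding the arguments by hand with the core Encode module. It assumes the shell passes the arguments as UTF-8; decode_args is a made-up helper name, not anything from Neil's post.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: turn raw command-line bytes into character strings.
# Assumes the terminal/locale delivers arguments encoded as UTF-8.
sub decode_args {
    my @raw = @_;
    return map { decode('UTF-8', $_) } @raw;
}

my @args = decode_args(@ARGV);

# length() now counts characters rather than bytes for each argument.
printf "%d argument(s), lengths: %s\n",
    scalar @args, join(',', map { length } @args);
```

With this in place an argument such as 日本語 should report a length of 3 rather than 9.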
And also:
>s/日本語/英語/ m/日本語/
>seem to work fine for me in the middle of the program. I'm using "use
>utf8;" as per normal, so I'm in a bit of wonderment as to why it
>doesn't work for you. I
They do work just fine, for MANY cases. But I think that perl is
actually doing a byte-level comparison/replace, and the above strings
would work just fine as bytes (assuming your script and data are in the
same encoding). But even at this level there are still problems. As I
mentioned earlier, if I try to match a ☆ (star: U+2606, UTF-8 bytes E2 98 86)
if (/^☆.*tw:(.).*jp:(.)/)
It just never works. But if I assign a star to a variable, either in the
script or from the command line, and use that, it works fine. That
really bothers me.
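A sketch of one possible explanation, not a diagnosis of my actual script: the pattern and the data have to be in the same form, either both bytes or both characters. Below the star is written as \x{2606} so the example does not depend on how the script file is saved. The sample line and the helper name match_star_line are invented for illustration.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical sample line as raw UTF-8 bytes: "☆ tw:國 jp:国"
my $raw = "\xE2\x98\x86 tw:\xE5\x9C\x8B jp:\xE5\x9B\xBD";

# Returns the two captured characters, or an empty list when nothing matches.
sub match_star_line {
    my ($line) = @_;
    return $line =~ /^\x{2606}.*tw:(.).*jp:(.)/ ? ($1, $2) : ();
}

my @from_bytes = match_star_line($raw);                   # undecoded bytes
my @from_chars = match_star_line(decode('UTF-8', $raw));  # decoded characters

printf "bytes: %d captures, chars: %d captures\n",
    scalar @from_bytes, scalar @from_chars;
```

The undecoded string gives no captures because its first "character" is the single byte E2, not U+2606. The decoded string matches, and each (.) captures one whole character.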
And the real problem comes when you try to do tr/// or more complex
character classes, alternations and such in the regex: it all breaks down
unless you are really doing unicode.
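To make the tr/// point concrete, here is a small sketch. The full-width digit example is my own choice for illustration, and fullwidth_to_ascii is a hypothetical helper. The same tr/// is applied to undecoded bytes and to a decoded string.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: map full-width digits (U+FF10..U+FF19) to ASCII 0-9.
# Returns the converted copy and the number of characters translated.
sub fullwidth_to_ascii {
    my ($s) = @_;
    my $count = ($s =~ tr/\x{FF10}-\x{FF19}/0-9/);
    return ($s, $count);
}

my $raw = "\xEF\xBC\x91\xEF\xBC\x92\xEF\xBC\x93";   # UTF-8 bytes of "１２３"

my ($bytes_out, $bytes_n) = fullwidth_to_ascii($raw);
my ($chars_out, $chars_n) = fullwidth_to_ascii(decode('UTF-8', $raw));

print "undecoded: $bytes_n translated; decoded: $chars_n translated ($chars_out)\n";
```

On the undecoded string tr/// finds nothing to translate, since it only sees the bytes EF, BC, 91 and so on. On the decoded string all three digits are converted.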
I wrote a whole search routine with a character class for skipping over
punctuation, and it was actually running in byte mode. I never realized
it until I started to get false misses and such, and finally realized
that perl was just munching bytes. It was separately skipping over each
of the three bytes of a unicode space character inside the character class.
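Here is a sketch of how I understand that failure, with made-up sample data. Without decoding, a class meant to hold the ideographic space and two punctuation marks (U+3000, U+3001, U+3002) is really a class of the individual bytes E3, 80, 81 and 82, so it also eats pieces of ordinary kana. The two helper names are hypothetical.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Byte mode: the class was meant to hold U+3000, U+3001, U+3002, but written
# as UTF-8 bytes it is really the set of single bytes E3, 80, 81, 82.
sub strip_punct_bytes {
    my ($s) = @_;
    $s =~ s/[\xE3\x80\x81\x82]+//g;
    return $s;
}

# Character mode: the class holds three whole characters.
sub strip_punct_chars {
    my ($s) = @_;
    $s =~ s/[\x{3000}\x{3001}\x{3002}]+//g;
    return $s;
}

my $raw = "\xE3\x81\x82\xE3\x80\x80\xE3\x81\x84";   # UTF-8 bytes of "あ　い"

my $bytes_left = strip_punct_bytes($raw);
my $chars_left = strip_punct_chars(decode('UTF-8', $raw));

printf "byte mode leaves %d byte(s); character mode leaves %d character(s)\n",
    length($bytes_left), length($chars_left);
```

In byte mode あ (E3 81 82) disappears completely and い (E3 81 84) is cut down to a stray 84 byte. In character mode only the space is removed and both kana survive.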
And of course the traditional tools like tr and grep seem to work fine
with unicode. But the results are wrong: they are just doing bytes
(as Steven T pointed out to us some time ago).
(Sorry, you probably already know all this...)
Thanks for the tip about -CA (kinda wishing I were back in CA myself
with this weather.)
David Riggs