Re: [tlug] unicode and Perl- how to pass command line unicode arguments

Date: Wed, 15 Feb 2006 18:11:17 +0900
From: David Riggs <dariggs@example.com>
Subject: Re: [tlug] unicode and Perl- how to pass command line unicode arguments

Neil Bortnak said, about the perl invocation argument -C:
 >You missed A. IMHO, you should just use -C127 (enables all of the above)
 >in a kanji/unicode heavy program because it simply makes everything
 >unicode aware (except for unicode in the script, for which you still
 >need the utf pragma) and that will cut down on accidental encoding
 >problems.

Yes, thanks, I did miss that, and -CSioA works well.
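
For anyone who would rather not depend on the command line, here is a
minimal sketch of roughly the same setup done inside the script. It is an
approximation, not an exact equivalent of -CSioA; the helper name
decode_args is made up for illustration, and it assumes the arguments
arrive as UTF-8.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Roughly -CSio: UTF-8 layers on STDIN/STDOUT/STDERR and on handles
# opened later in this lexical scope.
use open qw(:std :encoding(UTF-8));

# Roughly -CA: turn raw UTF-8 bytes from the command line into characters.
# decode_args is a hypothetical helper name, not part of any module.
sub decode_args {
    return map { decode('UTF-8', $_) } @_;
}

my @args = decode_args(@ARGV);
printf "%d argument(s) decoded\n", scalar @args;
```

After this, length() on each argument counts characters rather than bytes.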

And also:
 >s/日本語/英語/  m/日本語/
 >seem to work fine for me in the middle of the program. I'm using "use
 >utf8;" as per normal, so I'm in a bit of wonderment as to why it 
 >doesn't work for you. I

They do work just fine, for MANY cases. But I think that perl is 
actually doing a byte-level comparison/replace, and the above strings 
would work just fine as bytes (assuming your script and data are in the 
same encoding). But even at this level there are still problems. As I 
mentioned earlier, if I try to match a ☆ (white star: U+2606, UTF-8 
bytes E2 98 86):

if (/^☆.*tw:(.).*jp:(.)/)

It just never works. But if I assign a star to a variable, either in the 
script or from the command line, and use that, it works fine. That 
really bothers me.
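
Here is a minimal sketch of one plausible cause. This is an assumption, not
a diagnosis of the script above: if the pattern holds the star as three
separate bytes while the data has been decoded to characters, the two can
never match. The sample line and the names are made up for illustration.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical sample line; the real data is not shown in this thread.
my $chars = "\x{2606} tw:\x{81FA} jp:\x{53F0}";  # decoded characters
my $bytes = encode('UTF-8', $chars);             # same text as raw UTF-8 bytes

my $star_as_bytes = "\xE2\x98\x86";  # the star as three undecoded bytes
my $star_as_char  = "\x{2606}";      # the star as one character

# Returns 1 if $text starts with $star, else 0.
sub starts_with_star {
    my ($text, $star) = @_;
    return $text =~ /^$star/ ? 1 : 0;
}

printf "byte star vs character data: %d\n",
    starts_with_star($chars, $star_as_bytes);
printf "char star vs character data: %d\n",
    starts_with_star($chars, $star_as_char);
printf "byte star vs byte data:      %d\n",
    starts_with_star($bytes, $star_as_bytes);

# The (.) captures differ too: one character versus one byte.
my ($tw_char) = $chars =~ /tw:(.)/;
my ($tw_byte) = $bytes =~ /tw:(.)/;
printf "captured U+%04X from characters, byte 0x%02X from bytes\n",
    ord($tw_char), ord($tw_byte);
```

Only the mixed case fails, which is why a pattern can work for a long time
and then break as soon as one side of the comparison gets decoded.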

The real problem comes when you use tr/// or more complex character 
classes, alternations and such in a regex: it all breaks down unless 
perl is really treating the strings as unicode characters.
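
A minimal sketch of the tr/// case, using a made-up four-character sample.
In character mode tr counts the two punctuation marks. In byte mode the
same two marks, written out as their UTF-8 bytes, turn into a list of
unrelated single bytes, so the count is meaningless.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical sample: hiragana A, ideographic comma, hiragana I,
# ideographic full stop.
my $chars = "\x{3042}\x{3001}\x{3044}\x{3002}";
my $bytes = encode('UTF-8', $chars);

# Character mode: count the two punctuation characters.
my $char_count = ($chars =~ tr/\x{3001}\x{3002}//);

# Byte mode: the same two marks as UTF-8 bytes (E3 80 81 and E3 80 82).
# tr sees a list of single bytes, which also occur inside the kana.
my $byte_count = ($bytes =~ tr/\xE3\x80\x81\xE3\x80\x82//);

print "character mode counted $char_count, byte mode counted $byte_count\n";
```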

I wrote a whole search routine that used a character class to skip over 
punctuation, and it was actually running in byte mode. I never realized 
it until I started to get false misses and such, and finally saw that 
perl was just munching bytes. It was separately skipping over each of 
the three bytes of a unicode space character inside the character class.
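
A minimal sketch of that failure, again with made-up sample data. The class
is meant to skip the ideographic space (U+3000) and comma (U+3001). In byte
mode it becomes a class of the bytes E3, 80 and 81, which also eats pieces
of the neighboring kana.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical sample: hiragana A, ideographic space, hiragana I.
my $chars = "\x{3042}\x{3000}\x{3044}";
my $bytes = encode('UTF-8', $chars);

# Show each element of a string as four hex digits.
sub hex_dump {
    my ($s) = @_;
    return join ' ', map { sprintf('%04X', ord($_)) } split(//, $s);
}

# Character mode: the class holds two characters.
(my $clean_chars = $chars) =~ s/[\x{3000}\x{3001}]+//g;

# Byte mode: the same class spelled as UTF-8 bytes holds E3, 80 and 81.
(my $clean_bytes = $bytes) =~ s/[\xE3\x80\x80\xE3\x80\x81]+//g;

print "character mode leaves: ", hex_dump($clean_chars), "\n";
print "byte mode leaves:      ", hex_dump($clean_bytes), "\n";
```

The character version leaves the two kana intact; the byte version leaves
only the trailing bytes of each one, which is no longer valid UTF-8.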

And of course the traditional tools like tr and grep seem to work fine 
with unicode. But the results are wrong: they are just doing bytes 
(as Steven T pointed out to us some time ago).

(Sorry, you probably already know all this...)

Thanks for the tip about -CA  (kinda wishing I were back in CA myself 
with this weather.)


David Riggs
