Mailing List Archive

Re: [tlug] Japanese regex question

----- Original Message ----- 
From: "Stephen J. Turnbull" <>
To: <>
Sent: Monday, August 29, 2005 3:15 PM
Subject: Re: [tlug] Japanese regex question

>>>>>> "Botond" == Botond Botyanszki <> writes:
>    Botond> I had the impression while coding in perl that it was
>    Botond> handling text in unicode. And it seems to be the case
>    Botond> according to the FAQ at
>    Botond>
> Could very well be.  I haven't done anything in Perl since
> Hanshin-Awajishima Daishinsai (ie, about Feb 1 1995), so I don't
> know.  However, perusing that FAQ suggests to me that the default is
> unspecified unibyte ASCII superset, not UTF-8.  If you want to treat
> the strings as Unicode you need to use special functions.  It looked
> like you need to enable locale support rather than having it done
> automatically.  Etc, etc.

I've been reading this discussion and wondering whether or not to reply. I 
wrote a reply to another message of yours yesterday but decided not to send 
it; now I've changed my mind, and I'll send another response as well. 
Since this might just be useful for someone, let's point out how hard it is 
to use utf-8 in Perl.

To get Perl to use UTF-8, try

use utf8;

Then each Unicode character in your source counts as a single character, 
just as an ascii character does. That's all you need to make, for example, 
"." in a regular expression match any Unicode character, or to use UTF-8 
variable names in your code, or to make

length ("馬鹿") == 2;

rather than 4 or 6, etc. In future versions of Perl, "use utf8;" is 
going to become a non-functioning command and utf8 will be switched on by 
default.

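To make that concrete, here is a minimal sketch of the character-versus-byte 
difference. It assumes the script itself is saved as UTF-8; the variable 
names are only for illustration, and utf8::encode is used purely to count 
the bytes for comparison.

```perl
use strict;
use warnings;
use utf8;    # assumption: this source file is saved as UTF-8

my $word = "馬鹿";

# With "use utf8" the literal holds two characters, not six bytes.
my $chars = length $word;

# "." matches one whole character, so two dots cover the whole word.
my $two_dots = ($word =~ /^..$/) ? 1 : 0;

# For comparison, encode a copy into UTF-8 bytes and count those.
my $bytes = do { my $copy = $word; utf8::encode($copy); length $copy };

print "characters: $chars, bytes: $bytes, two dots match: $two_dots\n";
```

Without the "use utf8;" line the same literal would be read as a string of 
bytes, which is where the larger length comes from.
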
The only thing this does not do is turn on input and output to files in 
utf-8. To get Perl to understand that a file is in UTF-8 format, one has to 
use

binmode FILE, ":utf8";

Note that "binmode" is the Perl command which switches a filehandle between 
"text" mode and binary mode. The "text" mode is necessary for things like 
getting the right newline/carriage return conversion for text input and 
output, depending on whether we're on Unix, DOS, and so on. One uses

binmode FILE, ":raw";

to read in raw bytes without this conversion.
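
As a rough sketch of how the two layers differ, the following writes two 
characters through a UTF-8 layer and reads them back both ways. The file 
name is made up for the example, lexical filehandles stand in for the 
bareword FILE above, and the layer is spelled ":utf8", which is the name 
PerlIO accepts.

```perl
use strict;
use warnings;
use utf8;    # assumption: this source file is saved as UTF-8

my $path = "utf8_demo.txt";    # hypothetical scratch file

# Write two characters through a UTF-8 layer.
open(my $out, ">", $path) or die "cannot write $path: $!";
binmode $out, ":utf8";
print {$out} "馬鹿";
close $out;

# Read the file back as raw bytes, with no decoding at all.
open(my $raw, "<", $path) or die "cannot read $path: $!";
binmode $raw, ":raw";
my $bytes = do { local $/; <$raw> };
close $raw;

# Read the file back through a UTF-8 layer, so bytes become characters.
open(my $in, "<", $path) or die "cannot read $path: $!";
binmode $in, ":utf8";
my $text = do { local $/; <$in> };
close $in;

unlink $path;

my ($byte_count, $char_count) = (length $bytes, length $text);
print "raw: $byte_count bytes, decoded: $char_count characters\n";
```

The raw handle should see six bytes and the ":utf8" handle two characters, 
the same split as in the length example earlier.
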

So it's actually a very sensible compromise to have a utf-8 handle, I think; 
it doesn't break legacy code.

> In other words, it looks to me like by default Perl 5.6 supported I18N
> oblivious programming, with minimal I18N being easy, but not default.

Perl is a 20-year-old programming language and it supports backward 
compatibility with old versions of itself, including a whole bunch of things 
which are now more or less superseded. I completely disagree with you; I 
think the Perl designers have got this issue right and that the Unicode 
support in Perl is excellent.
