Re: [tlug] Japanese regex question




----- Original Message ----- 
From: "Josh Glover" <jmglov@example.com>
To: <tlug@example.com>
Sent: Tuesday, August 30, 2005 2:38 AM
Subject: Re: [tlug] Japanese regex question


> On 8/29/05, Stephen J. Turnbull <stephen@example.com> wrote:
>
>> >>>>> "Ben" == Ben K Bullock <benkasminbullock@example.com> writes:
>>
>>     Ben> So it's actually a very sensible compromise to have a utf-8
>>     Ben> handle, I think; it doesn't break legacy code.
>>
>> Please note that my proposal also does not turn the existing breakage
>> in legacy code into a showstopper; it simply requires the programmer
>> or user to admit that the legacy code is broken by _explicitly_ using
>> a backward compatibility option.
>
> Your proposal is excellent, Stephen, and you will be happy to know
> that is how things work (or rather, will work) in Perl 6. There is a
> 'use Perl5' pragma that causes Just-In-Time compilation of Perl 5
> source to Parrot bytecode.
>
> What Ben says about the 'use utf-8' pragma is true of Perl 5.6, but
> not of Perl 5.8, which is the first Perl version to use Unicode
> internally for *all strings*. The Perl documentation agrees with
> Stephen that 'use utf-8' was a poor design choice:
>
> http://search.cpan.org/~jhi/perl-5.8.0/pod/perluniintro.pod#Perl's_Unicode_Support

Thanks for that link. I read it carefully, but I can't see any disagreement 
between it and what I said:

>>>>>>>>>>>>>>>> Begin quote of me

To get Perl to use UTF-8, try

use utf8;

Then each Unicode character is exactly equivalent to an ascii character for
every purpose. That's all you need to make, for example "." in a regular
expression match all Unicode characters, or to use UTF8 variable names in
your code, or to make

length ("馬鹿") == 2;

rather than 4 or 6, etc. etc. In future versions of Perl, "use utf8;" is
going to become a non-functioning command and utf8 will be switched on by
default.

<<<<<<<<<<<<< End quote of me.

I'm fairly sure this says the same thing as your above page, just from a 
slightly different perspective. The perspective of the web page is someone 
talking about the internals of Perl, and my perspective is someone using 
Perl. Let's try to demonstrate this with a small example program:

>>>>>>>>>>>>>>>>>> Please save this as utf-8 or it won't work.

#!/usr/bin/perl

use warnings;
use strict;

use utf8;
binmode STDOUT, ":utf8";

my $ushi = "馬鹿";
print "$ushi\n" if ($ushi =~ /^..$/);

#my $牛 = "馬鹿";
#print "$牛\n" if ($牛 =~ /../);

<<<<<<<<<<<<<<<<<<<< End of example program.

If you really think the "use utf8;" pragma is unnecessary for practical 
programming, I'd invite you to comment out the "use utf8;" line and see 
what Perl (my version is 5.8.6) actually does. If you are feeling very 
bold, you could also uncomment the two commented-out "my $牛" lines while 
"use utf8;" is still commented out. It might also be interesting to try 
reading from or writing to a UTF-8 file without the "binmode".
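
For anyone who would rather not fiddle with editor encodings, here is a 
minimal sketch of the same byte-versus-character difference. It writes the 
UTF-8 bytes of "馬鹿" as hex escapes, so the file's own encoding does not 
matter. The names $bytes and $chars are just illustrative; the undecoded 
string only roughly stands in for what the literal looks like to Perl when 
"use utf8;" is commented out.

```perl
#!/usr/bin/perl

use warnings;
use strict;

# The six UTF-8 bytes of the two characters U+99AC U+9E7F,
# written as escapes so that this file is plain ASCII.
my $bytes = "\xe9\xa6\xac\xe9\xb9\xbf";

# Decode a copy in place, so that Perl treats it as two characters.
# (utf8::decode is built in; it does not require "use utf8;".)
my $chars = $bytes;
utf8::decode($chars);

printf "undecoded: length %d\n", length $bytes;   # 6
printf "decoded:   length %d\n", length $chars;   # 2

# "." matches one byte of the undecoded string but one whole
# character of the decoded one, so only the latter is "two dots" long.
print "decoded string matches /^..\$/\n"       if $chars =~ /^..$/;
print "undecoded string does not match it\n"  if $bytes !~ /^..$/;
```

Only lengths are printed here, so no "binmode" is needed; printing $chars 
itself to STDOUT without one would draw a "Wide character" warning.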

Anyway, let me repeat that I don't see any disagreement at all between what 
I wrote and the contents of the page you mentioned. The problem is that 
people working on code internals (which is what the page actually seems to 
be talking about) might not be the best people to describe what Perl 
actually does from a user point of view.

B. Bullock.



		