TLUG Mailing List

Re: [tlug] Japanese regex question

Date: Mon, 29 Aug 2005 00:25:39 +0900

From: "Stephen J. Turnbull" <stephen@example.com>

Subject: Re: [tlug] Japanese regex question

References: <200508241701.55144.jq@example.com><20050825183913.O88704@example.com><200508251253.47083.jq@example.com><20050826113217.J88704@example.com><87zmr2me23.fsf@example.com><30ce843605082808003eac8faa@example.com>

Organization: The XEmacs Project

User-agent: Gnus/5.1007 (Gnus v5.10.7) XEmacs/21.5-b21 (corn, linux)
>>>>> "Ian" == Ian Wells <ijw@example.com> writes:

    Ian> Can you explain how Perl's doing it wrong?  Works for me
    Ian> (tm)...

I should start by saying that I know Python got this wrong, and that
(from the description so far) it sounds like Perl did, too.

Basically, all text should be in Unicode by default.  The program
source, including literal strings, should be in Unicode, and all I/O
should be run through codecs that attempt to convert to Unicode, and
error at the slightest whiff of incorrect coding.
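
To make "error at the slightest whiff" concrete, here is a minimal sketch
in present-day Python 3, which postdates this message and is used purely
as an illustration.  The helper name decode_strict is hypothetical; the
only real API used is bytes.decode with errors="strict".

```python
def decode_strict(raw: bytes, encoding: str = "utf-8") -> str:
    """Decode bytes to text, refusing any byte sequence that is not valid.

    errors="strict" makes the codec raise UnicodeDecodeError instead of
    silently replacing or dropping bad bytes.
    """
    return raw.decode(encoding, errors="strict")


if __name__ == "__main__":
    good = "日本語".encode("utf-8")
    bad = "日本語".encode("euc_jp")  # EUC-JP bytes mislabeled as UTF-8

    print(decode_strict(good))
    try:
        decode_strict(bad)
    except UnicodeDecodeError as exc:
        print("rejected:", exc.reason)
```

The point of the sketch is only that the failure happens at the I/O
boundary, where the wrong encoding was assumed, rather than later inside
a regex or string comparison.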

Of course there are plenty of good applications for using "text
processing" on streams of raw bytes, but these are normally fairly
localized (i.e., to a single module) and specialized (so you'd expect the
programmers to know what they're doing), and it would be reasonable to
use an awkward interface (such as a flag to the function/operator
telling the interpreter to assume raw bytes rather than UTF-8 or
UTF-16, and not try to decode).

If you have legacy code that would be expensive to convert properly
for one reason or another, then there would be a "use LaxText"
declaration inverting the sense of the flag (i.e., for that module the
default would be to assume octet streams rather than character
streams, and decoding would be off by default).  But this should be
done on a module-local basis.  Of course there would be a flag to the
interpreter itself (we're all consenting adults, here), but people who
use it deserve what they get (hopefully, fired ;-).
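
No such "use LaxText" declaration exists as far as I know; the following
is only a sketch of the idea in Python, with a hypothetical module-level
switch LAX_TEXT and a hypothetical helper load standing in for it.

```python
from typing import Optional, Union

# Hypothetical module-local switch standing in for "use LaxText".
# False: this module treats data as text and decodes strictly.
# True:  this module treats data as octet streams unless told otherwise.
LAX_TEXT = False


def load(data: bytes,
         raw: Optional[bool] = None,
         lax_default: bool = LAX_TEXT) -> Union[str, bytes]:
    """Return decoded text or raw bytes.

    raw=None means "use this module's default"; an explicit raw=True or
    raw=False always wins, so the per-call flag keeps working and the
    module-level declaration only inverts what the default means.
    """
    if raw is None:
        raw = lax_default
    if raw:
        return data
    return data.decode("utf-8", errors="strict")


if __name__ == "__main__":
    print(load(b"abc"))                  # strict module: text
    print(load(b"\xff\xfe", raw=True))   # explicit opt-in to bytes
    print(load(b"abc", lax_default=True))  # legacy module: bytes
```

The switch lives in the module that needs it, so legacy code can stay
lax without changing the default for everything else.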

All of this is going to have to be done someday.  The halfway approach
that Python used accomplishes only two things: (1) it hides the real
problem from a relatively small number of users, who will still
occasionally get bitten, and (2) it encourages ever more programmers to
keep writing Unicode-oblivious code that will someday bite their users.
My guess is that Perl did it the same way.


-- 
School of Systems and Information Engineering http://turnbull.sk.tsukuba.ac.jp
University of Tsukuba                    Tennodai 1-1-1 Tsukuba 305-8573 JAPAN
               Ask not how you can "do" free software business;
              ask what your business can "do for" free software.