Mailing List Archive
tlug.jp Mailing List: tlug archive
Re: [tlug] Japanese regex question
- Date: Mon, 29 Aug 2005 00:25:39 +0900
- From: "Stephen J. Turnbull" <stephen@example.com>
- Subject: Re: [tlug] Japanese regex question
- References: <200508241701.55144.jq@example.com><20050825183913.O88704@example.com><200508251253.47083.jq@example.com><20050826113217.J88704@example.com><87zmr2me23.fsf@example.com><30ce843605082808003eac8faa@example.com>
- Organization: The XEmacs Project
- User-agent: Gnus/5.1007 (Gnus v5.10.7) XEmacs/21.5-b21 (corn, linux)
>>>>> "Ian" == Ian Wells <ijw@example.com> writes:

    Ian> Can you explain how Perl's doing it wrong?  Works for me
    Ian> (tm)...

I should start by saying that I know Python got this wrong, and that (from the description so far) it sounds like Perl did, too.

Basically, all text should be in Unicode by default. The program source, including literal strings, should be in Unicode, and all I/O should be run through codecs that attempt to convert to Unicode and raise an error at the slightest whiff of incorrect coding.

Of course there are plenty of good applications for using "text processing" on streams of raw bytes, but these are normally fairly localized (i.e., to a single module) and specialized (so you'd expect the programmers to know what they're doing), and it would be reasonable to use an awkward interface (such as a flag to the function/operator telling the interpreter to assume raw bytes rather than UTF-8 or UTF-16, and not to try to decode).

If you have legacy code that would be expensive to convert properly for one reason or another, then there would be a "use LaxText" declaration inverting the sense of the flag (i.e., for that module the default would be to assume octet streams rather than character streams, and decoding would be off by default). But this should be done on a module-local basis. Of course there would be a flag to the interpreter itself (we're all consenting adults here), but people who use it deserve what they get (hopefully, fired ;-).

All of this is going to have to be done someday. All that the half-way approach Python used accomplishes is to (1) hide the real problem from a relatively small number of users who will occasionally get bitten, and (2) encourage ever more programmers to continue to write Unicode-oblivious code that will someday bite their users. My guess is that Perl did it the same way.
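For concreteness, here is a minimal sketch of the "decode strictly at the I/O boundary, then match on characters" policy described above. It is written in present-day Python 3, which postdates this message, and is an illustration under stated assumptions rather than anything from the thread: the helper names `decode_strict` and `find_katakana` are hypothetical, and the katakana block U+30A0..U+30FF is just a convenient Japanese pattern to test against.

```python
import re
from typing import List

# Character-level pattern: one or more code points from the katakana
# block (U+30A0..U+30FF). It is applied to decoded text, never to bytes.
KATAKANA = re.compile(r"[\u30A0-\u30FF]+")


def decode_strict(raw: bytes, encoding: str = "utf-8") -> str:
    """Hypothetical helper: decode at the I/O boundary and refuse bad input.

    errors="strict" makes malformed byte sequences raise
    UnicodeDecodeError instead of being replaced or passed through.
    """
    return raw.decode(encoding, errors="strict")


def find_katakana(raw: bytes, encoding: str = "utf-8") -> List[str]:
    """Hypothetical helper: decode first, then match on characters."""
    return KATAKANA.findall(decode_strict(raw, encoding))


if __name__ == "__main__":
    utf8_bytes = "テスト data カタカナ".encode("utf-8")
    print(find_katakana(utf8_bytes))

    # The same text in EUC-JP is not valid UTF-8, so a strict decoder
    # rejects it rather than producing mojibake.
    eucjp_bytes = "テスト".encode("euc_jp")
    try:
        find_katakana(eucjp_bytes)
    except UnicodeDecodeError as exc:
        print("rejected:", exc.reason)

    # Naming the right codec explicitly makes the same bytes usable.
    print(find_katakana(eucjp_bytes, encoding="euc_jp"))
```

The raw-bytes case stays possible (a `bytes` pattern such as `re.compile(rb"...")` matches octets), but the caller has to ask for it explicitly, which is roughly the "awkward interface" argued for above.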
-- 
School of Systems and Information Engineering http://turnbull.sk.tsukuba.ac.jp
University of Tsukuba                    Tennodai 1-1-1 Tsukuba 305-8573 JAPAN
               Ask not how you can "do" free software business;
              ask what your business can "do for" free software.