Mailing List Archive



Re: [tlug] unicode and Perl- how to pass command line unicode arguments



Bugger....

---------- Forwarded message ----------
From: Ian Wells <ijw@example.com>
Date: 16-Feb-2006 15:52
Subject: Re: [tlug] unicode and Perl- how to pass command line unicode arguments
To: "Stephen J. Turnbull" <stephen@example.com>

On 15/02/06, Stephen J. Turnbull <stephen@example.com> wrote:
>>>>> "Ian" == Ian Wells <ijw@example.com> writes:
No.  In Python 2.x, there is a natural language text object, which for
historical reasons is called "Unicode" and whose literals are denoted
u"string".  Then there is raw memory, which for historical reasons is
called "string" and whose literals are denoted "string".  For
historical reasons, the raw memory object has continued to be heavily
abused as a container of natural language text.  What I don't like
about Perl, as I understand your description, is that Perl mandates
that abuse (whatever happened to "there's always more than one way to
do it"? :-)

So the distinction is between an object that is a string of values representing text, and an object that is a string of values representing a string of values.  Both of which, I presume, work in most functions (otherwise the misuse you discuss wouldn't happen) - so the misuse is bound to happen.  And the one you're most likely to use (u"", representing readable text) is the one that's harder to type.
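
For illustration only, here is a minimal sketch of that distinction. It assumes Python 3, where the names are the other way round from the 2.x described above: the text object is called str and needs no prefix, and raw memory is called bytes. The sample string is made up for the example.

```python
# Python 3 naming (an assumption of this sketch, not the 2.x of the thread):
# str is the text object, bytes is the raw memory object.
text = "日本語"                # three characters of text
raw = text.encode("utf-8")     # the same text as raw UTF-8 bytes

char_count = len(text)         # counts characters
byte_count = len(raw)          # counts bytes
round_trip = raw.decode("utf-8")

# Mixing the two kinds of object is refused rather than silently coerced.
try:
    text + raw
    mixing_allowed = True
except TypeError:
    mixing_allowed = False
```

In this later version of the language the two types no longer work interchangeably, so the misuse has to be spelled out with an explicit encode or decode.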

So Perl doesn't make the distinction and Python doesn't enforce it properly.  Personally speaking, I'd argue that since binary data is actually fairly uncommon, I've got no problem with the way Perl works - but I would be much more frustrated if the input layers didn't generally take care of conversions after setup (modally, as you later say).

As Gabor pointed out, there is a flexible way of making Python as
DWIM-witted as Perl.  You can set the encoding for the file in the way
which has become common for many text editors (including Emacsen and
IIRC vim), by putting a specially-formatted comment (aka coding
cookie) at the top of the file.
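
A rough sketch of what the coding cookie does, assuming Python 3 and compiling the source from a bytes object instead of a real file. The variable name greeting and the pseudo-filename are made up for the example.

```python
# Source bytes as they might sit on disk: Latin-1 encoded, with a coding
# cookie on the first line. Byte 0xE9 is "é" in Latin-1.
source = b'# -*- coding: latin-1 -*-\ngreeting = "caf\xe9"\n'

namespace = {}
# compile() reads the cookie and decodes the rest of the source accordingly.
exec(compile(source, "<cookie-demo>", "exec"), namespace)
greeting = namespace["greeting"]
```

The cookie only describes how the bytes of that one source text are to be decoded; it says nothing about data read at run time.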

    Ian> in Perl, I don't have to ever specify u"string".  This is a
    Ian> good thing, in my opinion, because I want strings to be
    Ian> stored as decoded (once I've set the source file coding) and
    Ian> not as binary data 99% of the time, and I'm prepared to use
    Ian> \x.. for the other 1%.

But according to you, this is exactly what Perl doesn't do.  It
decodes the text, then stores it as binary data, and depends on you to
not do something stupid.  

[I suspect I'm not doing a good job of distinguishing between utf, ucs and unicode in this thread, for which I apologise.]

Um.  I'm just saying that (in my head and in Perl) a string is a string is a string.  If you consider it mentally to be a list of numbers then it can contain either a language string or a binary data chunk without violating that assumption.  If you assume that it contains a human-readable bit of text then you're wrong at least some of the time.

Perl works utf8 magic as a means of internal compression - that is, it stores strings in memory as a utf8 bytestream when it thinks it's a good idea, but using Perl accessor functions you never get to see that bytestream because Perl doesn't ever let you read that memory directly byte by byte.

Not that I'm doing a good job of explaining, but in my experience the result is that you read a utf8 file by setting the utf8 flag, write it similarly and what you do in between Just Works because you're dealing with strings that can contain all unicode characters, not byte arrays.  Maybe I'm too much a part of the system to see how it could be done better.
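
The same "decode on the way in, encode on the way out" pattern, sketched in Python 3 as a point of comparison (this is not the Perl mechanism itself). The temporary file and its contents are invented for the example.

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "demo.txt")

# Encode on the way out: the file handle is told it is UTF-8.
with open(path, "w", encoding="utf-8", newline="\n") as f:
    f.write("東京 Linux\n")

# Decode on the way in: what comes back is text, not bytes.
with open(path, "r", encoding="utf-8") as f:
    line = f.readline().rstrip("\n")

# In between, operations work on characters.
reversed_line = line[::-1]

# The bytes on disk are a separate matter, visible only in binary mode.
with open(path, "rb") as f:
    on_disk = f.read()
```

The string is eight characters long in the program and thirteen bytes long on disk, and the code in the middle never has to care about the second number.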

This can work, but (a) it depends on
programmer discipline and (b) is modal.  Ie, the "use utf8;"
declaration is at the top of the file which the programmer may or may
not ever look at carefully.

If you embed utf8 in a file and don't put 'use utf8;' at the top you deserve a good slapping.  (Perl's a bit bad for boilerplate anyway, so you usually want 'use strict;use warnings' up there for a start.)  But you could do my $s=decode_utf8("binary unicode"); in a plain file, I suppose. (Function name fictional; I always have to look it up.)
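
For comparison, a sketch of the same "decode where the value is created" idea in Python 3, with the bytes written out as escapes so that the source file itself stays plain ASCII. The variable names and the sample bytes are made up for the example.

```python
# The UTF-8 encoding of a two-character string, spelled out byte by byte.
raw = b"\xe6\x9d\xb1\xe4\xba\xac"

# Decode explicitly at the point of initialisation.
city = raw.decode("utf-8")

byte_count = len(raw)
char_count = len(city)
```

This is verbose, but the assumption about the encoding sits right next to the data it applies to.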

And when your strings contain unicode, then most non-unicode behaviours (wide character output to a normal 8 bit file handle, for instance) will get you a complaint.  Or you can use the runtime flag, if you prefer not to state the assumption everywhere.

In fact I would guess that it might actually be in some other file
entirely, since it's part of the language.  

Nope, it affects the current file only unless you use the runtime flag on the command line.

The Python cookie can't be
in another file, since it only refers to the text of the file
currently being read.  Both approaches have serious problems.
Python's is more readable IMO, but the "convert at variable
initialization" approach is the most readable (though verbose).

I suppose it depends on your expectations.  I like the 'my file's in unicode and my language understands that' approach; I don't see why you'd want a file you edit in unicode only for your language to consider it to be something else.

And that still seems to suggest that pretty much every string you're ever going to type into Python would need u"".toUnicode() (or whatever) when Perl would DWIM.
