Mailing List Archive



Re: [tlug] unicode and Perl - how to pass command line unicode arguments



>>>>> "David" == David Riggs <dariggs@example.com> writes:

    David> Somehow your suggested utf8::decode($x) only returns a
    David> "1", presumably for success, and I do not see how to get
    David> it to return the value.

As David E points out, it's doing its work in place.  Not good.

    David> Very mystifying.

Not really, if you understand what's actually happening.  The main
thing is to disabuse yourself of the notion that anything that's
useful for real programming work can "just work" with Unicode (or with
anything; be thankful you only have to deal with Unicode and not IEEE
754 floating point!).  The basic problem is that languages that
inherited their way of thinking about text from C have a built-in
assumption that text == a region of memory, so strings are really
just a collection of bytes.

Then people get used to programming as though strings and byte arrays
are the same thing, and you don't know what "this is text" means; is
it an array of 8-bit integers, or is it a UTF-8 stream of characters
of variable width?  So all of these languages allow you to treat
memory regions as strings, and it's the programmer's responsibility
(this means YOU! ;-) to disambiguate.
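
To make the ambiguity concrete, here is a minimal sketch in Python
(assuming Python 3, where bytes and str are distinct types).  The
sample bytes and the variable names are mine, chosen purely for
illustration; the same six bytes are either six small integers or two
characters, depending on which way you decide to read them.

```python
# The same region of memory, read two ways.
# These six bytes are the UTF-8 encoding of U+65E5 U+672C.
raw = b"\xe6\x97\xa5\xe6\x9c\xac"

as_bytes = list(raw)            # six 8-bit integers
as_text = raw.decode("utf-8")   # two variable-width characters, decoded

byte_count = len(raw)       # length counted in bytes
char_count = len(as_text)   # length counted in characters
```

Nothing in the bytes themselves says which reading is intended; the
explicit decode() call is where the programmer disambiguates.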

    David> And I thought perl was supposed to just work with unicode!

You might be happier with Python (or some other language with similar
design).  Python has separate types for byte strings and Unicode
strings.  Unicode literals are a bit of an annoyance since you have to
do something like

    var = b"Yes, this is valid UTF-8!".decode('utf-8')

but if you're generally reading from files you can set the default
codec to the appropriate UTF, and you "just read" from the files and
everything "just works."  The basic principle is that all your
workhorse functions should assume (and check for, if they can be
called at higher levels) Unicode as input.  Everything should be
converted to Unicode _explicitly_ as early as possible.
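
As a rough sketch of that principle (again assuming Python 3; the
helper names to_text and count_chars are hypothetical, not from any
library), bytes get decoded explicitly at the boundary, and the
workhorse function refuses anything that is not already text:

```python
def to_text(value, encoding="utf-8"):
    """Boundary helper (hypothetical): decode bytes to text explicitly."""
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value


def count_chars(text):
    """Workhorse (hypothetical): assumes text input and checks for it."""
    if not isinstance(text, str):
        raise TypeError("expected text, got %s" % type(text).__name__)
    return len(text)


# Stand-in for a raw byte-string argument arriving from outside.
raw_arg = b"\xe6\x97\xa5\xe6\x9c\xac"

# Convert once, as early as possible, then work only with text.
n = count_chars(to_text(raw_arg))
```

Passing raw_arg straight to count_chars raises TypeError instead of
quietly counting bytes, which is the point of checking at that level.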

It's probably possible to program in this style in Perl, too, but Perl
believes that anything that can't be implicit should be made so
obscure that it might as well be implicit---it won't be pleasant. ;-)

-- 
School of Systems and Information Engineering http://turnbull.sk.tsukuba.ac.jp
University of Tsukuba                    Tennodai 1-1-1 Tsukuba 305-8573 JAPAN
               Ask not how you can "do" free software business;
              ask what your business can "do for" free software.

