Mailing List Archive



Re: [tlug] unicode and Perl- how to pass command line unicode arguments



Stephen J. Turnbull wrote:

> You might be happier with Python (or some other language with similar
> design).  Python has separate types for byte strings and Unicode
> strings.  Unicode literals are a bit of an annoyance since you have to
> do something like
> 
>     var = "Yes, this is valid UTF-8!".unicode('utf-8')
> 
> but if you're generally reading from files you can set the default
> codec to the appropriate UTF, and you "just read" from the files and
> everything "just works."  

in python, byte-strings and unicode-strings are both objects, but of 
separate types. you create a byte string, for example, like this:

string1 = "byte string"

and a unicode string like this:

string2 = u"unicode string"

to convert between unicode and byte-strings you can use the 
encode/decode methods of the strings.

decode decodes a byte-string into unicode, and encode encodes a 
unicode string into a byte-string.

unicodestring = "byte string".decode('utf-8')
bytestring = u"unicode string".encode('utf-8')

you can also use the unicodestring object's constructor, which takes a 
bytestring and a charset specifier, like:

unicodestring = unicode("byte string","utf-8")

which is the same as doing

unicodestring = "byte string".decode('utf-8')
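
a small round-trip sketch of the conversions above, with one big assumption: it is written in the python 3 spelling, where the byte-string type is called bytes and the unicode type is called str (so the constructor form is str(...) rather than unicode(...)). the sample word is just an illustration.

```python
# Assumption: Python 3 names. bytes = byte-string, str = unicode string.
text = "caf\u00e9"                    # unicode string with 4 code points
data = text.encode("utf-8")           # unicode -> byte-string
assert data == b"caf\xc3\xa9"         # the accented letter takes 2 bytes in UTF-8

roundtrip = data.decode("utf-8")      # byte-string -> unicode
via_constructor = str(data, "utf-8")  # constructor form, same result as decode

print(len(text), len(data))           # lengths differ: code points vs. bytes
print(roundtrip == text, via_constructor == text)
```

the point of the two lengths is that a byte-string counts bytes while a unicode string counts characters, which is why the two types should not be mixed up.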

one problematic "feature" is that when you concatenate a bytestring and 
a unicodestring, python will convert the bytestring into unicode using 
the 'default encoding', which is 'ascii'.
this default encoding can be changed by editing a python file in your 
python distribution, but that is not recommended, because many packages 
rely on the fact that the default encoding is 'ascii'.
this feature is problematic because the code you write, or the modules 
you use, may convert bytestrings to unicodestrings implicitly, and as 
long as you only use english (ascii) characters, it works transparently.

then, in production, of course it breaks down the first time you use a 
non-ascii character.
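
a rough sketch of that failure mode. the assumptions: it is python 3 code, where mixing the two types raises a TypeError instead of converting silently, so the hypothetical helper concat_like_python2 spells out by hand the implicit 'ascii' decode described above. the helper name and the sample strings are made up for illustration.

```python
def concat_like_python2(unicode_part, byte_part, default_encoding="ascii"):
    """Hypothetical helper: decode the bytes with the default encoding, then join."""
    return unicode_part + byte_part.decode(default_encoding)

# Works while the data happens to be pure ASCII ("testing with english only").
ok = concat_like_python2("name: ", b"gabor")
print(ok)

# Breaks as soon as a non-ASCII byte shows up ("production").
failure = None
try:
    concat_like_python2("name: ", "g\u00e1bor".encode("utf-8"))
except UnicodeDecodeError as exc:
    failure = exc
    print("breaks with:", type(exc).__name__)
```

the first call succeeds and the second one raises, even though the code path is exactly the same. only the data changed.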



> The basic principle is that all your
> workhorse functions should assume (and check for, if they can be
> called at higher levels) Unicode as input.  Everything should be
> converted to Unicode _explicitly_ as early as possible.
> 

100% agree
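
a minimal sketch of that principle, assuming python 3 names (bytes and str) and assuming the input is utf-8. to_text and shout are hypothetical names, and raw_args is only a stand-in for whatever arrives from the command line or a file.

```python
def to_text(value, encoding="utf-8"):
    """Boundary helper: decode bytes once, pass unicode through unchanged."""
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value

def shout(text):
    """Workhorse function: accepts unicode only, and checks for it."""
    if not isinstance(text, str):
        raise TypeError("expected unicode text, got %s" % type(text).__name__)
    return text.upper() + "!"

raw_args = [b"caf\xc3\xa9", "tokyo"]    # stand-in for external input
args = [to_text(a) for a in raw_args]   # convert explicitly, as early as possible
results = [shout(a) for a in args]
print(results)
```

the conversion happens in exactly one place, so a wrong encoding fails right at the boundary instead of somewhere deep inside the workhorse code.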


gabor

