Mailing List Archive
Re: [tlug] unicode and Perl- how to pass command line unicodearguments
- Date: Mon, 13 Feb 2006 19:50:47 +0100
- From: gabor <gabor@example.com>
- Subject: Re: [tlug] unicode and Perl- how to pass command line unicodearguments
- References: <43EFF8C4.4050704@example.com><87d5hrix3v.fsf@example.com>
- User-agent: Thunderbird 1.5 (Macintosh/20051201)
Stephen J. Turnbull wrote:
> You might be happier with Python (or some other language with similar
> design). Python has separate types for byte strings and Unicode
> strings. Unicode literals are a bit of an annoyance since you have to
> do something like
>
>     var = "Yes, this is valid UTF-8!".unicode('utf-8')
>
> but if you're generally reading from files you can set the default
> codec to the appropriate UTF, and you "just read" from the files and
> everything "just works."

In Python, byte strings are objects, and unicode strings are objects too.

You create a byte string like this, for example:

    string1 = "byte string"

and a unicode string like this:

    string2 = u"unicode string"

To convert between unicode strings and byte strings, use the strings' encode/decode methods. decode turns a byte string into a unicode string, and encode turns a unicode string into a byte string:

    unicodestring = "byte string".decode('utf-8')
    bytestring = u"unicode string".encode('utf-8')

You can also use the unicode constructor, which takes a byte string and a charset specifier:

    unicodestring = unicode("byte string", "utf-8")

which is the same as

    unicodestring = "byte string".decode('utf-8')

One problematic "feature" is that when you concatenate a byte string and a unicode string, Python converts the byte string to unicode using the 'default encoding', which is 'ascii'. This default encoding can be changed by editing a Python file in your Python distribution, but that is not recommended, because many packages rely on the default encoding being 'ascii'.

This feature is problematic because code you write, or modules you use, may convert byte strings to unicode strings implicitly. As long as you only use English (ASCII) characters, it works transparently. Then, in production, of course, it breaks down the first time a non-ASCII character shows up.
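For anyone trying this on Python 3, here is a rough sketch of the same round trip. It assumes Python 3, where the byte string type is called bytes and the unicode string type is called str; the snippets in this message are Python 2. In Python 3 the implicit ASCII conversion is gone, and mixing the two types raises a TypeError instead. The sample text and variable names are made up for illustration.

```python
# Python 3 sketch: bytes plays the role of the byte string,
# str plays the role of the unicode string.
text = "caf\u00e9"                    # str with one non-ASCII character
data = text.encode("utf-8")           # str -> bytes
assert data == b"caf\xc3\xa9"
assert data.decode("utf-8") == text   # bytes -> str, round trip
assert str(data, "utf-8") == text     # constructor form, same as decode

# Mixing the two types is an error here, not a silent ASCII decode.
try:
    mixed = text + data
except TypeError:
    mixed = None

# Decoding non-ASCII bytes as ASCII fails. This is the same kind of
# breakage as the implicit 'ascii' conversion described above.
try:
    data.decode("ascii")
    ascii_ok = True
except UnicodeDecodeError:
    ascii_ok = False
```

With pure-ASCII input the ascii decode would succeed, which is why this kind of bug tends to stay hidden until real data arrives.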
> The basic principle is that all your
> workhorse functions should assume (and check for, if they can be
> called at higher levels) Unicode as input. Everything should be
> converted to Unicode _explicitly_ as early as possible.

100% agree.

gabor
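A minimal sketch of that "convert early, check in the workhorse" principle, assuming Python 3 and UTF-8 input. The names to_text and shout are hypothetical helpers invented for this example, not part of any library.

```python
def to_text(value, encoding="utf-8"):
    """Hypothetical boundary helper: decode bytes once, pass str through."""
    if isinstance(value, bytes):
        return value.decode(encoding)
    if isinstance(value, str):
        return value
    raise TypeError("expected bytes or str, got %s" % type(value).__name__)


def shout(text):
    """Workhorse function: accepts only text, and checks for it."""
    if not isinstance(text, str):
        raise TypeError("shout() needs str; decode bytes at the boundary")
    return text.upper() + "!"


# Bytes as they might arrive from a file or a socket (assumed UTF-8).
raw = "d\u00e9j\u00e0 vu".encode("utf-8")

# Decode explicitly at the edge, then work with text only.
result = shout(to_text(raw))
```

Passing raw straight to shout raises a TypeError, so a missing conversion shows up at the call site and not later in production.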
- Follow-Ups:
- Re: [tlug] unicode and Perl- how to pass command line unicodearguments
- From: Stephen J. Turnbull
- References:
- Re: [tlug] unicode and Perl- how to pass command line unicodearguments
- From: David Riggs
- Re: [tlug] unicode and Perl- how to pass command line unicodearguments
- From: Stephen J. Turnbull