Mailing List Archive



Unicode support in classic Unix programs including Python (was: Re: Learn a Variety of Languages) [tlug]



Jim writes:

 > With UTF-8, designed by UNIX guru Ken Thompson, 
 > Unicode (in UTF-8) plays well with most Unix/Linux software. 
 > That should include Python. 

Nope.  Python's internal encoding is platform-endian UTF-16.  It's the
codec library that provides the necessary translations.

 > 
 >    http://en.wikipedia.org/wiki/UTF-8
 > 
 > Regexes _do_ become "interesting" in Unicode. 

Actually, no.  Just convert to UTF-8 to avoid the zero bytes that tend
to confuse C code, and wrap parentheses around every character.  That
is obviously a non-starter as a programmer interface, but just as
obviously it's essentially trivial to do with a preprocessor.  Then use
any byte-oriented regexp engine you like.  Of course this process can
be vastly optimized.
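To make the trick concrete, here is a minimal sketch of such a preprocessor (the function name `utf8_pattern` is mine, not anything standard): each character of a literal pattern is UTF-8 encoded and wrapped in a non-capturing group, so a byte-oriented engine applies quantifiers to whole characters rather than to a character's trailing byte.

```python
import re

def utf8_pattern(literal: str) -> bytes:
    """Encode each character to UTF-8 and wrap it in (?:...) so that
    quantifiers in a byte-oriented engine treat it as one unit."""
    return b"".join(b"(?:" + re.escape(ch.encode("utf-8")) + b")"
                    for ch in literal)

# "one or more 猫" -- matches the whole multibyte character repeatedly
wrapped = re.compile(utf8_pattern("猫") + b"+")
assert wrapped.fullmatch("猫猫".encode("utf-8"))

# Without the wrapping, the quantifier binds only to the final byte
# of the UTF-8 sequence, and the match fails:
naive = re.compile(re.escape("猫".encode("utf-8")) + b"+")
assert naive.fullmatch("猫猫".encode("utf-8")) is None
```

Python's `re` applied to `bytes` here stands in for any byte-oriented engine; the same transformation works with a C library like PCRE.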

It's true that doing character classes well is an interesting
programming task, but we have to do hard work there even for ASCII,
because the POSIX standard says that ranges in character classes are
defined in terms of the collation order.

 > If Python has an advantage over other programming languages 
 > regarding CJK, I would expect that advantage to be related to 
 > regexes and/or sorting.

Personally, I use Python because it fits my style, not because it's
better at CJK than anything else.  There are apparently some
interesting techniques used to compact character classes and the like.
This matters in Python because compiled regexps are designed to be
persistent, and Python's "on-the-fly" search functions cache 100
compiled regexps as attributes of the string representation, so you're
likely to have a *lot* of regexps lying around in a large Python
program.
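The two styles of regexp use can be sketched as follows (the exact cache size and mechanism have varied across Python versions, so take the "100" figure as historical):

```python
import re

# Module-level functions compile the pattern string on the fly and
# keep the compiled object in re's internal cache, so repeated calls
# with the same pattern string avoid recompilation.
assert re.search(r"\d+", "abc123").group() == "123"

# Explicit compilation returns a persistent pattern object under the
# caller's control, independent of the internal cache.
digits = re.compile(r"\d+")
assert digits.findall("a1b22c333") == ["1", "22", "333"]

# re.purge() empties the internal cache (mostly useful in tests).
re.purge()
```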

I will remark that there are two further important general
considerations.  The first is support for message catalogs.  Python is
pretty good (on a scale of 1-10, not in comparison to any other
language) for message translation support.  The second is
internationalization of the standard library.  Python's standard
library, and many apps written in Python (such as Mailman), are
well-internationalized already.
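For the message-catalog point, a minimal sketch of Python's `gettext` machinery: with a real compiled `.mo` catalog you would use `gettext.translation()`, while `NullTranslations` is the no-catalog fallback that passes messages through unchanged.

```python
import gettext

# No catalog installed: messages fall through untranslated.
t = gettext.NullTranslations()
_ = t.gettext
assert _("Hello, world") == "Hello, world"

# With a catalog, the same call site would return the translation:
#   t = gettext.translation("myapp", localedir="locale", languages=["ja"])
#   _ = t.gettext
```

The point of the `_()` convention is that marking strings for translation costs one character pair at each call site, which is why apps like Mailman could be internationalized thoroughly.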

 >    http://en.wikipedia.org/wiki/Han_unification#Check_your_browser

Firefox does OK.

I browsed around that page.  It evidently was written by a
non-specialist; it makes many elementary mistakes in discussing the
standards themselves.  It seems basically balanced, although some of
the missing information is important.  For example, the ISO 2022
standard goes back to ECMA 35, originally drafted in the early 1970s
(!), and has never been widely implemented as an I18N technique.  The
most important example of such an implementation is X11 Compound Text.
Closely related are TRON code and Emacs/MULE code, but neither is
actually a profile of ISO 2022.

Interestingly enough, both TRON code and Emacs/MULE code have trouble
with European languages, because they dis-unify the Latin-X sets,
including ISO 8859/1 and ISO 8859/15!

In fact, all of the alternatives mentioned (except UTF-2000, which
isn't a serious contender as anything but a linguistics research tool)
long predate the diffusion of Unicode.  The fact that none were ever
very successful in supporting widespread internationalization while
Unicode has made steady, increasingly rapid, progress since about 1990
should be considered pretty damning for the older techniques.

Another thing that is worth remembering about Unicode is that there is
no requirement that any given character be supported, except that it
must not be corrupted in transmission.  What this means is that a
3-byte or 4-byte code that supported Japanese and Chinese separately
would very likely devolve into a family of related, but mutually
unintelligible, national codes like those we already have.

