Mailing List Archive
Unicode support in classic Unix programs including Python. . . . . . . . . (was: Re: Learn a Variety of Languages) [tlug]
- Date: Wed, 17 Jan 2007 14:39:42 +0900
- From: "Stephen J. Turnbull" <stephen@example.com>
- Subject: Unicode support in classic Unix programs including Python. . . . . . . . . (was: Re: Learn a Variety of Languages) [tlug]
- References: <45AAFDA9.90504@example.com> <20070115091710.465a8f0b.jep200404@example.com> <op.tl694ax0p3esx5@example.com> <78d7dd350701151659n104a93e3v6cd6cb936f4459ed@example.com> <20070115212859.0251adf4.jep200404@example.com> <157AA731-9AC1-4FEF-ABAD-23A5BE8F0C05@example.com> <Pine.NEB.4.64.0701161628390.12325@example.com> <78d7dd350701160014h40155a75n345183640cbccfc5@example.com> <Pine.NEB.4.64.0701161825530.14637@example.com> <20070116090544.3dc92410.jep200404@example.com>
Jim writes:

> With UTF-8, designed by UNIX guru Ken Thompson,
> Unicode (in UTF-8) plays well with most Unix/Linux software.
> That should include Python.

Nope. Python's internal code is platform-endian UTF-16. It's the codec library that provides the necessary translations.

> http://en.wikipedia.org/wiki/UTF-8
>
> Regexes _do_ become "interesting" in Unicode.

Actually, not. Just convert to UTF-8 to avoid zero bytes that tend to confuse C, and wrap parens around every character, which is obviously a non-starter as a programmer interface, and just as obviously essentially trivial to do with a preprocessor. Then use any byte-oriented regexp engine you like. Of course this process can be vastly optimized.

It's true that doing character classes well is an interesting programming task, but we have to do hard work there even for ASCII because of the POSIX standard which says that ranges for character classes should be defined in terms of the collation order.

> If Python has an advantage over other programming languages
> regarding CJK, I would expect that advantage to be related to
> regexes and/or sorting.

Personally, I use Python because it fits my style, not because it's better at CJK than anything else. There are apparently some interesting techniques used to compact character classes and the like (this is important in Python because not only are compiled regexps designed to be persistent, but Python "on-the-fly" search functions cache 100 compiled regexps as attributes of the string representation ---you're likely to have a *lot* of regexps lying around in a large Python program).

I will remark that there are two more important general considerations. The first is support for message catalogs. Python is pretty good (on a scale of 1-10, not in comparison to any other language) for message translation support. Second is internationalization of the standard library. Python's standard library, and many apps written in Python (such as Mailman) are well-internationalized already.

> http://en.wikipedia.org/wiki/Han_unification#Check_your_browser

Firefox does OK.

I browsed around that page. It evidently was written by a non-specialist; it makes many elementary mistakes in discussing the standards themselves. Seems basically balanced, although some of the missing information is important. For example, the ISO 2022 standard goes back to ECMA 35, originally drafted in the early 1970s (!), and has never been widely implemented as an I18N technique. The most important example is X11 Compound Text. Closely related are TRON code and Emacs/MULE code, but neither is actually a profile of ISO 2022. Interestingly enough, both TRON code and Emacs/MULE code have trouble with European languages, because they dis-unify the Latin-X sets, including ISO 8859/1 and ISO 8859/15!

In fact, all of the alternatives mentioned (except UTF-2000, which isn't a serious contender as anything but a linguistics research tool) long predate the diffusion of Unicode. The fact that none were ever very successful in supporting widespread internationalization while Unicode has made steady, increasingly rapid, progress since about 1990 should be considered pretty damning for the older techniques.

Another thing that is worth remembering about Unicode is that there is no requirement that any given character be supported, except that it must not be corrupted in transmission. What this means is that a 3-byte or 4-byte code that supported Japanese and Chinese separately would very likely devolve into a family of related, but mutually unintelligible, national codes like those we already have.
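As a rough illustration of the regexp remark earlier in this message (convert to UTF-8, group every character, then hand the result to a byte-oriented engine), here is a minimal sketch in Python 3. The names to_byte_pattern, search and UTF8_CHAR are hypothetical helpers made up for this example, not part of any library. The sketch assumes the pattern has no character classes and that every backslash escape is followed by an ASCII character; it uses non-capturing groups where the text says parens.

```python
import re

# Byte-level pattern for one UTF-8 encoded character other than newline.
# (Hypothetical helper; used to translate "." in the input pattern.)
UTF8_CHAR = (
    rb"(?:[\x00-\x09\x0b-\x7f]"
    rb"|[\xc2-\xdf][\x80-\xbf]"
    rb"|[\xe0-\xef][\x80-\xbf]{2}"
    rb"|[\xf0-\xf4][\x80-\xbf]{3})"
)


def to_byte_pattern(pattern):
    """Rewrite a str regexp as a bytes regexp over UTF-8 text.

    Assumptions of this sketch: no character classes ([...]) in the
    pattern, and backslash escapes are followed by an ASCII character.
    """
    parts = []
    chars = iter(pattern)
    for ch in chars:
        if ch == "\\":
            # Copy the escape through unchanged.
            parts.append(b"\\" + next(chars, "").encode("ascii"))
        elif ch == ".":
            # "Any character" must consume a whole UTF-8 sequence.
            parts.append(UTF8_CHAR)
        elif ord(ch) < 0x80:
            # ASCII (including regexp operators) is a single byte already.
            parts.append(ch.encode("ascii"))
        else:
            # Group the bytes so a following ?, * or + applies to the
            # whole character rather than to its last byte.
            parts.append(b"(?:" + ch.encode("utf-8") + b")")
    return b"".join(parts)


def search(pattern, text):
    """Search UTF-8 encoded text with the byte-oriented engine."""
    m = re.search(to_byte_pattern(pattern), text.encode("utf-8"))
    return m.group(0).decode("utf-8") if m else None
```

Without the grouping, a quantifier after a multi-byte character would repeat only that character's final byte, so a pattern like "あ+い" would not behave as a reader expects. Character classes are left out on purpose; as noted above, doing them well is the interesting part.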
- References:
- [tlug] What is the most appropriate scripting language
- From: Dave M G
- Bourne Shell is the most appropriate scripting language (was Re: [tlug] What is the most appropriate scripting language)
- From: Jim
- [tlug] Re: Bourne Shell is the most appropriate scripting language
- From: Greg Thomson
- Re: [tlug] Re: Bourne Shell is the most appropriate scripting language
- From: Nguyen Vu Hung
- Learn a Variety of Languages . . . . . . . (was: Re: [tlug] Re: Bourne Shell is the most appropriate scripting language)
- From: Jim
- Re: Learn a Variety of Languages . . . . . . . (was: Re: [tlug] Re: Bourne Shell is the most appropriate scripting language)
- From: Jean-Christophe Helary
- Re: Learn a Variety of Languages . . . . . . . (was: Re: [tlug] Re: Bourne Shell is the most appropriate scripting language)
- From: Curt Sampson
- Re: Learn a Variety of Languages . . . . . . . (was: Re: [tlug] Re: Bourne Shell is the most appropriate scripting language)
- From: Nguyen Vu Hung
- Re: Learn a Variety of Languages . . . . . . . (was: Re: [tlug] Re: Bourne Shell is the most appropriate scripting language)
- From: Curt Sampson
- Unicode support in classic Unix programs including Python. . . . . . . . . (was: Re: Learn a Variety of Languages) [tlug]
- From: Jim