Mailing List Archive
Re: [tlug] Japanese regex question
- Date: Mon, 02 Jan 2006 19:31:07 +0900
- From: "Stephen J. Turnbull" <stephen@example.com>
- Subject: Re: [tlug] Japanese regex question
- References: <200508241701.55144.jq@example.com> <20050825183913.O88704@example.com> <200508251253.47083.jq@example.com> <20050826113217.J88704@example.com> <87zmr2me23.fsf@example.com> <43B85806.7030909@example.com>
- Organization: The XEmacs Project
- User-agent: Gnus/5.1007 (Gnus v5.10.7) XEmacs/21.5-b24 (dandelion, linux)
>>>>> "Gábor" == Gábor Farkas <gabor@example.com> writes:

    Gábor> Stephen J. Turnbull wrote:

    >> This is all so stupid.  XEmacs has been doing this (badly) for
    >> almost a decade, Mule for another 3 or 4 years longer than
    >> that.  Why Perl and Python failed to seize the opportunity to
    >> do it right when they added Unicode support I'll never know.

    Gábor> sorry to jump in so late...

    Gábor> could you please describe to me what is Python doing wrong
    Gábor> regarding unicode?

Nothing.  It's what it doesn't do that's unfortunate.

What Emacs does (uniquely, as far as I know) is to convert
_everything_ internally to a UCS (currently not Unicode, but both
major forks will have experimental "Unicode Inside" code bases
generally available within 6 months, I would guess).  Of course you
can specify the external coding as "binary", if you like, but you MUST
specify it.  XEmacs went a step further, and separated the character
type from the integer type (unlike Python but like C, character is an
integral type, not a string of length 1).

On the contrary, with Python's Unicode support (including PEP 263),
they explicitly decided to grandfather existing applications that
import C strings in various encodings, and allow them to coexist with
Unicode strings.  This is allegedly for backward compatibility, but
XEmacs has proved (for ten years, now) that there is no backward
compatibility problem (ie, a Mule-enabled XEmacs can run a Mule-blind
program with no problem).[1]

It's true that there are a number of design bugs in the Python codecs.
For example, the UTF-16 string codecs always prepend the BOM, so when
you concatenate them you get "<BOM>text<BOM>text", which should never
happen.  The BOM and/or UTF signature is not for use within a single
application, it's for interoperation.  So what should happen is that
when you open a stream (eg a file or a pipe), the open routine should
send a BOM/signature.
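To make the BOM point concrete, here is a minimal sketch.  It assumes a
current Python 3 interpreter rather than the Python 2.x under discussion
in this thread, so it illustrates the idea and is not a record of the
behavior at the time.  The helper names (naive_concat, stream_write,
bom_count) are made up for this illustration.  The first helper encodes
each piece separately and joins the bytes; the second writes all pieces
through one text stream, so the signature is sent once, when the stream
starts.

```python
import codecs
import io


def naive_concat(parts):
    """Encode each piece on its own with the 'utf-16' codec, then join.

    Every call to str.encode('utf-16') prepends a BOM, so the joined
    bytes look like <BOM>text<BOM>text.
    """
    return b"".join(p.encode("utf-16") for p in parts)


def stream_write(parts):
    """Write all pieces through a single UTF-16 text stream.

    The stream emits the BOM once, at the start of output, which is
    the "open routine sends the signature" arrangement.
    """
    raw = io.BytesIO()
    stream = io.TextIOWrapper(raw, encoding="utf-16", newline="")
    for p in parts:
        stream.write(p)
    stream.flush()
    data = raw.getvalue()
    stream.detach()  # leave the underlying buffer alone
    return data


def bom_count(data):
    """Count native-byte-order UTF-16 BOMs in a byte string.

    Note: this is a plain byte search, good enough for ASCII-only
    sample text, not a general-purpose BOM detector.
    """
    return data.count(codecs.BOM_UTF16)


if __name__ == "__main__":
    pieces = ["text", "text"]
    print(bom_count(naive_concat(pieces)), repr(naive_concat(pieces).decode("utf-16")))
    print(bom_count(stream_write(pieces)), repr(stream_write(pieces).decode("utf-16")))
```

Decoding the joined bytes from naive_concat leaves a stray U+FEFF in the
middle of the string, because only the leading BOM is treated as a
signature.  If you have to encode piecewise, one way around it is to use
the explicit-endian codecs ("utf-16-le" or "utf-16-be"), which add no
BOM, and write the signature yourself exactly once.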
Obviously this is easy enough to work around (which is why above I
wrote "does nothing wrong"), but you can see an unseemly degree of
DWIMble-mindedness in the Unicode stuff (not surprising, it was all
written and specified by Windows-bound developers, and not of the
quality of Wicked Uncle Timmy, either).

The BDFL acknowledged the utter righteousness of this view at the
time, but caved to the "backward compatibility" crowd.  Python 3000
will get it right though.  Guido is sick and tired of the FAQs that
resulted (which were predictable and predicted<wink>).

Footnotes:
[1]  I specify XEmacsen rather than "Mule" or "Emacsen" because XEmacs
has a compile time switch to include Mule or not, so (unlike the old
NEmacs and Mule patches or modern GNU Emacs) the same code must run in
both environments.

-- 
School of Systems and Information Engineering http://turnbull.sk.tsukuba.ac.jp
University of Tsukuba                    Tennodai 1-1-1 Tsukuba 305-8573 JAPAN
               Ask not how you can "do" free software business;
              ask what your business can "do for" free software.
- References:
- Re: [tlug] Japanese regex question
- From: Gábor Farkas