Re: [tlug] Japanese regex question

Date: Mon, 02 Jan 2006 19:31:07 +0900
From: "Stephen J. Turnbull" <stephen@example.com>
Subject: Re: [tlug] Japanese regex question
References: <200508241701.55144.jq@example.com><20050825183913.O88704@example.com><200508251253.47083.jq@example.com><20050826113217.J88704@example.com><87zmr2me23.fsf@example.com><43B85806.7030909@example.com>
Organization: The XEmacs Project
User-agent: Gnus/5.1007 (Gnus v5.10.7) XEmacs/21.5-b24 (dandelion, linux)

>>>>> "Gábor" == Gábor Farkas <gabor@example.com> writes:

    Gábor> Stephen J. Turnbull wrote:

    >> This is all so stupid.  XEmacs has been doing this (badly) for
    >> almost a decade, Mule for another 3 or 4 years longer than
    >> that.  Why Perl and Python failed to seize the opportunity to
    >> do it right when they added Unicode support I'll never know.

    Gábor> sorry to jump in so late...

    Gábor> could you please describe to me what is Python doing wrong
    Gábor> regarding unicode?

Nothing.  It's what it doesn't do that's unfortunate.

What Emacs does (uniquely, as far as I know) is to convert
_everything_ internally to a universal character set (currently not
Unicode, but both major forks will have experimental "Unicode Inside"
code bases generally available within 6 months, I would guess).  Of
course you can specify the external coding as "binary", if you like,
but you MUST specify it.  XEmacs went a step further, and made
character a type distinct from integer (unlike Python, where a
character is a string of length 1; as in C, it is a scalar value).

By contrast, with Python's Unicode support (including PEP 263),
they explicitly decided to grandfather existing applications that
import C strings in various encodings, and allow those byte strings
to coexist with Unicode strings.  This is allegedly for backward
compatibility, but XEmacs has proved (for ten years, now) that there
is no backward compatibility problem (i.e., a Mule-enabled XEmacs can
run a Mule-blind program with no problem).[1]
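
To make the "convert everything at the boundary, and you MUST say how"
discipline concrete, here is a minimal sketch in present-day Python 3
(not the Python of this thread, and not Emacs code).  The helper
read_external and the "binary" convention are hypothetical names I am
using to mirror the Emacs behaviour described above; the only thing
taken from Python itself is that bytes and str are separate types that
refuse to mix.

```python
def read_external(raw: bytes, coding: str):
    """Hypothetical helper: the caller must always name the external coding.

    "binary" means "leave the bytes alone"; anything else is decoded
    to text at the boundary, so the rest of the program sees one
    internal representation.
    """
    if coding == "binary":
        return raw
    return raw.decode(coding)


# The same Japanese text arriving in two different external codings.
euc = "日本語".encode("euc_jp")
sjis = "日本語".encode("shift_jis")

from_euc = read_external(euc, "euc_jp")
from_sjis = read_external(sjis, "shift_jis")
untouched = read_external(euc, "binary")

# Mixing undecoded bytes with text is refused rather than guessed at.
try:
    euc + from_euc
    mixing_allowed = True
except TypeError:
    mixing_allowed = False
```

Once both inputs are decoded they compare equal even though the bytes
on the wire differed, which is the point of doing the conversion
exactly once, at the edge.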

It's true that there are a number of design bugs in the Python codecs.
For example, the UTF-16 string codecs always prepend the BOM, so when
you concatenate the encoded strings you get "<BOM>text<BOM>text", which
should never happen.  The BOM and/or UTF signature is not for use
within a single application, it's for interoperation.  So what should
happen is that when you open a stream (e.g., a file or a pipe), the
open routine should send a BOM/signature once, and the codecs should
not.
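
The BOM complaint is easy to reproduce.  The sketch below uses
present-day Python 3 syntax (an assumption on my part; the thread
predates it) and the generic "utf-16" codec from the standard library.
encode_each and encode_stream are hypothetical helper names:
the first shows the problem, the second shows the workaround of
emitting one BOM for the whole stream and encoding the pieces with a
BOM-less, explicit-endianness codec.

```python
import codecs
import sys


def encode_each(parts):
    """Encode every piece with the generic 'utf-16' codec, then concatenate."""
    return b"".join(p.encode("utf-16") for p in parts)


def encode_stream(parts):
    """Hypothetical helper: one BOM for the stream, BOM-less pieces after it."""
    # 'utf-16-le' / 'utf-16-be' never write a BOM; pick the native order
    # so it matches codecs.BOM_UTF16, which is the native-order BOM.
    native = "utf-16-le" if sys.byteorder == "little" else "utf-16-be"
    return codecs.BOM_UTF16 + b"".join(p.encode(native) for p in parts)


naive = encode_each(["text", "text"])
streamed = encode_stream(["text", "text"])

# How many BOMs ended up in the naive concatenation?
print(naive.count(codecs.BOM_UTF16))

# The decoder strips only the leading BOM; an interior one survives
# as the character U+FEFF in the decoded text.
print(repr(naive.decode("utf-16")))
print(repr(streamed.decode("utf-16")))
```

As far as I know, opening a text file with encoding="utf-16" already
behaves the second way and writes the BOM once at the start of the
stream, which is where it belongs.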

Obviously this is easy enough to work around (which is why above I
wrote "does nothing wrong"), but you can see an unseemly degree of
DWIMble-mindedness in the Unicode stuff (not surprising, it was all
written and specified by Windows-bound developers, and not of the
quality of Wicked Uncle Timmy, either).

The BDFL acknowledged the utter righteousness of this view at the
time, but caved to the "backward compatibility" crowd.  Python 3000
will get it right though.  Guido is sick and tired of the FAQs that
resulted (which were predictable and predicted<wink>).

Footnotes: 
[1]  I specify XEmacsen rather than "Mule" or "Emacsen" because XEmacs
has a compile time switch to include Mule or not, so (unlike the old
NEmacs and Mule patches or modern GNU Emacs) the same code must run in
both environments.

-- 
School of Systems and Information Engineering http://turnbull.sk.tsukuba.ac.jp
University of Tsukuba                    Tennodai 1-1-1 Tsukuba 305-8573 JAPAN
               Ask not how you can "do" free software business;
              ask what your business can "do for" free software.
