Re: [tlug] Japanese regex question

>>>>> "Ben" == Ben K Bullock <> writes:

    Ben> it's virtually impossible to miss where a Unicode character
    Ben> starts.

If you're trying to find them.  But you advocate not looking for them
by default!  That is what I'm saying is a mistake.

    Ben> Mule didn't do Unicode for a long time, certainly ten years
    Ben> ago there was absolutely no support for it.

Mule has a universal coded character set (UCS).  No, it's not Unicode
and yes, it's a mistake that it wasn't Unicode.  But the principles of
programming with it are identical to those of UTF-8.

    Ben> It still is not working perfectly, unfortunately.

If you're referring to Unicode handling per se, it's not relevant to
this discussion.  What matters is the fact that Mule has successfully
used an abstractly similar UCS internally for over a decade.  For
internal purposes this UCS has proven to be as bullet-proof as Unicode
as used by Perl (a UTF-8 variant, IIRC) and Python (UCS-2), and it
precedes them by almost 10 years.

That is, almost all of the problems due to Mule have to do with (a)
codecs and (b) GNU Emacs's failure to separate the integer type from
the character type.  There have been a few problems with Mule-specific
features in regexps but they were resolved 7 or 8 years ago (in
XEmacs, anyway).

Based on that example, Perl and Python should have defaulted to
Unicode as the internal encoding, rectifying Mule's mistake in choice
of internal coded character set but keeping the good aspects of the

    Ben> Switch it on with "use utf8" and regular expressions,
    Ben> strings, etc. all work as if one Unicode character was
    Ben> exactly equivalent to one byte.

Same for Mule, except that there's no on-off switch.  Since 1992.[1]
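To make the "one character, one unit" point concrete, here is a minimal sketch in Python 3 (not Perl or Mule; the sample string and variable names are my own, chosen only for illustration).  It contrasts a regexp run over a character string with the same regexp run over the UTF-8 octets of that string.

```python
import re

text = "日本語のテキスト"       # illustrative sample: 8 characters
raw = text.encode("utf-8")     # the same text as UTF-8 octets

# Character semantics: "." consumes exactly one character.
first_char = re.match(r".", text).group()

# Byte semantics: "." consumes one octet, which is only a fragment
# of the first character.
first_byte = re.match(rb".", raw).group()

print(len(text), len(raw))
print(repr(first_char), repr(first_byte))
```

With character semantics the match is the whole first kanji; with byte semantics it is only the lead octet of its three-octet UTF-8 encoding, which is the kind of surprise a character-by-default design avoids.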

    Ben> The developers are going to make even "use utf8" unnecessary,
    Ben> it's said.

They (and the designers of Python) should have done that immediately.
That's all I said.

    Ben> The only tricky thing with Perl is input and output: it's
    Ben> necessary to specify that an input or output file is utf-8
    Ben> using binmode, otherwise it gets treated as bytes.

With Mule, you don't have to do that.  Sometimes you do need to
specify that files are binary, of course.  But even that is fairly
rare.  (Of course you have to disambiguate different flavors of
ISO-8859 or EUC more or less by hand, but UTF-8 vs everything else is
very accurately detected.)

    Ben> However, if you think about it, that is necessary, since
    Ben> there is no guarantee that every file input or output must be
    Ben> in utf-8 format, and forcing everything to be read as utf-8
    Ben> would cause errors on non-utf-8 binary files.

It's not necessary.  That's what exception handlers are for.  And most
application programmers would never need to write such an exception
handler; there would be a standard "autodetection" codec that handled
this for them.[2]

Of course if you know they're non-UTF-8 binary, then explicitly
forcing binary makes a lot of sense.  That's what I advocate.
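A rough sketch of what I mean, in Python 3.  The helper name `read_file` and its `binary` flag are hypothetical, not an existing library API: decode as UTF-8 by default, let the exception handler deal with files that turn out not to be UTF-8, and let the caller force binary when that is known in advance.

```python
def read_file(path, binary=False):
    """Return str for valid UTF-8 input, bytes otherwise.

    Hypothetical helper for illustration only.
    """
    with open(path, "rb") as f:
        data = f.read()
    if binary:
        # Caller knows the file is not text: skip decoding entirely.
        return data
    try:
        # Strict decoding raises on octet sequences that are not UTF-8.
        return data.decode("utf-8")
    except UnicodeDecodeError:
        # The exception handler is the fallback; the application
        # programmer never has to write this part.
        return data
```

An application would just call `read_file(path)` and only pass `binary=True` for files it already knows are binary.  Returning raw bytes on failure is one possible policy; a real library might instead go on to try other codecs.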

[1]  Well, admittedly when RMS finally approved the Mule merge into
GNU Emacs, he insisted on putting in an on-off switch.  But he's
learned the error of his ways---there won't be one in Emacs 23.  And
XEmacs has never had one, except at compile time.

[2]  If your encoding-specific codecs error properly on nonsense octet
sequences, you could even implement this as a simple loop over a list
of encoding-specific codecs at some expense of buffer space and time,
neither of which is noticeable in 99% of Emacs applications.  I would
bet that 75% of Python and Perl applications would be the same.
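The loop in [2] might look something like this in Python 3.  This is a sketch under assumptions: the function name `autodetect_decode` and the candidate list are mine, and the order of the list is a guess rather than a tuned heuristic.  It relies only on the built-in codecs raising `UnicodeDecodeError` on octet sequences they cannot decode.

```python
# Assumed candidate list; a real one would be chosen per locale.
CANDIDATES = ("utf-8", "euc_jp", "shift_jis", "iso2022_jp")

def autodetect_decode(data, candidates=CANDIDATES):
    """Try each codec in turn; return (text, codec_name).

    Returns (data, None) if every codec rejects the input, i.e. the
    input is treated as binary.
    """
    for name in candidates:
        try:
            # bytes.decode is strict by default, so nonsense octet
            # sequences raise instead of being silently replaced.
            return data.decode(name), name
        except UnicodeDecodeError:
            continue
    return data, None
```

For example, `autodetect_decode("日本語".encode("euc_jp"))` fails the UTF-8 attempt and succeeds with `euc_jp`.  The loop only works for codecs that actually reject bad input: a single-byte codec such as `latin-1` accepts every octet sequence, which is why the ISO-8859 flavors still have to be told apart by hand.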

School of Systems and Information Engineering
University of Tsukuba                    Tennodai 1-1-1 Tsukuba 305-8573 JAPAN
               Ask not how you can "do" free software business;
              ask what your business can "do for" free software.