Mailing List Archive
Re: [tlug] Japanese regex question
- Date: Tue, 30 Aug 2005 02:27:56 +0900
- From: "Stephen J. Turnbull" <stephen@example.com>
- Subject: Re: [tlug] Japanese regex question
- References: <200508241701.55144.jq@example.com><20050825183913.O88704@example.com><200508251253.47083.jq@example.com><20050826113217.J88704@example.com><87zmr2me23.fsf@example.com><003a01c5ac6d$b6a53420$0b01a8c0@example.com>
- Organization: The XEmacs Project
- User-agent: Gnus/5.1007 (Gnus v5.10.7) XEmacs/21.5-b21 (corn, linux)
>>>>> "Ben" == Ben K Bullock <benkasminbullock@example.com> writes:

    Ben> it's virtually impossible to miss where a Unicode character
    Ben> starts.

If you're trying to find them. But you advocate not looking for them by default! That is what I'm saying is a mistake.

    Ben> Mule didn't do Unicode for a long time, certainly ten years
    Ben> ago there was absolutely no support for it.

Mule has a universal coded character set (UCS). No, it's not Unicode and yes, it's a mistake that it wasn't Unicode. But the principles of programming with it are identical to those of UTF-8.

    Ben> It still is not working perfectly, unfortunately.

If you're referring to Unicode handling per se, it's not relevant to this discussion. What matters is the fact that Mule has successfully used an abstractly similar UCS internally for over a decade. For internal purposes this UCS has proven to be as bullet-proof as Unicode as used by Perl (a UTF-8 variant, IIRC) and Python (UCS-2), and it precedes them by almost 10 years.

That is, almost all of the problems due to Mule have to do with (a) codecs and (b) GNU Emacs's failure to separate the integer type from the character type. There have been a few problems with Mule-specific features in regexps but they were resolved 7 or 8 years ago (in XEmacs, anyway).

Based on that example, Perl and Python should have defaulted to Unicode as the internal encoding, rectifying Mule's mistake in choice of internal coded character set but keeping the good aspects of the design.

    Ben> Switch it on with "use utf8" and regular expressions,
    Ben> strings, etc. all work as if one Unicode character was
    Ben> exactly equivalent to one byte.

Same for Mule, except that there's no on-off switch. Since 1992.[1]

    Ben> The developers are going to make even "use utf8" unnecessary,
    Ben> it's said.

They (and the designers of Python) should have done that immediately. That's all I said.
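To illustrate the character-versus-octet point in the exchange above, here is a minimal sketch in present-day Python 3, where `str` is always a sequence of Unicode characters and no switch is needed. It postdates this thread and is not taken from it; the sample string and variable names are only for illustration. The same pattern is run once over decoded text and once over the raw UTF-8 octets.

```python
import re

text = "日本語"              # three characters of decoded text
raw = text.encode("utf-8")   # the same text as UTF-8 octets

# On decoded text, "." matches one character at a time.
chars = re.findall(r".", text)

# On undecoded bytes, "." matches one octet at a time, so each
# character is split into its three UTF-8 octets.
octets = re.findall(rb".", raw)

print(len(chars), len(octets))
```

With decoded text the regexp engine never sees a partial character, which is the behaviour both correspondents want by default.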
    Ben> The only tricky thing with Perl is input and output: it's
    Ben> necessary to specify that an input or output file is utf-8
    Ben> using binmode, otherwise it gets treated as bytes.

With Mule, you don't have to do that. Sometimes you do need to specify that files are binary, of course. But even that is fairly rare. (Of course you have to disambiguate different flavors of ISO-8859 or EUC more or less by hand, but UTF-8 vs everything else is very accurately detected.)

    Ben> However, if you think about it, that is necessary, since
    Ben> there is no guarantee that every file input or output must be
    Ben> in utf-8 format, and forcing everything to be read as utf-8
    Ben> would cause errors on non-utf-8 binary files.

It's not necessary. That's what exception handlers are for. And most application programmers would never need to write such an exception handler; there would be a standard "autodetection" codec that handled this for them.[2]

Of course if you know they're non-UTF-8 binary, then explicitly forcing binary makes a lot of sense. That's what I advocate.

Footnotes:
[1] Well, admittedly when RMS finally approved the Mule merge into GNU Emacs, he insisted on putting in an on-off switch. But he's learned the error of his ways---there won't be one in Emacs 23. And XEmacs has never had one, except at compile time.

[2] If your encoding-specific codecs error properly on nonsense octet sequences, you could even implement this as a simple loop over a list of encoding-specific codecs at some expense of buffer space and time, neither of which is noticeable in 99% of Emacs applications. I would bet that 75% of Python and Perl applications would be the same.

-- 
School of Systems and Information Engineering http://turnbull.sk.tsukuba.ac.jp
University of Tsukuba                    Tennodai 1-1-1 Tsukuba 305-8573 JAPAN
               Ask not how you can "do" free software business;
              ask what your business can "do for" free software.
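The "simple loop over a list of encoding-specific codecs" from footnote [2] might look roughly like the following in Python 3. This is a sketch under stated assumptions, not the Mule or XEmacs implementation: the function name `autodetect_decode` and the candidate list are hypothetical, and the approach relies on each codec raising `UnicodeDecodeError` on octet sequences that are nonsense for it.

```python
from typing import Sequence, Tuple

# Hypothetical candidate order: UTF-8 first, since it rejects most
# non-UTF-8 input; the other entries are illustrative only.
CANDIDATES = ("utf-8", "euc_jp", "shift_jis")


def autodetect_decode(
    data: bytes, candidates: Sequence[str] = CANDIDATES
) -> Tuple[str, str]:
    """Return (text, codec_name) for the first codec that accepts data."""
    for name in candidates:
        try:
            # Strict decoding raises on malformed octet sequences.
            return data.decode(name), name
        except UnicodeDecodeError:
            continue  # this codec rejected the input; try the next one
    # No codec accepted the input: the caller should treat it as binary.
    raise ValueError("no candidate codec accepted the input")


text, codec = autodetect_decode("日本語".encode("euc_jp"))
print(codec)
```

The order of candidates matters. A codec that accepts every octet sequence (Latin-1, for example) would always "succeed", which is why flavors of ISO-8859 cannot be told apart this way and still need to be chosen by hand, as the message notes.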
- References:
  - [tlug] Japanese regex question
    - From: Jonathan Byrne
  - Re: [tlug] Japanese regex question
    - From: Tod McQuillin
  - Re: [tlug] Japanese regex question
    - From: Jonathan Byrne
  - Re: [tlug] Japanese regex question
    - From: Tod McQuillin
  - Re: [tlug] Japanese regex question
    - From: Stephen J. Turnbull
  - Re: [tlug] Japanese regex question
    - From: Ben K. Bullock