Mailing List Archive
Re: [tlug] Japanese regex question
- Date: Mon, 29 Aug 2005 16:46:03 +0900
- From: "Ben K. Bullock" <benkasminbullock@example.com>
- Subject: Re: [tlug] Japanese regex question
- References: <200508241701.55144.jq@example.com><20050825183913.O88704@example.com><200508251253.47083.jq@example.com><20050826113217.J88704@example.com> <87zmr2me23.fsf@example.com>
> We had a post on this kind of issue (ambiguous matches in UTF-8) a
> couple months back, too. It's worth trying to remember this one.

I don't understand how it's possible to have an ambiguous UTF-8 match. It's always clear which byte of a UTF-8 string is the first of each character. It's a stateless encoding. The original poster was using iso-2022-jp, where you have to keep track of the state: what was the most recent escape sequence? He could just convert to a stateless encoding like UTF-8 or even EUC-JP, using iconv (in C) or the Text::Iconv module in Perl, and he wouldn't have the problem any more.

http://www.cl.cam.ac.uk/~mgk25/unicode.html#utf-8 should explain why it's virtually impossible to miss where a Unicode character starts.

> This is all so stupid. XEmacs has been doing this (badly) for almost
> a decade, Mule for another 3 or 4 years longer than that. Why Perl
> and Python failed to seize the opportunity to do it right when they
> added Unicode support I'll never know.

Mule didn't do Unicode for a long time; certainly ten years ago there was absolutely no support for it. I was corresponding with the author of Mule, Mr Handa, about eight years ago, and he had no support at all for Unicode then. He was using his own internal coding called "emacs-mule" to get all the different Chinese, Japanese and Korean character sets working. Unfortunately, it is still not working perfectly.

On the other hand, Perl 5.8's Unicode support is done exactly right as far as I can see. Switch it on with "use utf8" and regular expressions, strings, etc. all treat one Unicode character as a single unit, just as they used to treat one byte. It's said that the developers are going to make even "use utf8" unnecessary. I've no idea about Python.

The only tricky thing with Perl is input and output: it's necessary to specify that an input or output file is UTF-8 using binmode, otherwise it gets treated as bytes. However, if you think about it, that is necessary, since there is no guarantee that every input or output file is in UTF-8, and forcing everything to be read as UTF-8 would cause errors on binary files that aren't UTF-8.

B. Bullock.
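
As an illustration of the stateless versus stateful point above, here is a minimal sketch in Python (not part of the original message; the function name `utf8_char_starts` and the sample text are made up for the example). It assumes Python 3 with its standard `iso2022_jp` codec. It shows that character starts in UTF-8 can be found by looking at each byte on its own, and that a byte-level search on iso-2022-jp data can hit an ASCII letter that is not in the text at all, which goes away after converting to UTF-8.

```python
def utf8_char_starts(data):
    """Return the byte offsets where a character begins in UTF-8 data."""
    # Continuation bytes always have the bit pattern 10xxxxxx, so any
    # byte that does NOT match that pattern starts a new character.
    # No state from earlier bytes is needed to decide this.
    return [i for i, b in enumerate(data) if (b & 0xC0) != 0x80]


text = "日本語"                      # sample text, contains no letter "K"
utf8 = text.encode("utf-8")
jis = text.encode("iso2022_jp")      # escape sequences + 7-bit byte pairs

# Each of these kanji takes three bytes in UTF-8.
print(utf8_char_starts(utf8))

# iso-2022-jp is entirely 7-bit: the kanji are encoded with bytes that
# are also valid ASCII letters and punctuation, so what a byte means
# depends on the most recent escape sequence.
print(all(b < 0x80 for b in jis))

# A naive byte-level search finds "K" in the iso-2022-jp bytes even
# though the text has no "K"; the same search on UTF-8 finds nothing.
print(b"K" in jis, b"K" in utf8)

# Converting to UTF-8 first (the job iconv or Text::Iconv would do).
converted = jis.decode("iso2022_jp").encode("utf-8")
print(converted == utf8)
```

The same conversion step is what iconv or Text::Iconv would perform; the Python codec is used here only to keep the sketch self-contained.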
- Follow-Ups:
- Re: [tlug] Japanese regex question
- From: Stephen J. Turnbull
- References:
- [tlug] Japanese regex question
- From: Jonathan Byrne
- Re: [tlug] Japanese regex question
- From: Tod McQuillin
- Re: [tlug] Japanese regex question
- From: Jonathan Byrne
- Re: [tlug] Japanese regex question
- From: Tod McQuillin
- Re: [tlug] Japanese regex question
- From: Stephen J. Turnbull