Re: [tlug] unicode and Perl- how to pass command line unicodearguments
- Date: Thu, 16 Feb 2006 19:14:33 +0900
- From: "Stephen J. Turnbull" <stephen@example.com>
- Subject: Re: [tlug] unicode and Perl- how to pass command line unicodearguments
- References: <43F12D6F.1020706@example.com> <30ce84360602141624p348b3cacm@example.com> <873bilfl15.fsf@example.com> <30ce84360602152252q1aea193ci@example.com> <30ce84360602152253sb631144j@example.com>
- Organization: The XEmacs Project
- User-agent: Gnus/5.1007 (Gnus v5.10.7) XEmacs/21.5-b23 (daikon, linux)
>>>>> "Ian" == Ian Wells <ijw@example.com> writes:

    Ian> So, the distinction being between an object that is a string
    Ian> of values representing text, and an object which is a string
    Ian> of values representing a string of values.

That's right.

    Ian> Both of which, I presume, work in most functions (otherwise
    Ian> the misuse you discuss wouldn't happen)

If you are programming for ASCII input, that will be true as long as
you're restricted to ASCII input.  The problem is that once you leave
that world, even for the upward-compatible world of UTF-8, you are
going to have problems.

    Ian> And the one you're most likely to use (u"", representing
    Ian> readable text) is the one that's harder to type.

That's right.  For backward compatibility reasons. :-(

    Ian> So Perl doesn't make the distinction and Python doesn't
    Ian> enforce it properly.

That's right.  Again, for backward compatibility, Python only enforces
it partly.

    Ian> Personally speaking,

Well, whatever floats your boat, of course.  If the programmer is
comfortable with a given discipline, why bother making a rule that
says you have to do it right when he already does?  The problem is
that when you deal with many programmers who prefer different
disciplines, or who are undisciplined in ways that don't hurt in the
original environment, you're going to have portability problems, and
POLA violations when the software gets into users' hands.

    Ian> I'd argue that since binary data is actually fairly uncommon,

"Actually", it's all over the place.  The first couple dozen bytes of
most XML input should be considered binary, then reread.  RFC 2822
headers are binary (EBCDIC and UTF-16 not allowed!  And "AW:" is not
the German translation of "Re:"; "Re:" is the German translation of
"Re:").  SMTP, of course, NNTP, HTTP, the list goes on and on.
Basically, anything that is a wire protocol is binary in the relevant
sense.
So you can't simply say "we will represent strings of 8-bit values as
an array of 16-bit values" (well, you can, but it would be horribly
inefficient to map from memory buffers to WC string buffers all the
time).

    Ian> Um.  I'm just saying that (in my head and in Perl) a string
    Ian> is a string is a string.  If you consider it mentally to be a
    Ian> list of numbers then it can contain either a language string
    Ian> or a binary data chunk without violating that assumption.

But you see, you *can't* think of a string as a list of numbers.
E.g., consider case-insensitive matching.  *This is nonsense in the
binary context.*  In the case of Unicode, a program *must* identify
(that is, treat as identical for collation purposes) U+00C5 LATIN
CAPITAL LETTER A WITH RING ABOVE and U+212B ANGSTROM SIGN, and both of
those with the composition of U+0041 LATIN CAPITAL LETTER A plus
U+030A COMBINING RING ABOVE.

I really don't see how to go from "a list of numbers is a list of
numbers" to DWIMming the case above.  This is a *real* case, reported
within the last couple of weeks here on TLUG (Kevin Hoang's post about
getting his Vietnamese accents decomposed).  My guess is that the Perl
community will spend the next few years fumbling about with "cut and
try".

    Ian> in my experience the result is that you read a utf8 file by
    Ian> setting the utf8 flag, write it similarly and what you do
    Ian> inbetween Just Works because you're dealing with strings that
    Ian> can contain all unicode characters, not bytearrays.

That's not surprising.  Perl has had twenty years to work its way
through the pain of making byte arrays DWYM in string contexts.  But
they didn't DWDM because David didn't think like a Perl program.  My
suggestion was that Python might match his expectations better, and he
replied he's comfortable learning the Perl Way (or at least one of the
Perl万道 ;-).

    Ian> I suppose it depends on your expectations.
    Ian> I like the 'my file's in unicode and my language understands
    Ian> that' approach; I don't see why you'd want a file you edit in
    Ian> unicode only for your language to consider it to be something
    Ian> else.

You wouldn't.  The problem is that there are binary protocols that
look like text, and there are binary protocols that represent text,
and DWIM is always a guess.  As David discovered.

    Ian> And that still seems to suggest that pretty much every string
    Ian> you're ever going to type into Python would need
    Ian> u"".toUnicode() (or whatever) when Perl would DWIM.

Of course not.  It's simply that Python gives you the option to do it
at the site (which I guess Perl does too, although the Python notation
allows you to use a string method, thus emphasizing the string literal
and not the coding method), and it doesn't allow you to do stuff like

    use utf8;
    $var = $_ . "ユニコードリテラル";

More precisely, Python coerces the byte string (its counterpart of $_)
to Unicode according to the default codec, which is normally
ASCII-only.  What would Perl do if $_ happened to contain
KOI8-R-encoded Cyrillic?  Just glom them together and cause a utf8
error eventually?

-- 
School of Systems and Information Engineering http://turnbull.sk.tsukuba.ac.jp
University of Tsukuba                    Tennodai 1-1-1 Tsukuba 305-8573 JAPAN
               Ask not how you can "do" free software business;
              ask what your business can "do for" free software.
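
Two points from the message above can be sketched in a few lines of Python 3. This is an illustration under stated assumptions, not code from the thread. The helper names `same_text` and `glom` are hypothetical, and the Cyrillic sample word is made up for the example. Note also that the message describes the Python of 2006, where mixing a byte string with a unicode string triggered an implicit decode using the ASCII default codec; Python 3 refuses the mix outright with a `TypeError`, which is the behaviour shown here.

```python
import unicodedata

# Three spellings of the "same" letter named in the message.
A_RING = "\u00C5"        # LATIN CAPITAL LETTER A WITH RING ABOVE
ANGSTROM = "\u212B"      # ANGSTROM SIGN
DECOMPOSED = "A\u030A"   # LATIN CAPITAL LETTER A + COMBINING RING ABOVE


def same_text(a: str, b: str) -> bool:
    """Hypothetical helper: compare two strings after NFC normalization."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


def glom(raw: bytes, encoding: str, literal: str) -> str:
    """Hypothetical helper: decode bytes with a named codec, then concatenate."""
    return raw.decode(encoding) + literal


if __name__ == "__main__":
    # Viewed as lists of numbers, the three spellings are all different.
    print(len({A_RING, ANGSTROM, DECOMPOSED}))
    # Viewed as text, they are canonically equivalent.
    print(same_text(A_RING, ANGSTROM), same_text(A_RING, DECOMPOSED))

    # Assumed sample input: KOI8-R bytes as they might arrive from a pipe.
    cyrillic = "Привет".encode("koi8_r")
    try:
        cyrillic + "ユニコードリテラル"  # mixing bytes and str is rejected
    except TypeError as exc:
        print("TypeError:", exc)
    # Naming the codec at the site makes the concatenation well defined.
    print(glom(cyrillic, "koi8_r", "ユニコードリテラル"))
```

The design point is the one made in the message: the codec is stated where the bytes become text, so nothing has to guess whether a given run of bytes is KOI8-R, UTF-8, or not text at all.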