
Mailing List Archive
Re: [tlug] Re: Piping stderr?
>>>>> "Jiro" == Jiro SEKIBA <jir@example.com> writes:
Jiro> Unicode is not THE codeset, but ONE OF CODESETS.
I think you have been listening to Ohta-san's propaganda for too long.
Unicode doesn't have "lots" of characters. CNS 11643 has "lots" of
characters, more than Unicode 2.1 (and probably more than 3.2, but I
haven't checked). Unicode _is_ THE Universal Character Set (UCS).
Plus a whole bunch of essential algorithms for handling text.
And if you really really have to have some character (or character
set) that Unicode doesn't provide, there are well over a hundred
thousand private-use code points reserved for _you personally_. What's
the problem?
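For reference, the private-use space consists of three ranges: U+E000..U+F8FF in the BMP, plus planes 15 and 16 minus their last two code points each. A minimal sketch of a membership test follows; the function name is made up for illustration and is not from any library.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: true if cp lies in one of Unicode's three
 * private-use ranges. */
bool is_private_use(uint32_t cp)
{
    return (cp >= 0xE000   && cp <= 0xF8FF)     /* BMP Private Use Area */
        || (cp >= 0xF0000  && cp <= 0xFFFFD)    /* plane 15 */
        || (cp >= 0x100000 && cp <= 0x10FFFD);  /* plane 16 */
}
```

That is 6,400 + 65,534 + 65,534 = 137,468 code points in total.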
Jiro> I don't understand what you mean 'when', but it will just
Jiro> automatically fallback into 'C' locale, and continue
Jiro> working. If it can't fallback into C", C library is broken,
Jiro> it means whole system already downed ;-).
Exactly. But one reason it might not fall back to C is that the I18N
library _thinks_ it's OK when it's actually broken. That is not
sufficient reason for my system to crash; dunno how you feel about
that....
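To make the fallback concrete, here is a rough sketch of what a program can do at startup. It assumes only the standard setlocale() interface; init_locale is a hypothetical name. Note that it catches only the case where the library _reports_ failure. A library that believes it loaded the locale correctly but is in fact broken sails straight through this check.

```c
#include <locale.h>
#include <stdio.h>

/* Hypothetical startup helper: try the user's locale; if the library
 * refuses it, fall back to "C" explicitly and say so.
 * Returns the name of the locale now in effect. */
const char *init_locale(void)
{
    if (setlocale(LC_ALL, "") == NULL) {
        fprintf(stderr, "warning: locale not supported, using \"C\"\n");
        setlocale(LC_ALL, "C");
    }
    return setlocale(LC_ALL, NULL);   /* query only, changes nothing */
}
```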
Jiro> If so, this IS the UTF-8 hard coded programs issue.
Who said "hard code" UTF-8? In fact, I don't need the programs I'm
talking about to _ever_ interpret UTF-8. They interpret ASCII;
anything containing non-ASCII is part of a string or a comment, and
will be passed on verbatim or ignored. Validation, if necessary,
should be done by other programs or the library functions called. All
that needs to be hard-coded is recognition of a character:
    /* yes, I know there are much faster table-driven ways to do this */
    /* the inner parentheses matter: in C, == binds tighter than & */
    if ((*p & 0x80) == 0x00)        /* ASCII */
        length = 1;
    else if ((*p & 0xE0) == 0xC0)   /* multibyte */
        length = 2;
    else if ((*p & 0xF0) == 0xE0)
        length = 3;
    else if ((*p & 0xF8) == 0xF0)
        length = 4;
    else if ((*p & 0xFC) == 0xF8)
        length = 5;
    else if ((*p & 0xFE) == 0xFC)
        length = 6;
    else                            /* illegal first byte, including 10xxxxxx */
        abort();
This is not rocket science, and it is not going to change, ever.
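For the sake of concreteness, the same test can be wrapped in a function and used to step through a buffer one character at a time. This is only a sketch: utf8_seq_len and utf8_count are made-up names, the code returns an error instead of calling abort(), and trailing bytes are skipped without being examined, on the assumption that validation happens elsewhere.

```c
#include <stddef.h>

/* Hypothetical helper: sequence length implied by a lead byte, or 0 if
 * the byte cannot start a character (10xxxxxx, 0xFE, 0xFF). */
int utf8_seq_len(unsigned char b)
{
    if ((b & 0x80) == 0x00) return 1;   /* ASCII */
    if ((b & 0xE0) == 0xC0) return 2;
    if ((b & 0xF0) == 0xE0) return 3;
    if ((b & 0xF8) == 0xF0) return 4;
    if ((b & 0xFC) == 0xF8) return 5;
    if ((b & 0xFE) == 0xFC) return 6;
    return 0;
}

/* Hypothetical helper: count the characters in s[0..n). Returns -1 on an
 * illegal lead byte or a sequence cut off by the end of the buffer. */
long utf8_count(const unsigned char *s, size_t n)
{
    long count = 0;
    size_t i = 0;

    while (i < n) {
        int len = utf8_seq_len(s[i]);
        if (len == 0 || (size_t)len > n - i)
            return -1;
        i += (size_t)len;               /* skip trailing bytes unexamined */
        count++;
    }
    return count;
}
```

A program that only cares about ASCII syntax can use the same loop to copy every multibyte sequence through untouched.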
Jiro> If you have ten UTF-8 hard coded programs, you have to fix
Jiro> each programs. On the other hand, on CSI design just fix
Jiro> library. Programs don't need to be modified anything.
Wrong. Dangerous, ugly stuff like Shift JIS will be wandering around
_inside_ my program. To handle it correctly, I will need extra code.
In _all_ my CSI programs.
Jiro> Even if this is not what you mentioned, it shows the bad
Jiro> thing of UTF-8 hard code programs.
I'm not advocating doing _anything_ by hard-coding in each program.
I'm advocating that simple applications that need to be robust should
restrict themselves to a single small library intended to do just one
well-defined thing well: process Unicode character streams, character
by character. No bidi, no composed characters, no interpretation of
surrogates (illegal in UTF-8 but I don't need to care). And no
steenkin' Shift JIS, Big Five, or NEC kanji.
Jiro> ##http://support.microsoft.com/default.aspx?scid=%2Fisapi%2Fgomscom%2Easp%3Ftarget%3D%2Fjapan%2Fsupport%2Fkb%2Farticles%2Fjp170%2F5%2F59%2Easp&LN=JA
Jiro> ###BTW I do not much care about this Win issue ;-p, it's
Jiro> just a example.
Oh, _that_. Of course you can't round trip when the coded character
set _intentionally_ provides multiple code points for the same
character. Unless you go out of your way to cater to the brain damage
(cf. the full-/half-width compatibility characters in row FF of the BMP).
This is _exactly_ the kind of junk you don't have to worry about if
you restrict internal text to Unicode.
>> Other than that, there are no efforts I know of. Again, be
>> concrete.
Jiro> I'm NOT talking about the problem of NOW. Who knows it's
Jiro> never happen?
Are you talking about SETI?[1] No sane earthling will design a
character set to be incompatible with Unicode ever again.
Jiro> What seems to be the problem?
CSI means that arbitrarily stupid character encodings (Shift JIS is a
leading example) can get inside my program. This means that _my_
program needs to deal with _their_ brain damage. I don't want my
program to ever deal with Shift JIS. If my users want to see Shift
JIS, I'll translate at the program boundary. I don't have a problem
with that.
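Translating at the boundary might look roughly like the following. It is a sketch that assumes the POSIX iconv(3) interface and an implementation that knows the encoding name passed in (for example "SHIFT_JIS", which not every installation provides). The name to_utf8 is hypothetical.

```c
#include <iconv.h>
#include <stddef.h>

/* Hypothetical boundary helper: convert inlen bytes of `in`, encoded as
 * `fromcode`, into UTF-8 in `out` (capacity outcap).
 * Returns the number of bytes written, or (size_t)-1 on any failure. */
size_t to_utf8(const char *fromcode, const char *in, size_t inlen,
               char *out, size_t outcap)
{
    iconv_t cd = iconv_open("UTF-8", fromcode);
    if (cd == (iconv_t)-1)
        return (size_t)-1;              /* encoding not supported here */

    char *inp = (char *)in;             /* iconv() takes char **, not const */
    char *outp = out;
    size_t inleft = inlen, outleft = outcap;

    size_t rc = iconv(cd, &inp, &inleft, &outp, &outleft);
    if (rc != (size_t)-1)               /* flush any pending shift state */
        rc = iconv(cd, NULL, NULL, &outp, &outleft);
    iconv_close(cd);

    if (rc == (size_t)-1)
        return (size_t)-1;              /* bad input or output buffer full */
    return outcap - outleft;
}
```

Everything past this call sees only UTF-8; the legacy encoding never gets inside the program.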
Fewer lines of code, fewer libraries, means fewer things to go wrong.
Footnotes:
[1] Search for Extra-Terrestrial Intelligence.
--
Institute of Policy and Planning Sciences http://turnbull.sk.tsukuba.ac.jp
University of Tsukuba Tennodai 1-1-1 Tsukuba 305-8573 JAPAN
My nostalgia for Icon makes me forget about any of the bad things. I don't
have much nostalgia for Perl, so its faults I remember. Scott Gilbert c.l.py