Mailing List Archive



Re: [tlug] Re: Piping stderr?



>>>>> "Jiro" == Jiro SEKIBA <jir@example.com> writes:

    Jiro> Unicode is not THE codeset, but ONE OF CODESETS.

I think you have been listening to Ohta-san's propaganda for too long.

Unicode doesn't have "lots" of characters.  CNS 11643 has "lots" of
characters, more than Unicode 2.1 (and probably more than 3.2, but I
haven't checked).  Unicode _is_ THE Universal Character Set (UCS).
Plus a whole bunch of essential algorithms for handling text.

And if you really, really have to have some character (or character
set) that Unicode doesn't provide, there are more than a hundred
thousand private-use code points reserved for _you personally_.
What's the problem?

    Jiro>  I don't understand what you mean 'when', but it will just
    Jiro> automatically fallback into 'C' locale, and continue
    Jiro> working.  If it can't fallback into C", C library is broken,
    Jiro> it means whole system already downed ;-).

Exactly.  But one of the reasons it doesn't fall back to C may very
well be that the I18N library _thinks_ it's OK when it's actually
broken.  This is not sufficient reason for my system to crash; dunno
how you feel about that....

    Jiro> If so, this IS the UTF-8 hard coded programs issue.

Who said "hard code" UTF-8?  In fact, I don't need the programs I'm
talking about to _ever_ interpret UTF-8.  They interpret ASCII;
anything containing non-ASCII is part of a string or a comment, and
will be passed on verbatim or ignored.  Validation, if necessary,
should be done by other programs or the library functions called.  All
that needs to be hard-coded is recognition of where each character
starts and how many bytes it occupies:

/* yes, I know there are much faster table-driven ways to do this */
/* p is assumed to be an unsigned char *; == binds tighter than & in C,
   so each mask needs its own parentheses */
if ((*p & 0x80) == 0x00)        /* ASCII */
  length = 1;
else if ((*p & 0xE0) == 0xC0)   /* multibyte */
  length = 2;
else if ((*p & 0xF0) == 0xE0)
  length = 3;
else if ((*p & 0xF8) == 0xF0)
  length = 4;
else if ((*p & 0xFC) == 0xF8)
  length = 5;
else if ((*p & 0xFE) == 0xFC)
  length = 6;
else                            /* illegal first byte, including 10xxxxxx */
  abort();

This is not rocket science, and it is not going to change, ever.
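
To make the "character by character" part concrete, here is a rough
sketch of how that first-byte test might be wrapped up and used to walk
a buffer.  The names utf8_seq_length and utf8_count_chars are made up
for illustration, and returning an error value instead of calling
abort() is my own assumption about what a caller would want.  Note the
parentheses around each mask: == binds tighter than & in C.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: number of bytes in the sequence introduced by
   `first`, or 0 if `first` cannot start a sequence (10xxxxxx, 0xFE, 0xFF). */
static size_t utf8_seq_length(unsigned char first)
{
    if ((first & 0x80) == 0x00) return 1;   /* ASCII */
    if ((first & 0xE0) == 0xC0) return 2;   /* multibyte */
    if ((first & 0xF0) == 0xE0) return 3;
    if ((first & 0xF8) == 0xF0) return 4;
    if ((first & 0xFC) == 0xF8) return 5;
    if ((first & 0xFE) == 0xFC) return 6;
    return 0;                               /* illegal first byte */
}

/* Hypothetical helper: count the characters in the n-byte buffer s.
   Returns (size_t)-1 on an illegal first byte or a truncated sequence.
   Trailing bytes are skipped, not validated; that job is left to
   other code, as described above. */
static size_t utf8_count_chars(const unsigned char *s, size_t n)
{
    size_t i = 0, count = 0;

    assert(s != NULL);
    while (i < n) {
        size_t len = utf8_seq_length(s[i]);
        if (len == 0 || len > n - i)
            return (size_t)-1;
        i += len;
        count++;
    }
    return count;
}
```

A caller that only cares about ASCII can look at s[i] whenever the
length is 1 and copy the other sequences through untouched.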

    Jiro> If you have ten UTF-8 hard coded programs, you have to fix
    Jiro> each programs.  On the other hand, on CSI design just fix
    Jiro> library.  Programs don't need to be modified anything.

Wrong.  Dangerous, ugly stuff like Shift JIS will be wandering around
_inside_ my program.  To handle it correctly, I will need extra code.
In _all_ my CSI programs.
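
A small illustration of the kind of extra code I mean, as a sketch
only.  The helper count_byte and the array names are invented for this
example.  The kanji U+8868 is 0x95 0x5C in Shift JIS, so its second
byte is indistinguishable from an ASCII backslash, while in UTF-8 it is
0xE8 0xA1 0xA8 and every byte has the high bit set.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: count occurrences of byte b in the n-byte
   buffer s, the way a naive ASCII-only scanner hunts for a delimiter. */
static size_t count_byte(const unsigned char *s, size_t n, unsigned char b)
{
    size_t i, hits = 0;

    assert(s != NULL);
    for (i = 0; i < n; i++)
        if (s[i] == b)
            hits++;
    return hits;
}

/* The single character U+8868 in each encoding. */
static const unsigned char hyou_sjis[] = { 0x95, 0x5C };
static const unsigned char hyou_utf8[] = { 0xE8, 0xA1, 0xA8 };
```

Scanning hyou_sjis for '\\' reports one backslash that is not there;
scanning hyou_utf8 reports none.  With UTF-8 inside the program the
naive scanner is already correct, whereas with Shift JIS it has to
track lead bytes everywhere it looks for an ASCII delimiter.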

    Jiro> Even if this is not what you mentioned, it shows the bad
    Jiro> thing of UTF-8 hard code programs.

I'm not advocating doing _anything_ by hard-coding in each program.
I'm advocating that simple applications that need to be robust should
restrict themselves to a single small library intended to do just one
well-defined thing well: process Unicode character streams, character
by character.  No bidi, no composed characters, no interpretation of
surrogates (illegal in UTF-8 but I don't need to care).  And no
steenkin' Shift JIS, Big Five, or NEC kanji.

    Jiro> ##http://support.microsoft.com/default.aspx?scid=%2Fisapi%2Fgomscom%2Easp%3Ftarget%3D%2Fjapan%2Fsupport%2Fkb%2Farticles%2Fjp170%2F5%2F59%2Easp&LN=JA
    Jiro> ###BTW I do not much care about this Win issue ;-p, it's
    Jiro> just a example.

Oh, _that_.  Of course you can't round trip when the coded character
set _intentionally_ provides multiple code points for the same
character.  Unless you go out of your way to cater to the brain-damage
(cf. the full/half-width compatibility characters in the FF row of the
BMP).

This is _exactly_ the kind of junk you don't have to worry about if
you restrict internal text to Unicode.

    >> Other than that, there are no efforts I know of.  Again, be
    >> concrete.

    Jiro> I'm NOT talking about the problem of NOW.  Who knows it's
    Jiro> never happen?

Are you talking about SETI?[1]  No sane earthling will design a
character set to be incompatible with Unicode ever again.

    Jiro> What seems to be the problem?

CSI means that arbitrarily stupid character encodings (Shift JIS is a
leading example) can get inside my program.  This means that _my_
program needs to deal with _their_ brain damage.  I don't want my
program to ever deal with Shift JIS.  If my users want to see Shift
JIS, I'll translate at the program boundary.  I don't have a problem
with that.

Fewer lines of code, fewer libraries, means fewer things to go wrong.


Footnotes: 
[1]  Search for Extra-Terrestrial Intelligence.

-- 
Institute of Policy and Planning Sciences     http://turnbull.sk.tsukuba.ac.jp
University of Tsukuba                    Tennodai 1-1-1 Tsukuba 305-8573 JAPAN
 My nostalgia for Icon makes me forget about any of the bad things.  I don't
have much nostalgia for Perl, so its faults I remember.  Scott Gilbert c.l.py

