Mailing List Archive


Re: [tlug] Re: Piping stderr?

At 27 Jun 2002 19:09:21 +0900,
Stephen J. Turnbull wrote:
>> Unicode is not THE codeset, but ONE OF CODESETS.
> I think you have been listening to Ohta-san's propaganda for too long.

 ???? Who is that?  I really don't know who he is (oh, maybe a Unicode hater :-)
and I have never heard any propaganda.

> Unicode doesn't have "lots" of characters.  CNS 11643 has "lots" of
> characters, more than Unicode 2.1 (and probably more than 3.2, but I
> haven't checked).  Unicode _is_ THE Universal Character Set (UCS).
> Plus a whole bunch of essential algorithms for handling text.

 So?  It is still one of the codesets.
I may invent GCS (Galaxy Character Set).

> And if you really really have to have some character (or character
> set) that Unicode doesn't provide, there are a hundred thousand
> private space code points reserved for _you personally_.  What's the
> problem?

 Oh I'm very sad, I can't exchange my super nice thesis with you,
because I needed to use my private area.. if only I can use other

> Exactly.  But one of the reasons it doesn't fallback into C may very
> well be because the I18N library _thinks_ it's OK, but it's broken.
> This is not sufficient reason for my system to crash; dunno how you
> feel about that....

 Ah, it's a library bug.  It has nothing to do with CSI design.
Fix the library so that it notices it's broken :-)

>> If so, this IS the UTF-8 hard coded programs issue.
> Who said "hard code" UTF-8?  In fact, I don't need that the programs
> I'm talking about to _ever_ interpret UTF-8.  They interpret ASCII;
> anything containing non-ASCII is part of a string or a comment, and
> will be passed on verbatim or ignored.  Validation, if necessary,
> should be done by other programs or the library functions called.  All
> that needs to be hard-coded is recognition of a character:
> /* yes, I know there are much faster table-driven ways to do this */
> if ((*p & 0x80) == 0x00)        /* ASCII */
>   length = 1;
> else if ((*p & 0xE0) == 0xC0)   /* multibyte */
>   length = 2;
> else if ((*p & 0xF0) == 0xE0)
>   length = 3;
> else if ((*p & 0xF8) == 0xF0)
>   length = 4;
> else if ((*p & 0xFC) == 0xF8)
>   length = 5;
> else if ((*p & 0xFE) == 0xFC)
>   length = 6;
> else                            /* illegal first byte, including 10xxxxxx */
>   abort();
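
For reference, the quoted lead-byte check could be written as a self-contained C function roughly like the sketch below. The names utf8_seq_len and utf8_count are hypothetical and not from the thread. Each mask test is parenthesized because == binds tighter than & in C. The function returns 0 instead of calling abort(), and it keeps the 5- and 6-byte forms of the original UTF-8 definition that the quote uses. It looks only at lead bytes and does not validate continuation bytes.

```c
#include <stddef.h>

/* Sequence length implied by a UTF-8 lead byte, or 0 if the byte cannot
 * start a sequence (10xxxxxx continuation bytes, 0xFE, 0xFF). */
size_t utf8_seq_len(unsigned char b)
{
    if ((b & 0x80) == 0x00) return 1;   /* 0xxxxxxx: ASCII */
    if ((b & 0xE0) == 0xC0) return 2;   /* 110xxxxx */
    if ((b & 0xF0) == 0xE0) return 3;   /* 1110xxxx */
    if ((b & 0xF8) == 0xF0) return 4;   /* 11110xxx */
    if ((b & 0xFC) == 0xF8) return 5;   /* 111110xx (original definition) */
    if ((b & 0xFE) == 0xFC) return 6;   /* 1111110x (original definition) */
    return 0;
}

/* Count characters in a NUL-terminated string by lead byte only.
 * Returns (size_t)-1 on an illegal lead byte or a truncated sequence. */
size_t utf8_count(const char *s)
{
    size_t n = 0;
    while (*s != '\0') {
        size_t len = utf8_seq_len((unsigned char)*s);
        if (len == 0)
            return (size_t)-1;
        for (size_t i = 1; i < len; i++)
            if (s[i] == '\0')           /* string ends mid-sequence */
                return (size_t)-1;
        s += len;
        n++;
    }
    return n;
}
```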

 That can be done by the library according to the locale if you write a CSI program.

>> If you have ten UTF-8 hard coded programs, you have to fix
>> each programs.  On the other hand, on CSI design just fix
>> library.  Programs don't need to be modified anything.
> Wrong.  Dangerous, ugly stuff like Shift JIS will be wandering around
> _inside_ my program.  To handle it correctly, I will need extra code.
> In _all_ my CSI programs.

 ????  What is wrong?  Which sentence?  Or the whole paragraph?
I don't understand.  Sorry ;-).

 And do you know what CSI is?  Or, let's say, wchar_t?
You don't need to care whether it is SJIS/EUC/UTF-8; that's CSI.
I'm very confused.

>> Even if this is not what you mentioned, it shows the bad
>> thing of UTF-8 hard code programs.
> I'm not advocating doing _anything_ by hard-coding in each program.

 But you wrote a hard-coded program just above....

> I'm advocating that simple applications that need to be robust should
> restrict themselves to a single small library intended to do just one
> well-defined thing well: process Unicode character streams, character
> by character.  No bidi, no composed characters, no interpretation of
> surrogates (illegal in UTF-8 but I don't need to care).  And no
> steenkin' Shift JIS, Big Five, or NEC kanji.

 Ah, this IS your point.  Finally, I get you.
Then use Unicode for your own purpose.
Unicode may be the best solution for you, but not for everybody.

> Are you talking about SETI?[1]  No sane earthling will design a
> character set to be incompatible with Unicode ever again.

 GB18030.  But you don't have to worry about it.
You are happy with Unicode, I know.  And not even UCS-4; I'm very impressed.

> CSI means that arbitrarily stupid character encodings (Shift JIS is a
> leading example) can get inside my program.  This means that _my_
> program needs to deal with _their_ brain damage.  I don't want my
> program to ever deal with Shift JIS.  If my users want to see Shift
> JIS, I'll translate at the program boundary.  I don't have a problem
> with that.

 Again, do you know what CSI is?
But fortunately, you have the choice not to use SJIS,
so you don't need to worry about SJIS going into your CSI programs :-).

Jiro SEKIBA | Web tools & AP Linux Competency Center, YSL, IBM Japan
