
Mailing List Archive
Re: [tlug] Re: Piping stderr?
At 27 Jun 2002 19:09:21 +0900,
Stephen J. Turnbull <stephen@example.com> wrote:
>> Unicode is not THE codeset, but ONE OF CODESETS.
>
> I think you have been listening to Ohta-san's propaganda for too long.
???? Who is that? I really don't know who he is (oh, maybe a Unicode hater :-)
and I have never heard any propaganda.
> Unicode doesn't have "lots" of characters. CNS 11643 has "lots" of
> characters, more than Unicode 2.1 (and probably more than 3.2, but I
> haven't checked). Unicode _is_ THE Universal Character Set (UCS).
> Plus a whole bunch of essential algorithms for handling text.
So? It is still one of the codesets.
I might invent a GCS (Galaxy Character Set).
> And if you really really have to have some character (or character
> set) that Unicode doesn't provide, there are a hundred thousand
> private space code points reserved for _you personally_. What's the
> problem?
Oh, I'm very sad: I can't exchange my super nice thesis with you,
because I needed to use my private area... if only I could use another
encoding...
> Exactly. But one of the reasons it doesn't fallback into C may very
> well be because the I18N library _thinks_ it's OK, but it's broken.
> This is not sufficient reason for my system to crash; dunno how you
> feel about that....
Ah, that's a library bug. It has nothing to do with CSI (codeset-independent) design.
Fix the library so that it thinks it's broken when it is :-)
>> If so, this IS the UTF-8 hard coded programs issue.
>
> Who said "hard code" UTF-8? In fact, I don't need the programs
> I'm talking about to _ever_ interpret UTF-8. They interpret ASCII;
> anything containing non-ASCII is part of a string or a comment, and
> will be passed on verbatim or ignored. Validation, if necessary,
> should be done by other programs or the library functions called. All
> that needs to be hard-coded is recognition of a character:
>
> /* yes, I know there are much faster table-driven ways to do this */
> /* == binds tighter than &, so each masked test needs its own parentheses */
> if ((*p & 0x80) == 0x00)        /* ASCII */
>     length = 1;
> else if ((*p & 0xE0) == 0xC0)   /* multibyte */
>     length = 2;
> else if ((*p & 0xF0) == 0xE0)
>     length = 3;
> else if ((*p & 0xF8) == 0xF0)
>     length = 4;
> else if ((*p & 0xFC) == 0xF8)
>     length = 5;
> else if ((*p & 0xFE) == 0xFC)
>     length = 6;
> else                            /* illegal first byte, including 10xxxxxx */
>     abort();
That can be done by the library according to the locale, if you write a CSI program.
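As a rough sketch of what "done by the library according to the locale" could look like, the following uses the standard C function mbrlen() to step through a string one character at a time. The helper name count_chars is hypothetical, not something from this thread, and the sketch assumes the caller has already selected a locale.

```c
#include <string.h>
#include <wchar.h>

/* Hypothetical helper: count the characters in s according to the
 * current LC_CTYPE locale. Returns -1 if the library reports an
 * invalid or incomplete multibyte sequence. */
static long count_chars(const char *s)
{
    mbstate_t state;
    size_t left = strlen(s);
    long count = 0;

    memset(&state, 0, sizeof state);    /* start in the initial state */
    while (left > 0) {
        /* The library, not this program, knows the encoding rules. */
        size_t n = mbrlen(s, left, &state);
        if (n == (size_t)-1 || n == (size_t)-2)
            return -1;
        if (n == 0)                     /* defensive: null character */
            n = 1;
        s += n;
        left -= n;
        count++;
    }
    return count;
}
```

A caller would typically run setlocale(LC_CTYPE, "") once at startup; without that call the program stays in the "C" locale, where only ASCII input is portable.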
>> If you have ten UTF-8 hard-coded programs, you have to fix
>> each program. On the other hand, with a CSI design you just fix
>> the library. The programs don't need to be modified at all.
>
> Wrong. Dangerous, ugly stuff like Shift JIS will be wandering around
> _inside_ my program. To handle it correctly, I will need extra code.
> In _all_ my CSI programs.
???? What is wrong? Which sentence? Or the whole paragraph?
I don't understand. Sorry ;-).
And do you know what CSI is? Or, let's say, wchar_t?
You don't need to care whether it is SJIS/EUC/UTF-8; that's CSI.
I'm very confused.
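To make the wchar_t point concrete, here is a minimal sketch of converting locale-encoded bytes to wide characters at the input boundary with the standard C function mbsrtowcs(). The helper name to_wide is hypothetical. Which external encoding is being decoded depends entirely on the locale in effect, so SJIS, EUC, or UTF-8 never appears in the program text.

```c
#include <stdlib.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical helper: convert a multibyte string in the current
 * locale's encoding to a newly allocated wide string.
 * Returns NULL on an invalid sequence or allocation failure;
 * the caller frees the result. */
static wchar_t *to_wide(const char *s)
{
    mbstate_t state;
    const char *src = s;
    size_t n;
    wchar_t *w;

    /* First pass: a NULL destination only measures the length. */
    memset(&state, 0, sizeof state);
    n = mbsrtowcs(NULL, &src, 0, &state);
    if (n == (size_t)-1)
        return NULL;

    w = malloc((n + 1) * sizeof *w);
    if (w == NULL)
        return NULL;

    /* Second pass: convert, including the terminating null. */
    src = s;
    memset(&state, 0, sizeof state);
    mbsrtowcs(w, &src, n + 1, &state);
    return w;
}
```

After this step the program indexes and compares wchar_t values only. What a given wchar_t value means is implementation-defined, which is part of what the other side of this thread objects to.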
>> Even if this is not what you meant, it shows the bad
>> side of UTF-8 hard-coded programs.
>
> I'm not advocating doing _anything_ by hard-coding in each program.
But you wrote a hard-coded program just above....
> I'm advocating that simple applications that need to be robust should
> restrict themselves to a single small library intended to do just one
> well-defined thing well: process Unicode character streams, character
> by character. No bidi, no composed characters, no interpretation of
> surrogates (illegal in UTF-8 but I don't need to care). And no
> steenkin' Shift JIS, Big Five, or NEC kanji.
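For illustration, a "small library that does one thing" in the sense described above might be little more than the following sketch. The function names are hypothetical. It assumes the 1 to 6 byte lead-byte scheme from the snippet quoted earlier (modern UTF-8 per RFC 3629 stops at 4 bytes), and it checks only lead and continuation bytes; it does not reject overlong forms or surrogates.

```c
#include <stddef.h>

/* Sequence length implied by a lead byte, or 0 if the byte cannot
 * start a sequence (10xxxxxx, 0xFE, 0xFF). */
static int utf8_lead_length(unsigned char c)
{
    if ((c & 0x80) == 0x00) return 1;   /* ASCII */
    if ((c & 0xE0) == 0xC0) return 2;
    if ((c & 0xF0) == 0xE0) return 3;
    if ((c & 0xF8) == 0xF0) return 4;
    if ((c & 0xFC) == 0xF8) return 5;
    if ((c & 0xFE) == 0xFC) return 6;
    return 0;
}

/* Count the characters in the n bytes at p, character by character.
 * Returns -1 on a bad lead byte, a truncated sequence, or a
 * continuation byte that is not of the form 10xxxxxx. */
static long utf8_count(const unsigned char *p, size_t n)
{
    long count = 0;

    while (n > 0) {
        int len = utf8_lead_length(*p);
        int i;

        if (len == 0 || (size_t)len > n)
            return -1;
        for (i = 1; i < len; i++)
            if ((p[i] & 0xC0) != 0x80)
                return -1;
        p += len;
        n -= (size_t)len;
        count++;
    }
    return count;
}
```

Returning an error instead of calling abort() is a design choice made for this sketch, so that the caller decides how to handle malformed input.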
Ah, this IS your point; finally, I get you.
Then use Unicode for your own purposes.
Unicode may be the best solution for you, but not for everybody.
> Are you talking about SETI?[1] No sane earthling will design a
> character set to be incompatible with Unicode ever again.
GB18030. But you don't have to worry about it.
You are happy with Unicode, I know. And not even UCS-4; I'm very impressed.
> CSI means that arbitrarily stupid character encodings (Shift JIS is a
> leading example) can get inside my program. This means that _my_
> program needs to deal with _their_ brain damage. I don't want my
> program to ever deal with Shift JIS. If my users want to see Shift
> JIS, I'll translate at the program boundary. I don't have a problem
> with that.
Again, do you know what CSI is?
Fortunately, you can choose not to use SJIS,
so you don't need to worry about SJIS getting into your CSI programs :-).
--
Jiro SEKIBA | Web tools & AP Linux Competency Center, YSL, IBM Japan
| email: jir@example.com, jir@example.com