Mailing List Archive

Re: [tlug] Re: Piping stderr?

At 28 Jun 2002 16:29:57 +0900,
Stephen J. Turnbull <> wrote:

> Sorry about the snippy response, but I was getting tired of going
> around in circles.

 That's O.K. :-)  It may be my fault as well; my wording might not
have been suitable.  Sorry about that.

>  Do take a look at your own diagram, substitute
> "filter" for "API" at the top in two places.  The remaining difference
> is that the bottom specifies Unicode for the internal encoding (the
> intermediate conversion to UTF-8 is not significant).

 Well, actually I think this is the big point.
If you use an API inside the program, the program can find out what the
original encoding was, but a program behind a filter never knows the
original encoding.  This sometimes causes information loss.  For example,
Russian characters are 2 columns wide in EUC-JP, but 1 column wide in
Unicode.  If the program knows the original encoding, it can correct
that information.
#Well, all the encoding work is done by the library.

 You may say that is a font problem; well, I'd like to say that too ;-).
But there is an API, wcwidth, which measures character width.
This API doesn't work well when the original encoding is unknown.
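
 To make the width example concrete, here is a minimal sketch in Python
(used only for brevity; the discussion itself is about the C API).  It
assumes that the Unicode East Asian Width property is a fair stand-in for
what a wcwidth-like function consults.  display_width and its
legacy_cjk_source parameter are hypothetical names invented for this
sketch, not an existing API, and combining or control characters are
ignored.

```python
import unicodedata

def display_width(ch, legacy_cjk_source=False):
    """Return a column width for one character (hypothetical helper).

    legacy_cjk_source is an assumption of this sketch: it stands for
    "the program knows the text came from EUC-JP or a similar encoding".
    """
    eaw = unicodedata.east_asian_width(ch)
    if eaw in ("W", "F"):
        return 2                      # wide or fullwidth: always 2 columns
    if eaw == "A":
        # Ambiguous: the answer depends on where the text came from.
        return 2 if legacy_cjk_source else 1
    return 1

de = "\u0414"                         # CYRILLIC CAPITAL LETTER DE

# Unicode classifies the character as "A" (ambiguous width).
ambiguity = unicodedata.east_asian_width(de)

# In EUC-JP the same character is a double-byte character.
eucjp_len = len(de.encode("euc_jp"))

width_via_filter = display_width(de)                       # origin unknown
width_via_api = display_width(de, legacy_cjk_source=True)  # origin known
```

 The two calls at the end differ only in whether the origin of the text
is known, which is the information a filter in front of the program
throws away.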

 The other example is GB18030.  GB18030 <-> Unicode conversion
can be done algorithmically, but some GB18030 code points map into the
private use area.  So if the conversion is done outside the program, the
program has no idea of the properties of those private-use code points.
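
 A small sketch of that effect, again in Python and only as an
illustration: the code point below is an arbitrary Private Use Area value
picked for this example, not one taken from the discussion.  It assumes
Python's bundled gb18030 codec and unicodedata module.

```python
import unicodedata

pua = "\ue4c6"                        # a Private Use Area code point (arbitrary choice)

# GB18030 covers every Unicode code point, so the round trip is lossless.
gb_bytes = pua.encode("gb18030")
round_trip = gb_bytes.decode("gb18030")

# But once only the Unicode value is left, the character database can say
# nothing useful: the category is "Co" (private use) and there is no name.
category = unicodedata.category(round_trip)
name = unicodedata.name(round_trip, None)
```

 The conversion itself succeeds; what is missing afterwards is any
property information, unless the program knows the text was GB18030.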

> Starting from my statements about shells and scripting languages, I
> cannot say "substitute API for filter in the bottom".  Because, I
> didn't describe an API.  However, for scripts there is iconv(1), and
> for C programs there is iconv(3) (and the gconv family on GNU systems).
> This can pretty easily be elaborated into an API.  Of course you still
> need all the LC_* locale variables (even LC_CTYPE and LC_MESSAGES),
> but now you can ignore the .encoding portion, and life is a lot easier.

 About scripting languages, as I said, I agree.

 About the shell, I disagree.  I understand the demand that the shell
not be complex.  But if you have a CSI shell, you can switch that off by
setting the locale with 'LANG=C'.  If UTF-8 is hard-coded, you have to
input UTF-8 strings.

 Well, if you say "the shell must just be 8-bit clean", that's acceptable.
But users want to delete a character with Backspace, I think.
And still, you can switch it off with LANG=C, even if it is CSI.
#Well, CSI is more complex than 8-bit clean, I have to agree ;-)

> Why does the internal encoding "have to be" Unicode?  It doesn't (Mule
> is an example of a non-Unicode internal encoding, TRON code another).
> But clearly you don't want it to be modal, and you do want it to be
> universal, so that you can actually look at it in the debugger and
> stuff.  So why not have it be the international standard?

 wchar_t is an international standard as well.  And the encoding of
wchar_t is not defined, which means you don't need to care about encodings.
You can obtain a wchar_t by just calling mbtowc(3) or other mb* functions.

 I think what a program needs is character properties, not the encoding.
You can access the character properties through the wc* functions.
I think this abstraction is good enough to handle any external codeset.
#I have to confess that we need more APIs to access properties ;-)
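
 As a rough analogy for this model, sketched in Python rather than C:
decoding plays the role of mbtowc(3), and the property queries play the
role of the wc* functions.  Python's str is Unicode internally, so this
only imitates the encoding-agnostic wchar_t idea; char_properties is a
hypothetical helper written for this sketch.

```python
import unicodedata

def char_properties(data, encoding):
    """Decode bytes, then describe each character by its properties only."""
    text = data.decode(encoding)      # plays the role of mbtowc(3)
    return [
        (ch.isalpha(), ch.isdigit(), unicodedata.east_asian_width(ch))
        for ch in text
    ]

sample = "a1\u3042"                   # 'a', '1', HIRAGANA LETTER A

# The same text arriving in two different external codesets.
from_eucjp = char_properties(sample.encode("euc_jp"), "euc_jp")
from_utf8 = char_properties(sample.encode("utf-8"), "utf-8")
```

 Both calls yield the same list of properties, so code written against
the properties never has to look at the external codeset.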

 iconv(3) is unfortunately not portable enough, but that's a different topic.
Jiro SEKIBA | Web tools & AP Linux Competency Center, YSL, IBM Japan
            | email:,
