Re: [tlug] Re: Piping stderr?




At 26 Jun 2002 19:38:12 +0900,
Stephen J. Turnbull <stephen@example.com> wrote:
 
>> UTF-8 supports lots of characters used worldwide, but it is not
>> perfect at all.
> 
> Be concrete.  I don't know of any major missing character sets or
> characters (that aren't scheduled or proposed for addition).
> Admittedly there are political problems (such as the influential
> Nikkei minorities in Canada, Mexico, and Finland whose national
> character sets look remarkably like IBM kanji; and the "Ukrainian
> problem" where the Russians on the USSR standards committee didn't see
> fit to submit Cyrillic characters only used in Ukrainian).

 I'm not saying that major character sets or characters are missing;
I'm saying it's not perfect, which you agreed with.

> In any case, either way effort has to be made to support those
> characters internally.  Why not devote that effort to getting them
> into Unicode, then subclassing Unicode to handle any special
> properties they have?

 Why should we have to devote effort to getting them into Unicode?
Unicode is not THE codeset, but ONE OF the codesets.
It just happens to have lots of characters, that's all.

>> Less burden I think.
> 
> For the programmer, when it works.  Consoles, shells, and scripts
> should not depend on such complexity, because when (_not if_) it
> breaks, it can take the whole system down.

 Which complexity?  As for scripts, I agree: since a script has its own
environment, it could be a good idea to make it independent of the
system locale.

 I don't understand what you mean by 'when', but it will just
automatically fall back to the 'C' locale and continue working.  If it
can't fall back to "C", the C library is broken, which means the whole
system is already down ;-).
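
 To make the fallback concrete, here is a minimal sketch in Python (my
choice for illustration; the C library's setlocale() is what actually
matters here).  set_locale_or_c is a hypothetical helper name, and the
explicit fallback call is an assumption about how a program might do
it, not a description of what any particular libc does internally.

```python
import locale


def set_locale_or_c(name: str) -> str:
    """Try to switch to the requested locale; fall back to "C" on failure.

    Returns the name of the locale that is in effect afterwards.
    """
    try:
        return locale.setlocale(locale.LC_ALL, name)
    except locale.Error:
        # The requested locale is not installed or not valid.
        # The "C" locale is always available, so the program keeps working.
        return locale.setlocale(locale.LC_ALL, "C")


if __name__ == "__main__":
    # An empty string means "use whatever the environment (LANG, LC_*) says".
    print(set_locale_or_c(""))
```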

> Also, in case you haven't noticed, the Internet and information
> systems generally have become a decidedly more hostile environment.
> Did you know that UTF-8 was respecified in Unicode 3.1 _for security
> reasons_?  How does CSI I18N handle the security issues involved in
> delegating text handling to user-provided routines, etc?  My bet is
> "not at all".

 ???  What do you mean about the security of Unicode 3.1??
Do you mean this?

SECURITY
       The Unicode and UCS standards require that producers of
       UTF-8 shall use the shortest form possible, e.g., producing
       a two-byte sequence with first byte 0xc0 is non-conforming.
       Unicode 3.1 has added the requirement that conforming
       programs must not accept non-shortest forms in their input.
       This is for security reasons: if user input is checked for
       possible security violations, a program might check only
       for the ASCII version of "/../" or ";" or NUL and overlook
       that there are many non-ASCII ways to represent these
       things in a non-shortest UTF-8 encoding.

 If so, this IS an issue with programs that hard-code UTF-8.  If you
have ten programs with UTF-8 hard-coded, you have to fix each program.
With a CSI design, on the other hand, you just fix the library.  The
programs don't need to be modified at all.
#Even if this is not what you meant, it shows the downside of
#hard-coding UTF-8 in programs.
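
 For reference, the non-shortest-form problem quoted above can be
sketched like this.  The two function names are hypothetical and the
code only covers 2-byte sequences; it is an illustration of the check,
not anybody's real decoder.  The byte pair 0xC0 0xAF is an overlong
spelling of "/" (0x2F).

```python
def naive_decode_2byte(seq: bytes) -> int:
    """Decode a 2-byte UTF-8 sequence WITHOUT a shortest-form check."""
    lead, trail = seq
    if (lead & 0xE0) != 0xC0 or (trail & 0xC0) != 0x80:
        raise ValueError("not a 2-byte UTF-8 sequence")
    return ((lead & 0x1F) << 6) | (trail & 0x3F)


def strict_decode_2byte(seq: bytes) -> int:
    """Same as above, but reject non-shortest forms as Unicode 3.1 requires."""
    code_point = naive_decode_2byte(seq)
    if code_point < 0x80:
        # Anything below 0x80 fits in one byte, so two bytes is overlong.
        raise ValueError("non-shortest UTF-8 form")
    return code_point


if __name__ == "__main__":
    overlong_slash = b"\xc0\xaf"
    # The naive decoder quietly yields "/", bypassing an ASCII-only filter.
    print(hex(naive_decode_2byte(overlong_slash)))
    try:
        strict_decode_2byte(overlong_slash)
    except ValueError as err:
        print("rejected:", err)
```

 The point for this thread is only where such a check lives: in each
program that decodes UTF-8 by hand, or once in a shared library.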

 If that is not what you meant, please give me a pointer ;-).

>> But filter is not always perfect.  SJIS can't round trip
>> UTF-8 (e.g 0x5C) as you know.  It's like, you get home and
>> take the shoes off, later you try to get out with the same
>> shoes, but left shoe is stolen ;-).
> 
> Since when?  Since Unicode includes all characters in JIS, that means
> Shift JIS can't round trip JIS, either.  Wouldn't surprise me, but as
> far as I know that's not true.  You just have to use the right mapping.

 Ah, the SJIS handled by glibc can round trip UCS-4, sorry.
#In other words, glibc only handles that range.
##This is the Windows case, but it does exist ;-)
##http://support.microsoft.com/default.aspx?scid=%2Fisapi%2Fgomscom%2Easp%3Ftarget%3D%2Fjapan%2Fsupport%2Fkb%2Farticles%2Fjp170%2F5%2F59%2Easp&LN=JA
###BTW I don't care much about this Windows issue ;-p, it's just an example.
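
 The kind of round-trip loss I had in mind with 0x5C looks roughly like
this.  The two tables below are made up for illustration and contain a
single byte each; they are NOT the tables of glibc, Windows, or any
other real converter.  The only facts assumed are that JIS X 0201
defines 0x5C as YEN SIGN while many mappings treat it as the ASCII
backslash.

```python
# Hypothetical encoder table: both characters end up on byte 0x5C.
UNICODE_TO_SJIS = {
    "\u005c": 0x5C,  # REVERSE SOLIDUS (backslash)
    "\u00a5": 0x5C,  # YEN SIGN
}

# Hypothetical decoder table: byte 0x5C can only come back as one character.
SJIS_TO_UNICODE = {
    0x5C: "\u005c",
}


def round_trips(ch: str) -> bool:
    """Return True if Unicode -> SJIS -> Unicode gives the same character."""
    return SJIS_TO_UNICODE[UNICODE_TO_SJIS[ch]] == ch


if __name__ == "__main__":
    for ch in ("\u005c", "\u00a5"):
        print(f"U+{ord(ch):04X} round trips: {round_trips(ch)}")
```

 With a many-to-one mapping like this, one of the two characters is
always the "stolen shoe", whichever way the decoder table is chosen.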

>> Moreover, in the future it is very possible that a codeset appears
>> which can't be mapped into UTF-8.
> 
> Mojikyo?  That's not a character set, that's a glyph set.  Not to
> mention that it's nonstandard and nastily proprietary (the UTF-2000
> people were forced to remove mojikyo support from their version of
> XEmacs).  And there is plenty of room for a thousand Mojikyos in
> UCS-4.  It won't be Unicode-conformant, but upward compatible.

 No, I'm just talking about the possibility.

> Other than that, there are no efforts I know of.  Again, be concrete.

 I'm NOT talking about a problem of NOW.  Who knows that it will never happen?

 So it's better to strip encoding-dependent code from programs than to
hard-code it.  Programs designed for CSI I18N support the UTF-8
codeset too, so you can use them as UTF-8 programs if you want.  And
it's easier to use an API than to interpret UTF-8 yourself and carry a
Unicode character property database inside the program.  What seems to
be the problem?
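
 A rough sketch of what I mean, in Python rather than C for brevity.
char_count is a hypothetical helper; the program body contains no
encoding-specific logic and simply hands the bytes to the library's
converter for whatever codeset is in effect.

```python
import locale
from typing import Optional


def char_count(data: bytes, codeset: Optional[str] = None) -> int:
    """Count characters in data without hard-coding any encoding.

    If no codeset is given, ask the locale which one is in effect.
    """
    if codeset is None:
        codeset = locale.getpreferredencoding(False)
    # All encoding knowledge lives in the library's decoder.
    return len(data.decode(codeset))


if __name__ == "__main__":
    text = "\u65e5\u672c\u8a9e"  # three kanji
    for name in ("utf-8", "euc_jp", "shift_jis"):
        print(name, char_count(text.encode(name), name))
```

 The same function works as a "UTF-8 program" when the codeset is
UTF-8, which is all I am claiming above.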

-- 
Jiro SEKIBA | Web tools & AP Linux Competency Center, YSL, IBM Japan
            | email: jir@example.com, jir@example.com

