


Re: [tlug] Emacs IME, locale, encodings, R, aarrrrgggghhhh!!!!



Curt J. Sampson writes:

 > I've never quite understood the appeal of Bash in an Emacs window
 > in a tmux window in an X11 window. :-P)

I don't either:

$ ls -l /usr/bin/xemacs
lrwxr-xr-x  1 steve  staff  6 Mar 15 01:34 /usr/bin/xemacs -> /sbin/init

;-)

 > `xxd` or another hexdump program may be handy [to check UTF-8-ness].

Sure, but I already know whether what's in Stuart's email is UTF-8 or
not from looking at the Latin-1.  I suppose it might be easier for
*Stuart* to learn the basics[1] using hex rather than Latin-1 (where
"you're lost in a string of twisty accented vowels all alike"), but
it's even easier to just ask someone who's lived that nightmare.  What
I don't know is where in the pipeline of buffers between his Emacs and
my XEmacs things are getting munged.
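
For anyone who wants to see what the hex view and the Latin-1 view of the same UTF-8 bytes look like, here is a minimal Python sketch. The function name describe_utf8_bytes and the one-character sample are mine, chosen only for illustration; the byte classes are the standard UTF-8 lead and continuation ranges, simplified (it does not reject invalid lead bytes such as 0xF8 and above).

```python
def describe_utf8_bytes(data: bytes) -> list:
    """Label each byte by the role it would play in well-formed UTF-8."""
    labels = []
    for b in data:
        if b < 0x80:
            kind = "ascii"          # 0xxxxxxx
        elif b < 0xC0:
            kind = "continuation"   # 10xxxxxx
        elif b < 0xE0:
            kind = "lead-2"         # 110xxxxx, starts a 2-byte sequence
        elif b < 0xF0:
            kind = "lead-3"         # 1110xxxx, starts a 3-byte sequence
        else:
            kind = "lead-4"         # simplification: 0xF8+ is not checked
        labels.append(f"{b:02x} {kind}")
    return labels


raw = "日".encode("utf-8")           # illustrative sample: one kanji, three bytes
hex_view = describe_utf8_bytes(raw)
# The same bytes misread as Latin-1: an accented letter, a C1 control
# character, and a currency sign, one character per byte.
latin1_view = raw.decode("latin-1")
```

A lead-3 byte followed by exactly two continuation bytes is the pattern to look for in Japanese text; in the Latin-1 view the same pattern shows up as runs of three characters starting with an accented letter in the 0xE0-0xEF range.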

I suggested Stuart's program rather than "echo 'これは日本語です。'"
because we already know that that has various results for different
output media, and doesn't require him to do things that he might
interpret differently from what I think I'm asking him to do.

 > > Only the encoding, UTF-8.  That's why programmers should love
 > > Unicode -- it should make text encoding issues moot (and will,
 > > *some*day ;-).

 > Well, yes, it does once you've dealt with UTF-8 vs. UTF-16 vs. unencoded
 > UCS-2, big- vs. little-endian UTF-16/UCS-2, the presence or not of byte
 > order markers....

These are all basically trivial to autodetect on the assumption that
it should be human-readable text, though -- you don't even really need
to know which language.  (Big- vs. little-endian UTF-16 requires
statistical analysis, but rarely very much data.)  But I doubt even
Google does a good job on distinguishing ISO-8859-1 vs. ISO-8859-15,
ISO-8859-2 vs. ISO-8859-16, or among Japanese corporate versions of
JIS (whether encoded as ISO-2022-JP, Shift JIS, or EUC-JP).  Those
issues however are moot with Unicode.
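
A rough Python sketch of the kind of autodetection meant above, under some loud assumptions: the function name guess_unicode_encoding and its return labels are hypothetical, UTF-32 is ignored entirely, and the count of zero bytes at even vs. odd offsets is only a crude stand-in for real statistical analysis (it works when the text contains at least some ASCII-range characters such as spaces, newlines, or digits).

```python
import codecs


def guess_unicode_encoding(data: bytes) -> str:
    """Guess the Unicode encoding of bytes assumed to be human-readable text."""
    # 1. A byte order mark, when present, settles the question.
    if data.startswith(codecs.BOM_UTF8):
        return "utf-8-sig"
    if data.startswith(codecs.BOM_UTF16_LE) or data.startswith(codecs.BOM_UTF16_BE):
        return "utf-16"   # Python's "utf-16" codec reads the BOM itself

    # 2. Plain text in UTF-8 essentially never contains NUL bytes, so their
    #    presence suggests 16-bit code units. ASCII-range characters put
    #    the zero byte first in big-endian and second in little-endian.
    if b"\x00" in data:
        zeros_at_even = data[0::2].count(b"\x00")
        zeros_at_odd = data[1::2].count(b"\x00")
        return "utf-16-be" if zeros_at_even > zeros_at_odd else "utf-16-le"

    # 3. Strict UTF-8 decoding rarely succeeds by accident on text in a
    #    legacy 8-bit encoding that uses any non-ASCII characters.
    try:
        data.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        return "unknown"
```

Note what falls into "unknown": that is exactly where the ISO-8859-x and JIS-variant guessing would have to start, and no amount of byte inspection like the above helps there.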

In actual practice, I'm pretty sure that octets are not going away any
time soon, so UTF-8 will (eventually) be universally used for all
exchange of text in IPC: there's no good reason to encode a substream
of text in anything else.[2]  The widechar versions are going to be
irrelevant unless you're implementing a programming language designed
for very precise and efficient implementations of text processing.
Anything that non-systems-programmers will get on their terminals via
stdout will be UTF-8.


Footnotes: 
[1]  I'm assuming he doesn't already know how UTF-8 is encoded, and
that might be somewhat rude (in which case I apologize), but I'm
pretty sure if he did he would have commented on the output he posted.

[2]  Except maybe on a single Windows host or in Java core dumps, but
that's Microsoft's or Oracle's problem, not mine.



