Re: [tlug] Emacs IME, locale, encodings, R, aarrrrgggghhhh!!!!

Date: Tue, 9 Mar 2021 15:37:16 +0900
From: "Stephen J. Turnbull" <turnbull.stephen.fw@example.com>
Subject: Re: [tlug] Emacs IME, locale, encodings, R, aarrrrgggghhhh!!!!
References: <d6b21c8964fcf607f447aa99898ca59fc19c8ae3.camel@uchicago.edu> <24645.55562.460201.880422@turnbull.sk.tsukuba.ac.jp> <00fcee6526a6c8c360532a56c23ae57adc25fc5c.camel@uchicago.edu>

Stuart Luppescu writes:
 > On Mon, 2021-03-08 at 16:58 +0900, Stephen J. Turnbull wrote:

 > > In a fresh Emacs, try M-x setenv RET LC_CTYPE RET ja_JP.UTF-8 RET, and
 > > M-: (setq default-process-coding-system 'utf-8) RET,
 > > 
 > > then run R and try the program.
 > 
 > Didn't change anything.

OK.

In the inferior R started by Emacs, what are the values of the
environment variables LC_ALL, LC_CTYPE, and LANG?
https://stat.ethz.ch/R-manual/R-devel/library/base/html/Sys.getenv.html
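
For reference, the reason for asking about all three variables is the
usual POSIX precedence: LC_ALL overrides LC_CTYPE, which overrides
LANG.  The Python sketch below illustrates only that rule.  It is not
what R or Emacs actually runs, and the function name and the example
environment are made up for illustration.

```python
def effective_ctype(env):
    """Return the locale name that governs character encoding.

    Assumed POSIX precedence: LC_ALL, then LC_CTYPE, then LANG.
    Unset or empty variables are skipped; the fallback is "C".
    """
    for name in ("LC_ALL", "LC_CTYPE", "LANG"):
        value = env.get(name)
        if value:
            return value
    return "C"


# Hypothetical environment: LANG asks for UTF-8, but LC_ALL overrides it.
example_env = {"LANG": "ja_JP.UTF-8", "LC_ALL": "C"}
print(effective_ctype(example_env))
```

So a perfectly good LANG can be masked by a stray LC_ALL or LC_CTYPE,
which is why the value of each one matters.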

 > I don't know. My regular terminal (wterm) I know does very poorly with
 > non-latin characters, so I installed rxvt-unicode (urxvt). It did not
 > seem any different from wterm. :shrug:

Is wterm still maintained?  The SourceForge page is dated 2013.

This:

 > > Does
 > >     echo 平屋 どこかのマンション 湯河原マンション 熱海マンション
 > > do the right thing?
 > 
 > Nope. When I pasted that in, I get 
 >  echo ?? ????????? ???????? ???????  
 > ?? env_comp~ env_comp ???????

strongly suggests that the *terms aren't finding the fonts they
expect.  The question mark counts match the Japanese character counts
(2, 9, 8, and 7), so it appears the text is being understood as UTF-8
but can't be displayed.
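
The counting argument can be checked with a small Python sketch.  It
assumes the pasted words are exactly the ones in the echo command
above, and it rebuilds the question mark runs by length (2, 9, 8, 7)
rather than copying them from the terminal.  If the text were being
handled byte by byte, the runs would be three times as long, since
each of these characters takes three bytes in UTF-8.

```python
words = ["平屋", "どこかのマンション", "湯河原マンション", "熱海マンション"]
# Lengths of the "?" runs in the terminal output quoted above.
marks = ["?" * 2, "?" * 9, "?" * 8, "?" * 7]

for word, mark in zip(words, marks):
    chars = len(word)                        # Unicode code points
    utf8_bytes = len(word.encode("utf-8"))   # bytes on the wire
    print(word, chars, utf8_bytes, len(mark))
```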

This doesn't explain why things are weird in Emacs's inferior R
process, though.

 > > the value of Emacs's default-process-coding-system

 > this says utf-8

So Emacs *should* be sending UTF-8 to the inferior process, unless ESS
is setting the process-specific coding system differently (which seems
unlikely).

 > I would send you the program but it's a dumb little thing,

I'm not particularly interested in the program; I'm interested in the
embedded strings. :-)  If you cut and paste them, that process might
mess up the strings.  But I'm pretty sure at this point that the
problem isn't the strings -- with the exception of the first cut and
paste, they all seem to have been correctly encoded as UTF-8 to begin
with.  Even the question marks!  So I think it's the locales that the
various programs are running in.

Stuart Luppescu writes in another post
<8ac8892a83ff83a14bed7a5ea8fef60a1c5dfb01.camel@example.com>:

 > Then I tried running R in a new terminal, and copied and pasted
 > from the program *displayed in another terminal*. This time I got
 > this:
 > 
 > > print(house.names)
 > [1]
 > "name"               "å¹³å±\u008b"               "ã\u0081©ã\u0081\u0093

These are representations of valid UTF-8, interpreted as Latin-1 (most
likely).  They are almost certainly Japanese: checking the first one
gives "平屋", as expected.
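
That check can be reproduced with a minimal Python sketch, assuming
the misreading really was Latin-1 (ISO-8859-1) and that the first
string survived email intact: turn the garbled characters back into
bytes as Latin-1, then decode those bytes as UTF-8.

```python
# First garbled label as printed by R; \u008b is a C1 control character.
garbled = "å¹³å±\u008b"

# Undo the assumed misreading: Latin-1 maps each character to one byte.
raw = garbled.encode("latin-1")
print(raw.hex())

# Those bytes are valid UTF-8 for the original text.
recovered = raw.decode("utf-8")
print(recovered)
```

Going the other way, "平屋".encode("utf-8").decode("latin-1") gives
the garbled form back, which is the round trip you'd expect if a
UTF-8 byte stream is being read in a Latin-1 (or similar 8-bit)
locale.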

 > and the graph printed out with the labels in Japanese.

It appears to me that the program that Emacs is saving is properly
encoded in UTF-8, although it's very hard to be sure when data written
by Emacs is being massaged by R, rxvt, and email in transmission.

 > For some reason, emacs is messing with the encoding and the
 > handling of the Japanese strings.

I don't think so.  Some of the evidence is consistent with that, but
taken as a whole the evidence is pretty strong that Emacs is sending
the right, UTF-8-encoded text to files and to R, but that R and rxvt
are interpreting it incorrectly.  In the case of R in an Emacs
inferior process, the environment is set by Emacs, so that could be a
problem with Emacs.  I just don't think the problem is the text sent
by Emacs.

It's still *possible* that Emacs is sending the wrong thing to the R
in the inferior process, but I don't see why it would be doing that.

 > Also, it doesn't seem to matter what system locale is being used. It
 > seems to work as well (or as badly) if I set it to en_US.UTF-8 or to
 > ja_JP.UTF-8.

en vs. ja shouldn't matter here.  Only the encoding, UTF-8.  That's
why programmers should love Unicode -- it should make text encoding
issues moot (and will, *some*day ;-).  The other issues about
different languages are *much* harder to deal with.  Just consider
Japanese "era" dates! :-)

Regards,
Steve
