
Re: [tlug] "UTF-8 & ISO-2022-JP"
- Date: Tue, 06 Dec 2005 13:50:10 +0900
- From: "Lyle (Hiroshi) Saxon" <ronfaxon@example.com>
- Subject: Re: [tlug] "UTF-8 & ISO-2022-JP"
- References: <4393C9A2.7000103@example.com>
- Organization: Images Through Glass
- User-agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.7.8) Gecko/20050511
I haven't cross-posted anything else, but I sent a nearly identical
letter about UTF-8 & ISO-2022-JP to a general site to see what people
might say. Most of the responses were junk, but the following one seems
interesting... I don't personally know how accurate it is... any
comments? And, um... I certainly hope this doesn't start any
arguments. I'm just posting it in my quest to better understand the
on-the-ground situation with getting text through the wires......
Lyle
[Comment from general list]
"There are two issues with email media: the transfer encoding (i.e.,
7bit, quoted-printable or base64) and the character set label (ASCII,
Shift_JIS, etc.). The former is vital for establishing a reliable
transmission, whereas the latter is just a convenient label intended to
be helpful to humans, kind of like a file name or description, but not
particularly binding.
Any competent software which recognizes multiple character sets
must either discover the applicable set heuristically, or ask the user.
That's an issue with the software, but completely unrelated to email. It
sucks that you have this issue, but what needs to be done is pester the
software vendor to improve automatic detection.
A mailer must ensure that the message is transfer encoded in such a
way as to preserve the information. Sending 8bit data is a no-no. If
data is properly sent, then the bit stream at the receiver end is
identical to the bit stream at the sender end, and data mangling is
impossible. The mail responsibility then stops, and the viewer software
is responsible for interpreting the data. When your software is properly
set up, it should be giving the viewer the supplied hints such as the
character set label.
In this diary, the solution of setting the character set explicitly
does just that, but it's overkill, needed only because neither the
sending software nor the receiving software is doing its job properly.
Unfortunately, the original mail standards used 7-bit ASCII data in
the clear, which caused untold programmers who don't like to read to
assume that the natural mail format is 8-bit data without encoding.
Lyle should not have to do any of this. If he sends properly
encoded data, then the receiving software ought to be able to guess from
the binary stream. If he sends 8-bit data, then no matter what character
set he specifies, it may end up corrupted and unusable at the other end.
My guess is when he sets the encoding explicitly as ISO-2022-JP, then
his software knows that he's using a multibyte encoding, and therefore
sends a base64 encoded packet, which arrives correctly. The receiver
looks at the binary stream, probably ignoring the character set hint,
and displays the result in something which is either ISO-2022-JP or
close enough. When Lyle sends UTF-8, his software thinks it's ok to send
8bit data, which arrives possibly mangled or transformed, and the
receiving software fails to detect and display the result properly. If
the mailer was configured to always send base64, then there would be no
need to worry about setting ISO-2022-JP/UTF-8 explicitly."
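[Illustrative sketch, not part of the quoted comment]
To make the transfer-encoding vs. character-set distinction above a bit
more concrete, here is a minimal Python sketch using only the standard
library. The sample string and the helper names (is_7bit,
roundtrip_base64) are assumptions made up for illustration; this says
nothing about what any particular mailer actually does.

```python
import base64

# Arbitrary sample text, chosen only for illustration.
text = "日本語のテスト"

# The same characters in two character sets.
jis = text.encode("iso-2022-jp")   # escape sequences plus 7-bit bytes
utf8 = text.encode("utf-8")        # multibyte sequences with the high bit set


def is_7bit(data: bytes) -> bool:
    """Return True if every byte fits in 7 bits (safe for a 7bit transfer)."""
    return all(b < 0x80 for b in data)


def roundtrip_base64(data: bytes) -> bytes:
    """Encode bytes as base64 (pure ASCII on the wire) and decode them again."""
    wire = base64.b64encode(data)
    assert is_7bit(wire)  # the wire form never needs an 8-bit clean path
    return base64.b64decode(wire)


# ISO-2022-JP needs no transfer encoding to survive a 7-bit path.
print("ISO-2022-JP is 7-bit:", is_7bit(jis))

# UTF-8 Japanese text does, unless every hop is 8-bit clean.
print("UTF-8 is 7-bit:", is_7bit(utf8))

# After base64, the receiver gets back exactly the bytes that were sent.
print("base64 round trip is lossless:", roundtrip_base64(utf8) == utf8)

# Identical bytes still read differently under a different character set,
# so the receiver has to use the label or guess.
print("read as ISO-2022-JP:", jis.decode("iso-2022-jp") == text)
print("read as ASCII:", jis.decode("ascii") == text)
```

The last two lines are the part worth noting: the transfer encoding only
guarantees that the bytes arrive intact, so the receiving software still
has to pick a character set, either from the label or by guessing.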