Mailing List Archive



Re: [tlug] "UTF-8 & ISO-2022-JP"



I haven't cross-posted anything else, but I sent a nearly identical letter about UTF-8 & ISO-2022-JP to a general site to see what people might say. Most of the responses were junk, but the following one seems interesting. I don't personally know how accurate it is. Any comments? I certainly hope this doesn't start any arguments; I'm just posting it in my quest to better understand the on-the-ground situation with getting text through the wires.

Lyle

[Comment from general list]
"There are two issues with email media: the transfer encoding (i.e., 7bit, quoted-printable or base64) and the character set label (ASCII, Shift-JIS, etc.). The former is vital for establishing a reliable transmission, whereas the latter is just a convenient label intended to be helpful to humans, kind of like a file name or description, but not particularly binding. Any competent software which recognizes multiple character sets must either discover the applicable set heuristically, or ask the user. That's an issue with the software, but completely unrelated to email. It sucks that you have this issue, but what needs to be done is to pester the software vendor to improve automatic detection.

A mailer must ensure that the message is transfer encoded in such a way as to preserve the information. Sending 8bit data is a no-no. If data is properly sent, then the bit stream at the receiver end is identical to the bit stream at the sender end, and data mangling is impossible. The mail responsibility then stops, and the viewer software is responsible for interpreting the data. When your software is properly set up, it should be giving the viewer the supplied hints such as the character set label. In this diary, the solution of setting the character set explicitly does just that, but it's overkill because neither the sending software nor the receiving software is doing its job properly.

Unfortunately, the original mail standards used 7bit ASCII data in the clear, which caused untold programmers who don't like to read to assume that the natural mail format is 8bit data without encoding. Lyle should not have to do any of this. If he sends properly encoded data, then the receiving software ought to be able to guess from the binary stream. If he sends 8bit data, then no matter what character set he specifies, it may end up corrupted and unusable at the other end.

My guess is when he sets the encoding explicitly as ISO-2022-JP, then his software knows that he's using a multibyte encoding, and therefore sends a base64 encoded packet, which arrives correctly. The receiver looks at the binary stream, probably ignoring the character set hint, and displays the result in something which is either ISO-2022-JP or close enough. When Lyle sends UTF-8, his software thinks it's ok to send 8bit data, which arrives possibly mangled or transformed, and the receiving software fails to detect and display the result properly. If the mailer were configured to always send base64, then there would be no need to worry about setting ISO-2022-JP/UTF-8 explicitly."
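
For anyone who wants to poke at the comment's claims, here is a minimal Python sketch (not from the commenter, and not a model of any real mailer). It checks which byte streams stay within 7 bits and what happens when a hop clears the eighth bit of every byte. The sample text and the helper names `is_7bit` and `strip_high_bit` are made up for illustration, and the "lossy hop" is only an assumption about one way 8bit data could get mangled in transit.

```python
import base64

# Sample text, chosen only for illustration.
text = "日本語のテスト"

def is_7bit(data: bytes) -> bool:
    """Return True if no byte has the high (8th) bit set."""
    return all(b < 0x80 for b in data)

def strip_high_bit(data: bytes) -> bytes:
    """Hypothetical lossy hop: clear the 8th bit of every byte."""
    return bytes(b & 0x7F for b in data)

iso = text.encode("iso2022_jp")       # escape sequences plus 7-bit bytes
utf8 = text.encode("utf-8")           # multibyte, uses bytes >= 0x80
utf8_b64 = base64.b64encode(utf8)     # UTF-8 wrapped in base64 (ASCII only)

print("ISO-2022-JP is 7bit:  ", is_7bit(iso))
print("raw UTF-8 is 7bit:    ", is_7bit(utf8))
print("base64 UTF-8 is 7bit: ", is_7bit(utf8_b64))

# Push each stream through the hypothetical lossy hop and see what survives.
print("ISO-2022-JP survives: ", strip_high_bit(iso).decode("iso2022_jp") == text)
print("raw UTF-8 survives:   ", strip_high_bit(utf8) == utf8)
print("base64 UTF-8 survives:",
      base64.b64decode(strip_high_bit(utf8_b64)).decode("utf-8") == text)
```

Under these assumptions, the ISO-2022-JP bytes are already 7-bit clean on their own, so they would not strictly need base64 to get through a 7-bit path, whereas raw UTF-8 does need a transfer encoding such as base64 or quoted-printable. That is only what the byte streams look like; it says nothing about what a particular mail client actually chooses to send.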



