TLUG Mailing List

Mailing List Archive

Re: [tlug] "UTF-8 & ISO-2022-JP"

Date: Tue, 06 Dec 2005 13:50:10 +0900

From: "Lyle (Hiroshi) Saxon" <ronfaxon@example.com>

Subject: Re: [tlug] "UTF-8 & ISO-2022-JP"

References: <4393C9A2.7000103@example.com>

Organization: Images Through Glass

User-agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.7.8) Gecko/20050511

I haven't cross-posted anything else, but I sent a nearly identical letter about UTF-8 & ISO-2022-JP to a general site to see what people might say. Most of the responses were junk, but the following one seems interesting... I don't personally know how accurate it is... any comments? And, um... I certainly hope this doesn't start any arguments. I'm just posting it in my quest to better understand the on-the-ground situation with getting text through the wires...
Lyle

[Comment from general list]
"There are two issues with email media: the transfer encoding (i.e. 7bit, quoted-printable or base64) and the character set label (ascii, shift-jis, etc). The former is vital for establishing a reliable transmission, whereas the latter is just a convenient label intended to be helpful to humans, kind of like a file name or description, but not particularly binding.

Any competent software which recognizes multiple character sets must either discover the applicable set heuristically, or ask the user. That's an issue with the software, but completely unrelated to email. It sucks that you have this issue, but what needs to be done is pester the software vendor to improve automatic detection.

A mailer must ensure that the message is transfer encoded in such a way as to preserve the information. Sending 8bit data is a no-no. If data is properly sent, then the bit stream at the receiver end is identical to the bit stream at the sender end, and data mangling is impossible. The mail responsibility then stops, and the viewer software is responsible for interpreting the data. When your software is properly set up, it should be giving the viewer the supplied hints such as the character set label.

In this diary, the solution of setting the character set explicitly does just that, but it's overkill because neither the sending software nor the receiving software is doing its job properly.

Unfortunately, the original mail standards used 7bit ascii data in the clear, which caused untold programmers who don't like to read to assume that the natural mail format is 8bit data without encoding.

Lyle should not have to do any of this. If he sends properly encoded data, then the receiving software ought to be able to guess from the binary stream; if he sends 8bit data, then no matter what character set he specifies it may end up corrupted and unusable at the other end.

My guess is when he sets the encoding explicitly as ISO-2022-JP, then his software knows that he's using a multibyte encoding, and therefore sends a base64 encoded packet, which arrives correctly. The receiver looks at the binary stream, probably ignoring the character set hint, and displays the result in something which is either ISO-2022-JP or close enough. When Lyle sends UTF-8, his software thinks it's ok to send 8bit data, which arrives possibly mangled or transformed, and the receiving software fails to detect and display the result properly. If the mailer was configured to always send base64, then there would be no need to worry about setting ISO-2022-JP/UTF-8 explicitly."
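
For anyone who wants to poke at the distinction the comment draws between the transfer encoding and the character set label, here is a minimal Python sketch. It is not from the comment or from any mailer; the sample text and the helper names (is_7bit, transfer_encode, transfer_decode) are made up for illustration, and it only uses the standard base64 module and the built-in codecs. It checks which encoded forms contain bytes above 0x7F, base64-wraps the UTF-8 bytes, and then decodes the received bytes under two different labels.

```python
import base64

# Sample text, chosen only for illustration (an assumption, not from the thread).
text = "日本語のテスト"


def is_7bit(data: bytes) -> bool:
    """Return True if every byte fits in 7 bits (value below 0x80)."""
    return all(b < 0x80 for b in data)


def transfer_encode(data: bytes) -> bytes:
    """Base64-encode a body so that only 7-bit ASCII goes over the wire."""
    return base64.encodebytes(data)


def transfer_decode(wire: bytes) -> bytes:
    """Undo the base64 transfer encoding at the receiving end."""
    return base64.decodebytes(wire)


# The same text under the two character sets discussed in the thread.
jis = text.encode("iso-2022-jp")
utf8 = text.encode("utf-8")

print("ISO-2022-JP bytes are 7bit clean:", is_7bit(jis))
print("UTF-8 bytes are 7bit clean:", is_7bit(utf8))

# Wrap the UTF-8 bytes in base64 and unwrap them again.
wire = transfer_encode(utf8)
received = transfer_decode(wire)
print("base64 wire form is 7bit clean:", is_7bit(wire))
print("bytes identical after round trip:", received == utf8)

# The transfer encoding only preserves bytes; turning them back into text
# still depends on which character set the viewer picks.
as_utf8 = received.decode("utf-8")
as_sjis = received.decode("shift_jis", errors="replace")
print("decoded as UTF-8 matches original:", as_utf8 == text)
print("decoded as Shift_JIS matches original:", as_sjis == text)
```

Running this on your own sample text should show whether the bytes survive the base64 round trip unchanged, and what happens when the viewer decodes them with a character set other than the one used to encode them. It says nothing about what any particular mailer actually does; that would need checking against the headers of a real message.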