
Mailing List Archive
[tlug] Re: WWW page charsets (was: font encoding question)
- Date: Mon, 18 Jun 2007 18:22:46 +1000
- From: "Jim Breen" <jimbreen@example.com>
- Subject: [tlug] Re: WWW page charsets (was: font encoding question)
[Changed the Subject, as this has nothing to do with *font* encoding
(whatever that may be...)]
steve smith <sjs@example.com> wrote:
> Brian Chandler wrote:
> > steven smith wrote:
> > Actually I believe it is simpler than this. If you have a webpage
> > encoded in UTF-8, you can (*) assume that the browser will return form
> > input values in the same encoding.
> Are you sure?
It's the default, and works almost all the time. The only times I have
encountered problems with that assumption were (a) in the ancient Mac
version of Netscape 2, where it was assumed that all Japanese pages
were only in Shift_JIS, and mojibaked everything else, and (b) the
lite browser in DoCoMo keitais, which also only allows Shift_JIS.
> From the discussion that's been going on in
> the "WWWJDIC backdoor issue" thread, I'm not sure it's that
> simple.
That discussion was actually about the innards of browser add-ons. The
problems that triggered that thread don't really involve the charsets used
in regular forms.
> This isn't quite the same since I'll be serving a
> form, but... somehow assuming always seems to get me in
> trouble. If someone can verify that a page will return text
> in the font it has been encoded in, I'd be delighted.
It's the usual behaviour. WWWJDIC works that way, and gets several
million uses a week. I've never had a complaint on that score.
> I'm still trying to wrap my mind around font-encoding and the
> issues involved.
You'll get your head a bit straighter if you stop calling them fonts. They
are characters, and we are talking about character sets (e.g. JIS X 0208
or Unicode) and character encoding/encapsulation systems (Shift_JIS, UTF-8, etc.)
In WWW/Internet-speak, these are often conflated into "charset".
Fonts, e.g. Mincho, Arial, Helvetica, etc., are different things.
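To make the character/encoding distinction concrete, here is a minimal Python sketch. It takes one character and shows the different byte sequences that several encodings produce for it. The helper name `encode_all` and the choice of codecs are just illustrative assumptions; the codec names are the ones in Python's standard library.

```python
# One character (a member of both JIS X 0208 and Unicode),
# represented as bytes under several encodings.
CODECS = ("shift_jis", "euc_jp", "iso2022_jp", "utf-8")

def encode_all(text: str) -> dict[str, bytes]:
    """Return the encoded bytes of `text` for each codec in CODECS."""
    return {codec: text.encode(codec) for codec in CODECS}

for codec, raw in encode_all("字").items():
    # The bytes differ per encoding, but each decodes back to the same character.
    print(f"{codec:12} {raw.hex(' ')}")
    assert raw.decode(codec) == "字"
```

The character is the same throughout; only its byte-level representation changes. None of this involves a font, which only decides how the character is drawn.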
> In any event, in the "backdoor" thread Stephen Turnbull
> pointed out this link:
> http://www.w3.org/TR/html401/interact/forms.html#adef-accept-charset
> so assuming that the browser pays attention, I plan on using
> both your suggestion, Stephen's, and then praying for the
> best :)
By all means use the "accept-charset" in the <FORM ...>, but be prepared
for some browsers ignoring it. Be safe, and assume the text is being
sent to the server in the coding/charset of the page in which the
form is embedded.
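On the server side, that advice might look something like the following Python sketch. It assumes the form page was served as UTF-8 and decodes the submitted data strictly, so that a browser which ignored the page charset causes an error rather than silent garbage. `PAGE_CHARSET` and `decode_form` are hypothetical names for illustration, and the Shift_JIS retry is just one possible fallback, not a recommendation.

```python
from urllib.parse import parse_qs, quote

# Assumption: the page containing the form was served as UTF-8.
PAGE_CHARSET = "utf-8"

def decode_form(body: str, page_charset: str = PAGE_CHARSET) -> dict[str, list[str]]:
    """Parse a urlencoded form body, assuming it uses the page's charset.

    errors="strict" makes a charset mismatch raise UnicodeDecodeError
    instead of quietly inserting replacement characters.
    """
    return parse_qs(body, encoding=page_charset, errors="strict")

# A browser that returned the form data in the page's charset:
good = "q=" + quote("辞書", encoding="utf-8")
print(decode_form(good))

# A browser that sent Shift_JIS regardless of the page's charset:
bad = "q=" + quote("辞書", encoding="shift_jis")
try:
    decode_form(bad)
except UnicodeDecodeError:
    print("not valid", PAGE_CHARSET, "- retrying as Shift_JIS")
    print(decode_form(bad, "shift_jis"))
```

Decoding strictly at least tells you when the assumption failed, which is more useful than discovering mojibake in the stored data later.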
> Like they said in that thread, one standard would be nice
> (though the context was a bit different). I have to make
> sure everything I send to friends in Japan is in ISO-2022 or
> they see 文字化け,
Correct.
> but most of the rest of the world (and I
> think this list usually) is utf-8.
Not so, although Unicode/UTF-8 is getting more common. I email
in ISO-2022-JP (or more correctly, I ask Gmail to use the default
charset for the text I am sending, and the email default for Japanese
is ISO-2022-JP.)
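For anyone wondering what actually goes wrong when the charsets don't match, here is a small Python sketch. It encodes a Japanese string as ISO-2022-JP and then reads the bytes back under the right and the wrong charset. The variable names are arbitrary, and this only shows the byte-level effect, not what any particular mail client does.

```python
text = "文字化け"

# ISO-2022-JP switches character sets with escape sequences and
# uses only 7-bit bytes.
wire = text.encode("iso2022_jp")
print(max(wire) < 0x80)

# Read with the charset the sender actually used: the text survives.
print(wire.decode("iso2022_jp"))

# Read as if it were UTF-8: no error is raised, because every byte is
# valid ASCII, but the result is escape sequences and punctuation.
print(repr(wire.decode("utf-8")))

# The reverse mistake: UTF-8 bytes read as if they were ISO-2022-JP.
print(text.encode("utf-8").decode("iso2022_jp", errors="replace"))
```

Neither wrong reading raises an error here, which is why this kind of mismatch tends to show up as garbage on screen rather than as a failure.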
> And then there's
> ISO-8859.
Yes, ISO-8859-1 is the default for WWW pages.
> For me, font encoding often seems to do the
> unexpected. Augh...
The character coding is exactly what you ask it to be. If
something unexpected pops up, you probably asked for the wrong
thing.
Cheers
Jim
--
Jim Breen
Honorary Senior Research Fellow
Clayton School of Information Technology,
Monash University, VIC 3800, Australia
http://www.csse.monash.edu.au/~jwb/