Mailing List Archive



[tlug] Re: WWW page charsets (was: font encoding question)



[Changed the Subject, as this has nothing to do with *font* encoding
(whatever that may be...)]

steve smith <sjs@example.com> wrote:
Brian Chandler wrote:
> steven smith wrote:

> Actually I believe it is simpler than this. If you have a webpage
> encoded in UTF-8, you can (*) assume that the browser will return form
> input values in the same encoding.

Are you sure?

It's the default, and works almost all the time. The only times I have encountered problems with that assumption were (a) the ancient Mac version of Netscape 2, which assumed that all Japanese pages were in Shift_JIS and mojibaked everything else, and (b) the lite browser in DoCoMo keitais, which also only allows Shift_JIS.

From the discussion that's been going on in
the "WWWJDIC backdoor issue" thread, I'm not sure it's that
simple.

That discussion was actually about the innards of browser add-ons. The problems that triggered that thread don't really involve the charsets used in regular forms.

This isn't quite the same since I'll be serving a
form, but...  somehow assuming always seems to get me in
trouble.  If someone can verify that a page will return text
in the font it has been encoded in, I'd be delighted.

It's the usual behaviour. WWWJDIC works that way, and gets several million uses a week. I've never had a complaint on that score.

I'm
still trying to wrap my mind around font-encoding and the
issues involved.

You'll get your head a bit straighter if you stop calling them fonts. They are characters, and we are talking about character sets (e.g. JIS X 0208 or Unicode) and character encoding/encapsulation systems (Shift_JIS, UTF-8, etc.). In WWW/Internet-speak, these are often conflated into "charset".

Fonts, e.g. Mincho, Arial, Helvetica, etc. are different things.
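
To make the character/encoding distinction concrete, here is a small Python sketch (the sample string and variable names are just illustrative choices). It takes the same two characters and shows that each encoding turns them into different bytes, and that reading bytes back with the wrong encoding is what gives you mojibake.

```python
# Two characters that exist in both JIS X 0208 and Unicode.
text = "日本"

# The same characters, serialised under three different encodings.
encoded = {
    name: text.encode(name)
    for name in ("shift_jis", "utf-8", "iso-2022-jp")
}

for name, data in encoded.items():
    # Print the raw bytes as hex so the differences are visible.
    print(name, data.hex())

# Decoding with the wrong encoding does not fail cleanly; it produces garbage.
mojibake = encoded["utf-8"].decode("shift_jis", errors="replace")
print(mojibake)
```

None of this involves a font anywhere: the font only decides how the decoded characters are drawn.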

In any event, in the "backdoor" thread Stephen Turnbull
pointed out this link:
http://www.w3.org/TR/html401/interact/forms.html#adef-accept-charset
so assuming that the browser pays attention, I plan on using
both your suggestion, Stephen's, and then praying for the
best :)

By all means use the "accept-charset" attribute in the <FORM ...> tag, but be prepared for some browsers ignoring it. Be safe, and assume the text is being sent to the server in the coding/charset of the page in which the form is embedded.
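
On the server side, that advice might look something like the sketch below. It is only a sketch under stated assumptions: PAGE_CHARSET, FALLBACKS and decode_form are hypothetical names of mine, the form is assumed to arrive as an ordinary URL-encoded string, and trying fallback charsets in order is a heuristic, not a guarantee. The idea is to decode strictly in the page's own charset first, and only then try anything else.

```python
from urllib.parse import parse_qs

PAGE_CHARSET = "utf-8"               # assumption: charset the form page was served in
FALLBACKS = ("shift_jis", "euc_jp")  # assumption: legacy charsets worth trying

def decode_form(body: str) -> dict:
    """Decode a URL-encoded form body, trying the page charset first."""
    for charset in (PAGE_CHARSET, *FALLBACKS):
        try:
            # errors="strict" makes a wrong guess raise instead of
            # silently inserting replacement characters.
            return parse_qs(body, encoding=charset, errors="strict")
        except UnicodeDecodeError:
            continue
    raise ValueError("form data is not valid in any expected charset")

# A browser that honoured the UTF-8 page:
print(decode_form("q=%E6%97%A5%E6%9C%AC"))
# A browser that sent Shift_JIS regardless:
print(decode_form("q=%93%FA%96%7B"))
```

Strict UTF-8 decoding rejects most Shift_JIS byte sequences, which is why trying the page charset first works reasonably well here; telling Shift_JIS from EUC-JP by trial decoding is much less reliable.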

Like they said in that thread, one standard would be nice
(though the context was a bit different).  I have to make
sure everything I send to friends in Japan is in ISO-2022 or
they see 文字化け,

Correct.

but most of the rest of the world (and I
think this list usually) is utf-8.

Not so, although Unicode/UTF-8 is getting more common. I email in ISO-2022-JP (or more correctly, I ask Gmail to use the default charset for the text I am sending, and the email default for Japanese is ISO-2022-JP.)
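
If you are generating mail from a script instead of a mail client, you can ask for the charset explicitly. The following is a minimal sketch using Python's standard email package; the message text is an arbitrary example, and this is not a description of what Gmail does internally.

```python
from email.message import EmailMessage

msg = EmailMessage()
# Ask for ISO-2022-JP explicitly instead of the library's UTF-8 default.
msg.set_content("こんにちは", charset="iso-2022-jp")

# The charset is declared in the Content-Type header ...
print(msg["Content-Type"])
# ... and because ISO-2022-JP uses only 7-bit bytes, no further
# transfer encoding (base64, quoted-printable) is needed.
print(msg["Content-Transfer-Encoding"])

# Reading the body back decodes it using the declared charset.
print(msg.get_content())
```

The recipient's mail program relies on that declared charset to decode the body, which is why a missing or wrong declaration shows up as 文字化け at the other end.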

And then there's
ISO-8859.

Yes, ISO-8859-1 is the default for WWW pages.

For me, font encoding often seems to do the
unexpected.  Augh...

The character coding is exactly what you ask it to be. If something unexpected pops up, you probably asked for the wrong thing.

Cheers

Jim
--
Jim Breen
Honorary Senior Research Fellow
Clayton School of Information Technology,
Monash University, VIC 3800, Australia
http://www.csse.monash.edu.au/~jwb/

