Re: [tlug] Re: Japanese in URLs?



Nguyen Vu Hung writes:
 > 2008/2/6, Jim Breen <jimbreen@example.com>:

 > > and I don't want the browser to play it back to me as an
 > > expletive in Klingon because it decided it was something in UTF-8.
 > > It's different, of course, if the field has an ACE prefix such as
 > > "xn--".

 > RFC2718[1] says the URL *should* be encoded after the character sequence
 > is translated to UTF-8.

No, it doesn't.  First off, RFCs are supposed to be about wire
protocols.  How browsers present data received from users or the wire
is basically off-limits to RFCs; that's really more a field for W3C
recommendations.  Second, RFC 2718 is an informational companion to RFC
2717 (how to register new URL schemes), and is not standards-track.
The appropriate references here would be to internationalized URLs,
cf. RFCs 3454, 3490-3492, 3743, 4290, 4690, and especially RFC 3987.

As far as I know, browsers which display anything but the hex-encoded
path are strictly speaking in violation of RFC 3987:

  6.2.  Software Interfaces and Protocols

  Although an IRI is defined as a sequence of characters, software
  interfaces for URIs typically function on sequences of octets or
  other kinds of code units.  Thus, software interfaces and protocols
  MUST define which character encoding is used.

because there is no provision in any URL scheme I know of for defining
the character encoding, with the exception of IDNA's "xn--" ACE prefix
which implies Punycode (RFC 3492).  (Note that RFC 3987 does *not*
define an ACE for the path portion of an IRI.  That means that there
is no in-band way of recognizing the ACE representation of an IRI.)[1]
Even there, RFC 3490 says:

6.1 Entry and display in applications

   [...]  ACE encoding is opaque and ugly, and should thus only be
   exposed to users who absolutely need it.  Because name labels
   encoded as ACE name labels can be rendered either as the encoded
   ASCII characters or the proper decoded characters, the application
   MAY have an option for the user to select the preferred method of
   display; if it does, rendering the ACE SHOULD NOT be the default.
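
The difference between the ACE form and the decoded form of a host
name can be sketched in Python.  This is only an illustration, not
how any particular browser does it: it relies on Python's built-in
"idna" codec (which implements the IDNA of RFC 3490), and the helper
names and the sample host name are made up for the example.

```python
# Sketch: ACE (Punycode) form vs. decoded form of an internationalized
# host name, using Python's built-in "idna" codec (RFC 3490).
# Helper names and the sample host are hypothetical.

ACE_PREFIX = "xn--"


def to_ace(hostname: str) -> str:
    """Convert a Unicode host name to its ASCII-compatible encoding."""
    return hostname.encode("idna").decode("ascii")


def to_display(hostname: str) -> str:
    """Decode ACE labels for display; plain ASCII labels pass through."""
    return hostname.encode("ascii").decode("idna")


def has_ace_label(hostname: str) -> bool:
    """True if any label carries the in-band "xn--" marker."""
    return any(label.lower().startswith(ACE_PREFIX)
               for label in hostname.split("."))


host = "日本語.jp"          # illustrative name only
ace = to_ace(host)
print(ace)                  # the form that goes on the wire
print(to_display(ace))      # the form a user would rather read
print(has_ace_label(ace), has_ace_label("example.com"))
```

The point is that the "xn--" prefix tells the software, in-band, that
the label is Punycode, so decoding it for display involves no guessing.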

 > What Firefox is doing is not wrong, but personally I think the browser
 > should be able to display actual Japanese for better readability.

Only at the user's explicit request.

You know, there are three kinds of people (very loosely speaking) who
still put non-ASCII into mail headers (i.e., without encoding as
MIME-words): spammers, Russians, and Japanese.  I think it's really
sad that the real humans are classed with those haploid spammers!
That's because Japanese (and Russian) programmers arrogantly decided
that they didn't need I18N and could just detect the encoding of
everything by checking which of their encodings a string fits.

Browsers that detect encodings in URLs are making the same mistake and
are in violation of the section of RFC 3987 quoted above.
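
To make the problem concrete, here is a small Python sketch of why the
hex-encoded path alone does not say which character encoding produced
it.  It is an illustration under assumptions, not a description of any
browser: the sample path and the helper decode_path() are made up, and
only urllib.parse from the standard library is used.

```python
# Sketch: the same path percent-encoded from two different character
# encodings.  Nothing in the resulting URL says which one was used.
from urllib.parse import quote, unquote_to_bytes

path = "/日本語"  # illustrative path only

as_utf8 = quote(path, encoding="utf-8")
as_sjis = quote(path, encoding="shift_jis")
print(as_utf8)
print(as_sjis)


def decode_path(escaped: str, encoding: str) -> str:
    """Undo the percent-escapes, then decode strictly as `encoding`."""
    raw = unquote_to_bytes(escaped)
    return raw.decode(encoding)  # raises UnicodeDecodeError on a mismatch


# Decoding works only if the receiver already knows the encoding.
print(decode_path(as_sjis, "shift_jis"))

# A receiver that assumes UTF-8 cannot decode the Shift_JIS octets.
try:
    decode_path(as_sjis, "utf-8")
except UnicodeDecodeError as exc:
    print("not valid UTF-8:", exc.reason)
```

Here the mismatch happens to raise an error; a detector that tries
several encodings in turn can just as easily settle on one that decodes
without error but is still the wrong one.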


Footnotes:
[1]  ACE means "ASCII-compatible encoding" and is defined in RFC 3490,
but probably elsewhere as well since it lists many other ACE prefixes
that have been defined.

