Mailing List Archive

Re: [tlug] Website Question(s)

>>>>> "Lyle" == Lyle Saxon <Lyle> writes:

    Lyle> The language angle is one thing I've been wondering about -
    Lyle> the person having trouble with the links is in Portugal....

Not a problem.  It can only be a problem in a language which uses
7-bit codes in a way incompatible with ASCII.  That means JIS Roman,
ancient European codes for non-Romance languages (cf Bjarne's "The C++
Programming Language" 1st ed, where he discusses ANSI trigraphs), and
multibyte 7-bit ISO 2022 codes (in practice, used only by Japanese and
Koreans).  (EBCDIC doesn't really count, here.)

[It always amazes me; every time you find a really bogus standard (or,
to be kinder, one that was written for an environment where the Intel
8008 was an "advanced single-chip microprocessor" and scratchpad
memory was implemented with paper), it turns out that the Japanese
have one just like it, and it's still in occasional use in 2005.]

    Lyle> <>

It's weird that that second form works; I would think that the browser
should URL-encode the '%'.

Hmm.  Better look it up.  RFC 1738
(which has been superseded) says:

   Octets must be encoded if they have no corresponding graphic
   character within the US-ASCII coded character set, if the use of the
   corresponding character is unsafe, or if the corresponding character
   is reserved for some other interpretation within the particular URL
   scheme.

The unsafe characters, including "~", are listed in the RFC, so we can
consider this to be a predefined list.  Technically, then,

is not an URL in the sense of RFC 1738.  However, apparently it's
acceptable in HTTP URLs because of a special rule for HTTP (from RFC
2396, which superseded RFC 1738):

   In some cases, data that could be represented by an unreserved
   character may appear escaped; for example, some of the unreserved
   "mark" characters are automatically escaped by some systems.  If the
   given URI scheme defines a canonicalization algorithm, then
   unreserved characters may be unescaped according to that algorithm.
   For example, "%7e" is sometimes used instead of "~" in an http URL
   path, but the two are equivalent for an http URL.
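
To make the equivalence concrete, here is a minimal Python sketch of one
possible canonicalization: it unescapes only those %XX sequences that
stand for RFC 2396 "unreserved" characters and leaves everything else
alone.  The function name and the "/%7elyle/" paths are made up for
illustration; this is not what any particular browser or server does.

```python
import re
import string

# RFC 2396 "unreserved" = alphanumerics plus the "mark" characters.
UNRESERVED = set(string.ascii_letters + string.digits + "-_.!~*'()")

_ESCAPE = re.compile(r"%([0-9A-Fa-f]{2})")


def unescape_unreserved(path: str) -> str:
    """Unescape %XX only where XX encodes an unreserved character.

    Hypothetical helper: one left-to-right pass, so the output of one
    substitution is never rescanned and nothing is unescaped twice.
    """
    def repl(match):
        ch = chr(int(match.group(1), 16))
        return ch if ch in UNRESERVED else match.group(0)

    return _ESCAPE.sub(repl, path)


# "%7e" and "%7E" both come out as "~"; the escaped "/" is left untouched.
print(unescape_unreserved("/%7elyle/index.html"))
print(unescape_unreserved("/%7Elyle/a%2Fb"))
```

Reserved characters such as "/" stay escaped on purpose, since unescaping
them would change how the path is split into segments.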

So you can legally write it either way.  To know exactly what's going
on (ie, what gets canonicalized where), you'd have to read the HTTP
RFC 2616.[1]  The bottom line seems to be that the practice of escaping
"~" in HTTP URLs goes back to people trying to comply with RFC 1738,
or maybe (as Brett suggested) so that you can type the URL using only
characters appearing as labels on your keyboard.  However, today you
can use either form, with "~" being recommended.

The authors go on to say:

   Because the percent "%" character always has the reserved purpose of
   being the escape indicator, it must be escaped as "%25" in order to
   be used as data within a URI.  Implementers should be careful not to
   escape or unescape the same string more than once, since unescaping
   an already unescaped string might lead to misinterpreting a percent
   data character as another escaped character, or vice versa in the
   case of escaping an already escaped string.

Translation into language we can all understand: MUZUKASHII DA YO NE!!
(Roughly: "It's difficult, isn't it!")
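
For the curious, the two failure modes the RFC warns about can be
reproduced with Python's standard urllib.parse.  This is only a sketch;
the sample strings "100%" and "%7e" are invented for the demonstration.

```python
from urllib.parse import quote, unquote

# Escaping twice: the "%" introduced by the first pass is escaped again.
data = "100%"
escaped_once = quote(data)             # "%" becomes "%25"
escaped_twice = quote(escaped_once)    # the "%" in "%25" is escaped again

# Unescaping twice: data that literally contains "%7e" turns into "~".
literal = "%7e"                        # a real percent sign followed by "7e"
on_the_wire = quote(literal)           # correctly escaped form of that data
decoded_once = unquote(on_the_wire)    # back to the literal "%7e"
decoded_twice = unquote(decoded_once)  # misread as an escaped "~"

print(escaped_once, escaped_twice)
print(on_the_wire, decoded_once, decoded_twice)
```

The safe rule is the one the RFC gives: escape exactly once when building
the URL, and unescape exactly once when taking it apart.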

[1]  Actually, you have to read between the lines, because 2396
doesn't define any "unsafe" characters but 2616 refers to the unsafe
characters as defined by 2396!

School of Systems and Information Engineering
University of Tsukuba                    Tennodai 1-1-1 Tsukuba 305-8573 JAPAN
               Ask not how you can "do" free software business;
              ask what your business can "do for" free software.
