Re: Font Encodings - Re: tlug: Java and Japanese



--------------------------------------------------------
tlug note from "Stephen J. Turnbull" <turnbull@example.com>
--------------------------------------------------------
I already sent a version of this to Craig; decided to clean it up and
pass it on to TLUG, TWIW.

On Thu, 28 Aug 1997, John Little wrote:

gaijin>% 
gaijin>% I'm not sure what the "8859_1" means.  Does anyone know?
gaijin>% 
gaijin>
gaijin>   ISO encoding 8859_1, usually known as "Western English" or
gaijin>   "Latin-1", as opposed to 8859_2, the encoding for "European
gaijin>   English". The latter includes codes for umlaut, cedilla, acute
gaijin>   and friends. Check out the X11 fonts directory (encoding).

This is not exactly true.  In fact ISO 8859-[1234] (Latin-[1234]) all
have the accents and such for the major European languages; each is
tweaked for languages with a small number of speakers.  ISO 8859-[5678]
are completely revamped for Cyrillic, Arabic, Greek and Hebrew (scripts
that share no letters with the Latin alphabet, so these sets have no
room left for the accented Latin glyphs).  ISO 8859-9 and 8859-10
(Latin-5 and Latin-6) each target a narrow need (Turkish and the Nordic
languages, respectively) and can still handle most majors.  (Source:
Nishikimi, et al.  Maruchiringaru Kankyou no Jitsugen.  Prentice-Hall.)

>>>>> "Craig" == Craig Oda <craig@example.com> writes:

    Craig> That's weird.  I wonder why I have to specify
    Craig> 8859_1?  I asked Tsurui-san about this and he said that he
    Craig> read it on the Java mailing list in reference to the JDBC.
    Craig> This is the same thing I was reading.  There really wasn't
    Craig> an explanation of why it was needed.  Tsurui-san thought it
    Craig> was the specification for unicode.

Nah.  The specifications of Unicode and ISO Latin-1 CAN'T matter
(mostly), because they are unrelated to the semantics of this program
as long as the conversions are invertible.  I.e., the only things that
are relevant are that (1) the servlet NEVER produces
non-Latin-1-equivalent Unicode characters; (2) Latin-1 to Unicode is
one-to-one; (3) none of the bytes in the stream are non-Latin-1.

(1a) HTTP/1.x specifies that unless otherwise stated by a Content-Type
    header, HTTP message bodies (including POSTs) MUST (caps in the
    RFC 2068 :) be presumed to be ISO-8859-1.  Therefore if
    HttpServletRequest is correctly implemented, POSTs from broken
    clients will be interpreted by default as ISO-8859-1.
(1b) A Java program automatically converts strings into Unicode; by
    (1a), the servlet package must tell Java that the input is Latin-1.
(2) By specification.
(3) By specification (Latin-1 uses all 256 code points; no byte is out 
    of domain).
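
To make (2) and (3) concrete, here is a minimal sketch, assuming a
modern JDK (java.nio.charset.StandardCharsets did not exist in 1997).
The class and method names are hypothetical, made up for illustration;
this is not Craig's code.  It checks that every byte value survives a
Latin-1 decode/encode round trip, and that text which was decoded as
Latin-1 by mistake can be re-decoded with the charset it really used.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical demo class; the names are illustrative only.
class Latin1RoundTrip {
    // Decode bytes as ISO-8859-1 into a (Unicode) String, then encode back.
    static byte[] roundTrip(byte[] raw) {
        String decoded = new String(raw, StandardCharsets.ISO_8859_1);
        return decoded.getBytes(StandardCharsets.ISO_8859_1);
    }

    // All 256 possible byte values, 0x00 through 0xFF.
    static byte[] allByteValues() {
        byte[] all = new byte[256];
        for (int i = 0; i < 256; i++) {
            all[i] = (byte) i;
        }
        return all;
    }

    // Text wrongly decoded as Latin-1 is turned back into its original
    // bytes, which are then decoded with the charset actually used.
    static String redecode(String misdecoded, Charset actual) {
        byte[] original = misdecoded.getBytes(StandardCharsets.ISO_8859_1);
        return new String(original, actual);
    }

    public static void main(String[] args) {
        byte[] all = allByteValues();
        System.out.println("all 256 bytes survive: "
                + Arrays.equals(all, roundTrip(all)));

        // "Nihongo" in kanji, written as escapes to keep the source ASCII.
        String japanese = "\u65E5\u672C\u8A9E";
        String misdecoded = new String(
                japanese.getBytes(StandardCharsets.UTF_8),
                StandardCharsets.ISO_8859_1);
        System.out.println("recovered: "
                + redecode(misdecoded, StandardCharsets.UTF_8).equals(japanese));
    }
}
```

UTF-8 stands in for the Japanese encoding here only because every JVM
is guaranteed to ship it; the same trick applies to any byte-oriented
encoding, since the Latin-1 step never loses a byte.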

The hole in (1) generates the bug: when a properly internationalized
client sends, e.g., an ISO-2022-JP Content-Type or a UTF-? Content-Type,
the servlet package should (if HttpServletRequest is properly
implemented) produce Unicode Japanese out of those.  (Under the default
assumption it produces not real Unicode text but a 16-bit encoding of
the 8-bit bytes according to the Latin-1->Unicode tables. :-)  This
Unicode Japanese should then bomb (out of range) on back-conversion to
ISO-8859-1 in Craig's code.
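
The failure mode can be sketched as follows, again assuming a modern
JDK and hypothetical names (I have not seen how the servlet package or
Craig's code does the conversion).  A strict CharsetEncoder reports
characters that have no ISO-8859-1 encoding instead of converting them.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; the names are illustrative only.
class Latin1BackConversion {
    // True if every character of text has an ISO-8859-1 encoding.
    static boolean fitsLatin1(String text) {
        return StandardCharsets.ISO_8859_1.newEncoder().canEncode(text);
    }

    // Strict encode: a fresh encoder reports unmappable characters by
    // throwing rather than substituting a replacement byte.
    static byte[] encodeStrict(String text) throws CharacterCodingException {
        CharsetEncoder encoder = StandardCharsets.ISO_8859_1.newEncoder();
        ByteBuffer buffer = encoder.encode(CharBuffer.wrap(text));
        byte[] out = new byte[buffer.remaining()];
        buffer.get(out);
        return out;
    }

    public static void main(String[] args) {
        String latin = "caf\u00E9";              // e-acute is in Latin-1
        String japanese = "\u65E5\u672C\u8A9E";  // kanji are not

        System.out.println("latin fits: " + fitsLatin1(latin));
        System.out.println("japanese fits: " + fitsLatin1(japanese));
        try {
            encodeStrict(japanese);
            System.out.println("encoded without error");
        } catch (CharacterCodingException e) {
            System.out.println("back-conversion failed: " + e);
        }
    }
}
```

Note that the convenience call String.getBytes(Charset) does not throw
in this situation; it silently substitutes '?' for each unmappable
character, so whether the program bombs or quietly corrupts the text
depends on which conversion path is used.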

Having no knowledge of servlets, I don't know how to handle this.

Ciao

Steve
-- 
                            Stephen J. Turnbull
Institute of Policy and Planning Sciences                    Yaseppochi-Gumi
University of Tsukuba                      http://turnbull.sk.tsukuba.ac.jp/
Tel: +81 (298) 53-5091;  Fax: 55-3849              turnbull@example.com
Next TLUG meeting is Saturday October 11, 1997