Re: tlug: Java and Japanese e-mail

To: tlug@example.com
Subject: Re: tlug: Java and Japanese e-mail
From: "Stephen J. Turnbull" <turnbull@example.com>
Date: Mon, 25 Aug 1997 11:44:51 +0900
In-reply-to: Your message of "Sun, 24 Aug 1997 18:58:57 +0900." <Pine.HPP.3.95.970824184319.20376A-100000@example.com>
Reply-To: tlug@example.com
Sender: owner-tlug

>>>>> "Craig" == Craig Oda <craig@example.com> writes:

    >> From Todd Rudick, I got this bit of information:

    Craig> Java uses Unicode internally, and has full support for many
    Craig> encodings externally. For example, to convert an array of
    Craig> bytes that contains SJIS portions to a Unicode string, you
    Craig> can type: uniString = new String(sjisByteArray, "SJIS");

    Craig> To convert that from Unicode to a JIS byte array: byte[]
    Craig> jisByteArray = uniString.getBytes("JIS");

Hey, this is way cool.  But this is the easy part.
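
For the record, the round trip Todd describes might look like the
sketch below. The class name CharsetRoundTrip and the sample bytes
are mine, purely for illustration. I am also assuming the runtime
ships converters under the names "Shift_JIS" and "ISO-2022-JP"; the
"SJIS" and "JIS" names quoted above should be aliases for them, but
check your JDK.

```java
import java.nio.charset.Charset;

class CharsetRoundTrip {
    // Assumed sample input: the three kanji of "nihongo" in Shift_JIS.
    static final byte[] SJIS_SAMPLE = {
        (byte) 0x93, (byte) 0xFA, (byte) 0x96, (byte) 0x7B, (byte) 0x8C, (byte) 0xEA
    };

    // External SJIS bytes -> internal Unicode string.
    static String fromSjis(byte[] sjisByteArray) {
        return new String(sjisByteArray, Charset.forName("Shift_JIS"));
    }

    // Internal Unicode string -> external JIS (ISO-2022-JP) bytes.
    static byte[] toJis(String uniString) {
        return uniString.getBytes(Charset.forName("ISO-2022-JP"));
    }

    public static void main(String[] args) {
        String uniString = fromSjis(SJIS_SAMPLE);
        byte[] jisByteArray = toJis(uniString);
        System.out.println(uniString.length() + " chars, "
                + jisByteArray.length + " JIS bytes");
    }
}
```

Note that the caller has to name the external encoding in both
directions; nothing in the byte array itself says which one it is.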

    Craig> My problem right now is that I can't figure out how to get
    Craig> the servlet to receive a byte stream from the HTTP server.

The short answer is there is a primitive InputStream class in the
java.io package.  Use that.
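
As a rough sketch of what I mean: slurp the raw bytes first and
decide about the charset afterwards. The class and method names
(RawBytes, readAll) are made up for this example, and a
ByteArrayInputStream stands in for whatever stream the server hands
your servlet.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;

class RawBytes {
    // Read everything from the stream without any charset conversion.
    static byte[] readAll(InputStream in) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[4096];
        int n;
        while ((n = in.read(buf)) != -1) {
            out.write(buf, 0, n);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for the request body; a real servlet would pass its own stream.
        byte[] body = { (byte) 0x93, (byte) 0xFA, 0x3D, 0x31 };
        byte[] raw = readAll(new ByteArrayInputStream(body));
        System.out.println(raw.length + " raw bytes read");
    }
}
```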

    Craig> I have these methods:

Where?  Did you write them?  Inherit them?  Anyway....

What you probably need to do is
(1) Use the raw InputStream attached to the server.  Probably you can
    get your hands on it with getInputStream.
(2a) (Happy days are here again, Content-Type comes in handy
    again) Get the charset from the charset parameter of the
    Content-Type header, if the client bothers to send one (HTTP
    does not require it).
(2b) (Oh damn, no such header) Use some kind of heuristic (OS of
    client eg) to get the charset (this loses generically, I bet,
    but would be a hack to start with) OR
(2c) bind that InputStream to some kind of FilterInputStream that uses
    jconv.c-like code to guess the charset (warning, this is a
    heuristic, not an algorithm: there are byte streams that are
    valid as both EUC-JP and SJIS; the shorter the stream, the more
    likely this is to happen) AND back this up by being ready to try
    multiple charset interpretations if your first guess fails (eg,
    if the file doesn't exist under the SJIS interpretation, try
    converting to EUC)
(3) Do what you've been doing:

    Craig> I've been using  getParameter("name");
    Craig> For example, if an HTML form is 

but you read from the already-converted stream, not the raw stream,
and you must be prepared to backtrack to a second-guess charset if
interpreting the request as a given charset fails.
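
Steps (2) and (3) could be sketched roughly like this. Everything
here (the CharsetGuess class and its method names) is invented for
illustration, and it assumes the header value and the raw bytes have
already been pulled out of the request. It takes the charset named in
the header if there is one, then falls back to a list of guesses,
using a strict decoder so that byte sequences that are invalid in a
charset are rejected instead of silently replaced.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class CharsetGuess {
    // Returns the charset parameter of a Content-Type value, or null if absent.
    static String charsetFromContentType(String contentType) {
        if (contentType == null) return null;
        for (String part : contentType.split(";")) {
            String p = part.trim();
            if (p.regionMatches(true, 0, "charset=", 0, 8)) {
                String v = p.substring(8).trim();
                if (v.length() >= 2 && v.startsWith("\"") && v.endsWith("\"")) {
                    v = v.substring(1, v.length() - 1);
                }
                return v.isEmpty() ? null : v;
            }
        }
        return null;
    }

    // Decodes strictly; returns null if the bytes are not valid in that charset
    // or the charset name is unknown to this runtime.
    static String decodeStrict(byte[] raw, String charsetName) {
        try {
            return Charset.forName(charsetName).newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(raw)).toString();
        } catch (CharacterCodingException e) {
            return null;
        } catch (IllegalArgumentException e) {
            return null;
        }
    }

    // Tries the declared charset first, then each guess in order.
    static String decodeWithFallback(byte[] raw, String contentType, List<String> guesses) {
        List<String> candidates = new ArrayList<String>();
        String declared = charsetFromContentType(contentType);
        if (declared != null) candidates.add(declared);
        candidates.addAll(guesses);
        for (String name : candidates) {
            String s = decodeStrict(raw, name);
            if (s != null) return s;
        }
        return null;
    }

    public static void main(String[] args) {
        byte[] raw = { (byte) 0x93, (byte) 0xFA, (byte) 0x96, (byte) 0x7B };
        String s = decodeWithFallback(raw, null, Arrays.asList("EUC-JP", "Shift_JIS"));
        System.out.println(s == null ? "no candidate fit" : "decoded " + s.length() + " chars");
    }
}
```

A strict decode that succeeds only proves the bytes are legal in that
charset, not that the guess was right, so the application-level
backtracking described above is still needed.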

Warning: this can be very expensive computationally.
Eg, if it's a name of a person that you're ADDING to a database, you'd
have to do a lookup in the Edict "names" dictionary, say.  However, my
experience has been that the human eye IMMEDIATELY recognizes bogus
encodings, mainly because the displayed text contains lots of hankaku
kana.  Provide a warning on the form that "use of hankaku kana will
possibly cause you to get bogus results", and then reject any encoding
that implies implicit hankaku.  (Ie, permit things like <FONT
KANAWIDTH="HANKAKU"> or the escape sequence equivalent.)  For general
text processing, you might be able to use edict.el- or Wnn-like
grammatical tests to check that it "looks like" Japanese.  (This would 
probably barf on names of government departments and such though
without that expensive dictionary lookup.)
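
The hankaku kana test is cheap to automate once a candidate decoding
is in hand. A minimal sketch, assuming the decoded text is a Java
String and that "hankaku kana" means the Unicode halfwidth katakana
block (U+FF61 through U+FF9F); the class and method names are
hypothetical.

```java
class HankakuCheck {
    // True if the string contains any halfwidth katakana or halfwidth
    // CJK punctuation (code points U+FF61 through U+FF9F).
    static boolean containsHankakuKana(String s) {
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c >= 0xFF61 && c <= 0xFF9F) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Escapes keep this source file pure ASCII.
        String suspicious = "\uFF86\uFF8E\uFF9D"; // halfwidth katakana
        String plausible = "\u65e5\u672c\u8a9e";  // kanji
        System.out.println(containsHankakuKana(suspicious));
        System.out.println(containsHankakuKana(plausible));
    }
}
```

A decoding that trips this check would be rejected and the next
candidate charset tried.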

I'm sorry it's such a mess---welcome to Japanese information
processing.  Blame JIS, not me.

    Craig> Thus, the String is not usable.

Not without external encoding information.  (External in the sense
that you need to pass it to the getBytes(String encoding) method.)

    Craig> I'm going to take a rest on this Japanese problem and look
    Craig> at kaffe again after dinner.

I'm going to lunch after drinking some kaffe.  :-)

Steve
Next TLUG meeting is Saturday October 11, 1997
