Re: tlug: Java and Japanese e-mail
- To: tlug@example.com
- Subject: Re: tlug: Java and Japanese e-mail
- From: "Stephen J. Turnbull" <turnbull@example.com>
- Date: Mon, 25 Aug 1997 11:44:51 +0900
- In-reply-to: Your message of "Sun, 24 Aug 1997 18:58:57 +0900." <Pine.HPP.3.95.970824184319.20376A-100000@example.com>
- Reply-To: tlug@example.com
- Sender: owner-tlug
--------------------------------------------------------
tlug note from "Stephen J. Turnbull" <turnbull@example.com>
--------------------------------------------------------
>>>>> "Craig" == Craig Oda <craig@example.com> writes:

    >> From Todd Rudick, I got this bit of information:

    Craig> Java uses Unicode internally, and has full support for many
    Craig> encodings externally.  For example, to convert an array of
    Craig> bytes that contains SJIS portions to a Unicode string, you
    Craig> can type:
    Craig>     uniString = new String(sjisByteArray, "SJIS");
    Craig> To convert that from Unicode to a JIS byte array:
    Craig>     byte[] jisByteArray = uniString.getBytes("JIS");

Hey, this is way cool.  But this is the easy part.

    Craig> My problem right now is that I can't figure out how to get
    Craig> the servlet to receive a byte stream from the HTTP server.

The short answer is that there is a primitive InputStream class in the
java.io package.  Use that.

    Craig> I have these methods:

Where?  Did you write them?  Inherit them?  Anyway....

What you probably need to do is

(1) Use the raw InputStream attached to the server.  Probably you can
    get your hands on it with getInputStream.

(2a) (Happy days are here again, the Content-Type header comes in
    handy again)  Get the charset from the charset parameter of the
    Content-Type header, when the client sends one.

(2b) (Oh damn, no such header)  Use some kind of heuristic (e.g., the
    OS of the client) to get the charset.  (This loses generically, I
    bet, but would be a hack to start with.)

    OR

(2c) Bind that InputStream to some kind of FilteredStream that uses
    jconv.c-like code to get the charset.  (Warning: this is not an
    algorithm, in the sense that there are byte streams that are valid
    as both EUC-8 and SJIS; the shorter the stream, the more likely
    this is to happen.)

    AND back this up by being ready to try multiple charset
    interpretations if your first guess fails (e.g., if the file
    doesn't exist under the SJIS interpretation, try converting to
    EUC).

(3) Do what you've been doing:

    Craig> I've been using getParameter("name");
    Craig> For example, if an HTML form is

    but read from the pre-converted stream, not the raw stream, and be
    prepared to backtrack to a second-guess charset if interpreting
    the request in the first charset fails.

Warning: this is potentially very expensive computationally.  E.g., if
it's the name of a person that you're ADDING to a database, you'd have
to do a lookup in the Edict "names" dictionary, say.

However, my experience has been that the human eye IMMEDIATELY
recognizes bogus encodings, mainly because the displayed text contains
lots of hankaku kana.  Provide a warning on the form that "use of
hankaku kana will possibly cause you to get bogus results", and then
reject any encoding that implies implicit hankaku.  (I.e., permit
things like <FONT KANAWIDTH="HANKAKU"> or the escape sequence
equivalent.)

For general text processing, you might be able to use edict.el- or
Wnn-like grammatical tests to check that it "looks like" Japanese.
(Without that expensive dictionary lookup, though, this would probably
barf on names of government departments and such.)

I'm sorry it's such a mess---welcome to Japanese information
processing.  Blame JIS, not me.

    Craig> Thus, the String is not usable.

Not without external encoding information.  (External in the sense
that you need to pass it to the getBytes(String encoding) method.)

    Craig> I'm going to take a rest on this Japanese problem and look
    Craig> at kaffe again after dinner.

I'm going to lunch after drinking some kaffe.  :-)

Steve

Next TLUG meeting is Saturday October 11, 1997
-----------------------------------------------------------------
a word from the sponsor will appear below
TWICS - Japan's First Public-Access Internet System.
www.twics.com  info@example.com  Tel:03-3351-5977  Fax:03-3353-6096
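
Steps (1) through (3) of the message above can be sketched in Java
roughly as follows.  This is a minimal sketch under stated assumptions,
not code from the thread: the class CharsetGuess and all of its method
names are hypothetical, the candidate order (Shift_JIS, then EUC-JP) is
an arbitrary choice, and it assumes a Java runtime that ships the
extended Japanese charsets.  It decodes strictly, so a wrong guess
raises an exception instead of quietly producing replacement
characters, and it rejects any decoding that contains hankaku kana, as
the message suggests.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;

// Hypothetical helper; none of these names come from the thread.
class CharsetGuess {

    // Step (1): drain the raw stream into bytes before any decoding.
    public static byte[] readAll(InputStream in) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[4096];
        int n;
        while ((n = in.read(buf)) != -1) {
            out.write(buf, 0, n);
        }
        return out.toByteArray();
    }

    // Decode strictly: malformed or unmappable input throws instead of
    // being replaced with U+FFFD.
    public static String decodeStrict(byte[] data, String charsetName)
            throws CharacterCodingException {
        return Charset.forName(charsetName)
                .newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT)
                .decode(ByteBuffer.wrap(data))
                .toString();
    }

    // True if the text contains halfwidth (hankaku) katakana,
    // i.e. any char in the range U+FF61..U+FF9F.
    public static boolean hasHankakuKana(String s) {
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c >= 0xFF61 && c <= 0xFF9F) {
                return true;
            }
        }
        return false;
    }

    // Steps (2c) and (3): try each candidate in order and backtrack to
    // the next one when a guess fails.  Returns null if no candidate
    // gives an acceptable decoding.
    public static String guessDecode(byte[] data, String... candidates) {
        for (String name : candidates) {
            try {
                String text = decodeStrict(data, name);
                if (!hasHankakuKana(text)) {
                    return text;
                }
            } catch (CharacterCodingException e) {
                // Not valid in this charset; try the next guess.
            }
        }
        return null;
    }

    public static void main(String[] args) throws IOException {
        // "nihongo" written with Unicode escapes, encoded as EUC-JP to
        // stand in for bytes arriving from a client.
        String original = "\u65E5\u672C\u8A9E";
        byte[] eucBytes = original.getBytes(Charset.forName("EUC-JP"));
        InputStream raw = new ByteArrayInputStream(eucBytes);

        String decoded = guessDecode(readAll(raw), "Shift_JIS", "EUC-JP");
        System.out.println(original.equals(decoded));
    }
}
```

With the candidates in that order, EUC-JP input fails the Shift_JIS
attempt (its high bytes come out as hankaku kana or as invalid
sequences) and is picked up on the second try.  Short inputs can still
be valid in both charsets, so the result remains a guess, and a caller
would still want the dictionary or lookup checks described above.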
- References:
- Re: tlug: Java and Japanese e-mail
- From: Craig Oda <craig@example.com>