[tlug] font encoding question

Date: Sat, 16 Jun 2007 08:30:58 -0700
From: steven smith <sjs@example.com>
Subject: [tlug] font encoding question
User-agent: Thunderbird 2.0.0.0 (Windows/20070326)

Hi all

I'm about to do my first CGI program as a result of an
earlier discussion.  In that I asked about onyomi/kunyomi
and whether I should attempt to memorize them.  The answer
was yes and after paying more attention to the kanji and how
they were pronounced in words, I understand why.  But how
to memorize at least a couple of hundred
kanji/onyomi/kunyomi... that's a problem.

I'm using a nice little open-source memorization program
called Mnemosyne that allows import of its "flash cards" in
various formats including XML.  What I want to do is
generate the XML file using a CGI with various radio buttons
to determine what the output is to contain, and a text window
where the user inputs their kanji, using info from KANJD212 to
decipher the kanji.

I was assuming I'd just split the input from the text window
and do a simple table lookup.  All of this is to be done in
perl and I've done all of it except the lookup before as
various little stand-alone utilities.  I haven't done a lot
of CGI recently and have also done little with UTF-8, but it
doesn't sound too difficult.  I expected to do a simple
compare on the input character values and throw out
anything that didn't look like kanji.
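
To make the "throw out anything that didn't look like kanji" step
concrete, here is a minimal sketch.  It assumes the input has already
been decoded from bytes into Perl characters, and it uses the Unicode
script property \p{Han} as the test for "looks like kanji".  The name
han_only is made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: return only the characters of an already-decoded
# string that belong to the Unicode Han script.  Kana, Latin letters,
# digits, punctuation and white space are all dropped.
sub han_only {
    my ($text) = @_;
    return grep { /\p{Han}/ } split //, $text;
}
```

Called in list context, han_only returns one list element per kanji,
in input order, ready for the table lookup.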

Then a note went by between Josh Glover and Jim Breen about
problems Josh was having.  It turned out that part of the
problem is font encoding, and I hadn't even considered font
encoding.  I just assumed that the user's input would be
UTF-8 like my script.

So here are the questions:
1) How do I handle the user input?  I plan on storing
KANJD212 in a hash with the kanji as keys.  Can I just split
the table input and throw out anything not in the KANJD212 hash?
2) How do I handle errors?

What I'm leaning toward is just saying "input must be UTF-8"
and praying that it is.  I'd do a split on the input to pull
out the individual characters and throw out white space.
I'd then look through the result, compare the characters against
the KANJD212 input (stored as a hash), and warn the user that
characters didn't convert if there are problems.
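
As a rough sketch of that plan, under a few assumptions: the form
data arrives as raw bytes, the core Encode module does a strict UTF-8
decode (so "praying" becomes an explicit check), and %kanji_info is a
two-entry stand-in for the real hash built from KANJD212.  The names
classify_input and %kanji_info are made up for illustration.

```perl
use strict;
use warnings;
use utf8;                      # this source file itself is UTF-8
use Encode qw(decode FB_CROAK);

# Hypothetical stand-in for the hash loaded from KANJD212: kanji => info.
my %kanji_info = (
    "日" => "sun, day",
    "本" => "book, origin",
);

# Takes the raw bytes from the form field and a reference to the table.
# Returns (\@found, \@unknown), or an empty list if the bytes are not
# valid UTF-8.
sub classify_input {
    my ($raw_bytes, $table) = @_;

    # Strict decode: FB_CROAK makes decode die on malformed input
    # instead of silently substituting replacement characters.
    my $text = eval { decode('UTF-8', $raw_bytes, FB_CROAK) };
    return unless defined $text;

    my (@found, @unknown);
    for my $char (split //, $text) {
        next if $char =~ /\s/;          # throw out white space
        if (exists $table->{$char}) {
            push @found, $char;
        }
        else {
            push @unknown, $char;       # candidates for a warning
        }
    }
    return (\@found, \@unknown);
}
```

A caller would write my ($found, $unknown) = classify_input($bytes,
\%kanji_info); and treat an undefined $found as "input was not
UTF-8", while a non-empty @$unknown lists the characters to warn
about.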

Does this sound like a good approach, and is it sufficient?

I did a few Google searches on "font encoding" and "determine",
but nothing interesting turned up.

My background is that I have several years of writing perl
and feel confident of my abilities, but haven't done much
CGI and almost nothing using non-ASCII.  This is new stuff
for me.  And to be honest, right now, my main push is to
learn reading/writing/speaking enough Japanese to build a
foundation to learn on.  I'd like to come over there
(California -> Japan) to work for a couple of years before
retiring -- whatever that means.  This is a utility I
thought the community might find useful.

Thanks
Steve S.
