Mailing List Archive


Re: Counting hiragana in EUC



On Mon, Feb 05, 2001 at 09:49:09AM +0900, Jim Breen wrote:
> In the text-glossing function in my dictionary server, I take a
> quick-and-dirty approach of (a) ignoring hiragana entirely, on the
> grounds that (i) the user should know the particles, stock words &
> phrases, etc already, and (ii) it's all too hard,

Fine for you. :) What I'm doing is trying to advance the state of the art in
input method environments. When you're dealing with IMEs, you effectively have
an input stream of unsegmented hiragana, and your aim is to produce
kana-majiri bun (mixed kanji and kana text). So, instead of using a simple dictionary lookup, I reckon you
could get a lot better accuracy by segmenting the input into kanji compounds
and non-kanji, and then using a selection algorithm to get the
appropriate kanji. My segmentation algorithm is working nicely on English
input, so I'm kinda giddy right now, but I haven't exposed it to Japanese text
just yet. That's tomorrow's job.
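
To make the segment-then-select idea concrete, here is a minimal sketch. It is not the segmentation algorithm described above, which isn't shown in this message; it swaps in plain greedy longest-match against a lexicon. The `LEXICON` table and the `segment` and `convert` names are hypothetical, and "selection" here just takes the first candidate.

```python
# Hypothetical toy lexicon: hiragana reading -> candidate kanji spellings.
# A real IME would draw on a full dictionary with frequency data.
LEXICON = {
    "にほんご": ["日本語"],
    "にほん": ["日本"],
    "べんきょう": ["勉強"],
    "がっこう": ["学校"],
}


def segment(kana, lexicon=LEXICON):
    """Greedy longest-match split of an unsegmented hiragana string.

    Returns a list of (chunk, candidates) pairs. candidates is empty for
    stretches that match no lexicon entry (particles, inflections, etc.).
    """
    max_len = max(map(len, lexicon))
    out = []
    plain = ""  # run of characters that matched nothing so far
    i = 0
    while i < len(kana):
        # Try the longest possible reading first, then shorter ones.
        for n in range(min(max_len, len(kana) - i), 0, -1):
            piece = kana[i:i + n]
            if piece in lexicon:
                if plain:
                    out.append((plain, []))
                    plain = ""
                out.append((piece, lexicon[piece]))
                i += n
                break
        else:
            # No reading starts here: keep the character as non-kanji text.
            plain += kana[i]
            i += 1
    if plain:
        out.append((plain, []))
    return out


def convert(kana, lexicon=LEXICON):
    """Naive selection step: take the first candidate for each kanji chunk."""
    return "".join(
        cands[0] if cands else chunk for chunk, cands in segment(kana, lexicon)
    )
```

Calling `convert("にほんごをべんきょうする")` with this toy lexicon yields `日本語を勉強する`. Greedy matching picks the wrong split whenever a shorter reading followed by other text was intended, which is where a smarter segmenter and a context-aware selection step would earn their keep.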

-- 
<Twofish> Pokemon seems an evil concept. Kid hunts animals, and takes
them from the wild into captivity, where he trains them to fight, and
then fights them to the death against other people's pokemon. Doesn't
this remind you of say, cock fighting?

