Mailing List Archive

Re: Counting hiragana in EUC



>> From: Andreas Marcel Riechert <riechert@example.com>
>> Date: 04 Feb 2001 17:24:38 +0100
>> 
>> Simon Cozens <simon@example.com> writes:
>>  
>> Segmenting a Japanese phrase into words (read "word" as lemmata) is certainly
>> very important, e.g. for an automatic dictionary-lookup routine.

Tell me about it  8-)}

>> For segmenting incoming hiragana text in a meaningful way, part-of-speech/
>> morphological segmentation or bunsetsu segmentation seems IMHO to be a
>> more promising approach, but I am always happy to get new creative input.

In the text-glossing function in my dictionary server, I take a
quick-and-dirty approach of (a) ignoring hiragana entirely, on the
grounds that (i) the user should already know the particles, stock words
and phrases, etc., and (ii) it's all too hard, and (b) driving the rest
through the dictionary lookup process, i.e. I rely on a rolling
dictionary match to indicate the segmentation for me. This means that my
segmenter turns out to be about 50 lines of C over and above the
dictionary code that is already there. A far cry from the massive
morphological analysis software around.
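
In outline it looks something like the sketch below. This is only a
sketch, not the real server code: dict_lookup() here is a stub standing
in for the actual EDICT lookup, and only the plain two-byte EUC-JP case
is handled (no SS2/SS3 sequences).

#include <stdio.h>
#include <string.h>

/* Stand-in dictionary: a couple of EUC-JP headwords. */
static const char *dict[] = {
    "\xC6\xFC\xCB\xDC\xB8\xEC",   /* NIHONGO (Japanese language) */
    "\xC6\xFC\xCB\xDC",           /* NIHON (Japan) */
    NULL
};

static int dict_lookup(const unsigned char *s, size_t len)
{
    for (int i = 0; dict[i]; i++)
        if (strlen(dict[i]) == len && memcmp(dict[i], s, len) == 0)
            return 1;
    return 0;
}

/* In EUC-JP, hiragana are the byte pairs 0xA4 0xA1 .. 0xA4 0xF3. */
static int is_hiragana(const unsigned char *p)
{
    return p[0] == 0xA4 && p[1] >= 0xA1 && p[1] <= 0xF3;
}

/* Skip ASCII and hiragana; everywhere else emit the longest
   dictionary match, or step over one character if nothing matches. */
static void segment(const unsigned char *text, size_t n)
{
    for (size_t i = 0; i < n; ) {
        if (text[i] < 0x80)        { i += 1; continue; }
        if (i + 1 >= n) break;     /* truncated multibyte character */
        if (is_hiragana(text + i)) { i += 2; continue; }

        size_t best = 0;
        for (size_t len = 2; i + len <= n; len += 2)
            if (dict_lookup(text + i, len))
                best = len;

        if (best) {
            printf("match: %.*s\n", (int)best, (const char *)(text + i));
            i += best;
        } else {
            i += 2;
        }
    }
}

int main(void)
{
    /* "NIHONGO wa" in EUC-JP: the kanji compound gets the longest
       match (NIHONGO beats NIHON), the hiragana "wa" is skipped. */
    const unsigned char s[] = "\xC6\xFC\xCB\xDC\xB8\xEC\xA4\xCF";
    segment(s, sizeof s - 1);
    return 0;
}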

I'm planning to extend all this to (a) handle a constrained set of
hiragana-only words, where they can unambiguously be identified, and (b)
address the issue of single-kanji prefixes and suffixes.
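
For (a), even a small whitelist would probably do the job; something
like the following hypothetical table (EUC-JP strings, with the entries
invented purely for illustration):

#include <string.h>
#include <stddef.h>

/* Hypothetical whitelist of unambiguous hiragana-only words. */
static const char *kana_ok[] = {
    "\xA4\xB3\xA4\xEC",   /* kore */
    "\xA4\xBD\xA4\xEC",   /* sore */
    NULL
};

/* Return nonzero if the len-byte hiragana run is whitelisted. */
int kana_word_ok(const unsigned char *s, size_t len)
{
    for (int i = 0; kana_ok[i]; i++)
        if (strlen(kana_ok[i]) == len && memcmp(kana_ok[i], s, len) == 0)
            return 1;
    return 0;
}

A hiragana run would then be passed through to the dictionary only when
kana_word_ok() accepts it, instead of being skipped outright.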

Jim
-- 
Jim Breen  [jwb@example.com  http://www.csse.monash.edu.au/~jwb/]
Visiting Professor, Institute for the Study of Languages and Cultures of 
Asia and Africa, Tokyo University of Foreign Studies, Japan
+81 3 5974 3880         [ジム・ブリーン@東京外大]

