Mailing List Archive

Re: Counting hiragana in EUC



>> From: Andreas Marcel Riechert <riechert@example.com>
>> Date: 04 Feb 2001 17:24:38 +0100
>> 
>> Simon Cozens <simon@example.com> writes:
>>  
>> Segmenting a Japanese phrase into words (read "word" as lemmata) is certainly
>> very important, e.g. for an automatic dictionary-lookup routine.

Tell me about it  8-)}

>> For segmenting incoming hiragana text in a meaningful way, part-of-speech/
>> morphological segmentation or bunsetsu segmentation seems IMHO to be a
>> more promising approach, but I am always happy to get new creative input.

In the text-glossing function in my dictionary server, I take a
quick-and-dirty approach of (a) ignoring hiragana entirely, on the
grounds that (i) the user should already know the particles, stock words
and phrases, etc., and (ii) it's all too hard, and (b) driving the rest
through the dictionary lookup process, i.e. I rely on a rolling
dictionary match to indicate the segmentation for me. This means that my
segmenter turns out to be about 50 lines of C over and above the
dictionary code that is already there. A far cry from the massive
morphological analysis software around.
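
In outline it looks something like the sketch below. This is only a
sketch, not the real server code: dict_lookup() here is a stub standing
in for the actual EDICT lookup, and only the plain two-byte EUC-JP case
is handled (no SS2/SS3 sequences).

#include <stdio.h>
#include <string.h>

/* Stand-in dictionary: a couple of EUC-JP headwords. */
static const char *dict[] = {
    "\xC6\xFC\xCB\xDC\xB8\xEC",   /* NIHONGO (Japanese language) */
    "\xC6\xFC\xCB\xDC",           /* NIHON (Japan) */
    NULL
};

static int dict_lookup(const unsigned char *s, size_t len)
{
    for (int i = 0; dict[i]; i++)
        if (strlen(dict[i]) == len && memcmp(dict[i], s, len) == 0)
            return 1;
    return 0;
}

/* In EUC-JP, hiragana are the byte pairs 0xA4 0xA1 .. 0xA4 0xF3. */
static int is_hiragana(const unsigned char *p)
{
    return p[0] == 0xA4 && p[1] >= 0xA1 && p[1] <= 0xF3;
}

/* Skip ASCII and hiragana; everywhere else emit the longest
   dictionary match, or step over one character if nothing matches. */
static void segment(const unsigned char *text, size_t n)
{
    for (size_t i = 0; i < n; ) {
        if (text[i] < 0x80)        { i += 1; continue; }
        if (i + 1 >= n) break;     /* truncated multibyte character */
        if (is_hiragana(text + i)) { i += 2; continue; }

        size_t best = 0;
        for (size_t len = 2; i + len <= n; len += 2)
            if (dict_lookup(text + i, len))
                best = len;

        if (best) {
            printf("match: %.*s\n", (int)best, (const char *)(text + i));
            i += best;
        } else {
            i += 2;
        }
    }
}

int main(void)
{
    /* "NIHONGO wa" in EUC-JP: the kanji compound gets the longest
       match (NIHONGO beats NIHON), the hiragana "wa" is skipped. */
    const unsigned char s[] = "\xC6\xFC\xCB\xDC\xB8\xEC\xA4\xCF";
    segment(s, sizeof s - 1);
    return 0;
}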

I'm planning to extend all this to (a) handle a constrained set of
hiragana-only words, where they can unambiguously be identified, and (b)
address the issue of single-kanji prefixes and suffixes.
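
For (a), even a small whitelist would probably do the job; something
like the following hypothetical table (EUC-JP strings, with the entries
invented purely for illustration):

#include <string.h>
#include <stddef.h>

/* Hypothetical whitelist of unambiguous hiragana-only words. */
static const char *kana_ok[] = {
    "\xA4\xB3\xA4\xEC",   /* kore */
    "\xA4\xBD\xA4\xEC",   /* sore */
    NULL
};

/* Return nonzero if the len-byte hiragana run is whitelisted. */
int kana_word_ok(const unsigned char *s, size_t len)
{
    for (int i = 0; kana_ok[i]; i++)
        if (strlen(kana_ok[i]) == len && memcmp(kana_ok[i], s, len) == 0)
            return 1;
    return 0;
}

A hiragana run would then be passed through to the dictionary only when
kana_word_ok() accepts it, instead of being skipped outright.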

Jim
-- 
Jim Breen  [jwb@example.com  http://www.csse.monash.edu.au/~jwb/]
Visiting Professor, Institute for the Study of Languages and Cultures of 
Asia and Africa, Tokyo University of Foreign Studies, Japan
+81 3 5974 3880         [ジム・ブリーン@東京外大]

