Re: Counting hiragana in EUC
- To: tlug@example.com
- Subject: Re: Counting hiragana in EUC
- From: jwb@example.com (Jim Breen)
- Date: Mon, 5 Feb 2001 09:49:09 +0900 (JST)
- Reply-To: tlug@example.com
- Resent-From: tlug@example.com
- Resent-Message-ID: <Q-eseD.A.RDG.4gff6@example.com>
- Resent-Sender: tlug-request@example.com
>> From: Andreas Marcel Riechert <riechert@example.com>
>> Date: 04 Feb 2001 17:24:38 +0100
>>
>> Simon Cozens <simon@example.com> writes:
>>
>> Segmenting a Japanese phrase into words (read "word" as lemmata) is for sure
>> very important for e.g. an automatic dictionary-lookup routine.

Tell me about it 8-)}

>> For segmenting incoming hiragana text in a meaningful way, part-of-speech/
>> morphological segmentation or bunsetsu segmentation seems IMHO to be a
>> more promising approach, but I am always happy to get new creative input.

In the text-glossing function in my dictionary server, I take a
quick-and-dirty approach of
(a) ignoring hiragana entirely, on the grounds that
    (i) the user should know the particles, stock words & phrases, etc.
        already, and
    (ii) it's all too hard,
(b) driving the rest through the dictionary lookup process, i.e. I rely
    on a rolling dictionary match to indicate the segmentation for me.

This means that my segmenter turns out to be about 50 lines of C over
and above the dictionary code that is there already. A far cry from the
massive morphological analysis software around.

I'm planning to extend all this to
(a) handle a constrained set of hiragana-only words, where they can
    unambiguously be identified,
(b) address the issue of single-kanji prefixes and suffixes.

Jim

--
Jim Breen [jwb@example.com http://www.csse.monash.edu.au/~jwb/]
Visiting Professor, Institute for the Study of Languages and Cultures
of Asia and Africa, Tokyo University of Foreign Studies, Japan
+81 3 5974 3880 [ジム・ブリーン@東京外大]
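
[Archive note] The rolling dictionary match described in the message above could be sketched roughly as follows. This is not Jim's code: his segmenter is about 50 lines of C on top of an existing dictionary server and is not shown in the post. The sketch is in Python, works on already-decoded text rather than raw EUC bytes, and the names `is_hiragana`, `segment`, `max_len` and the tiny dictionary are all hypothetical. It also assumes a greedy longest-match-first policy, which the post does not spell out.

```python
def is_hiragana(ch: str) -> bool:
    """True for characters in the Unicode hiragana range U+3041..U+309F."""
    return "\u3041" <= ch <= "\u309f"


def segment(text: str, dictionary: set, max_len: int = 8) -> list:
    """Greedy longest-match segmentation that never starts a match on hiragana.

    Assumption: `dictionary` is a plain set of headwords, standing in for
    the real dictionary lookup. Returns (start_index, word) pairs.
    """
    matches = []
    i = 0
    while i < len(text):
        if is_hiragana(text[i]):
            # Step (a): particles and other hiragana are ignored entirely.
            i += 1
            continue
        # Step (b): try the longest candidate first, then shrink.
        for length in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + length]
            if candidate in dictionary:
                matches.append((i, candidate))
                i += length  # roll forward past the matched word
                break
        else:
            # No dictionary entry starts here; move on one character.
            i += 1
    return matches


# Hypothetical four-entry dictionary, for illustration only.
demo_dictionary = {"日本", "日本語", "辞書", "使う"}
demo_result = segment("日本語の辞書を使う", demo_dictionary)
# demo_result == [(0, "日本語"), (4, "辞書"), (7, "使う")]
```

For EUC-JP input, the bytes could be decoded first with `raw_bytes.decode("euc_jp")` before calling `segment`. A candidate that starts on a kanji may still run over following hiragana (as in 使う), so dictionary entries with okurigana are matched whole; whether the original C code behaves this way is an assumption.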
- Follow-Ups:
- Re: Counting hiragana in EUC
- From: Simon Cozens <simon@example.com>