
Mailing List Archive
Re: [tlug] [OT] Strip Kanji from a document for study purposes
- Date: Wed, 19 Jul 2006 00:06:01 +0900
- From: Nikolay Elenkov <goibniu@example.com>
- Subject: Re: [tlug] [OT] Strip Kanji from a document for study purposes
- References: <44BCAFF3.6030604@example.com>
- User-agent: Thunderbird 1.5.0.2 (X11/20060501)
Dave M G wrote:
> There may be existing software that does what I'm looking for, but I
> haven't seen it. If you know of a suitable Linux based application,
> please let me know.
>
> What I'd like to do is take a Japanese document and convert it into a
> list of the kanji included, and a list of words. Ideally repetitions
> would be removed, as would particles and other grammatical inflections.
> Hiragana and katakana words could be dropped too.

Try Juman:
http://nlp.kuee.kyoto-u.ac.jp/nl-resource/juman.html
Here's a CGI to try it out:
http://nlp.kuee.kyoto-u.ac.jp/nl-resource/juman-form.html
It doesn't do everything you want out of the box, but it's pretty
powerful and with a bit of scripting and piping you should be able to
get what you want. (It has a Perl module, I think.)
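
As a rough idea of the scripting part, here is a minimal Python sketch of the
post-processing step. It is not part of Juman. The function names are made up
for illustration, and it assumes the analyzer prints one morpheme per line as
space-separated fields with the dictionary (base) form in the third field,
"EOS" at the end of each sentence, and "@" in front of alternative analyses.
Check that against the output of your Juman version before relying on it.

```python
def is_kanji(ch):
    # Assumption: only the basic CJK Unified Ideographs block counts as kanji.
    # Extension blocks and the iteration mark (U+3005) are ignored here.
    return "\u4e00" <= ch <= "\u9fff"


def unique_kanji(text):
    # dict.fromkeys drops repetitions and keeps first-seen order.
    return list(dict.fromkeys(ch for ch in text if is_kanji(ch)))


def kanji_words(analyzer_lines, base_field=2):
    # base_field is the assumed 0-based position of the dictionary form.
    words = {}
    for line in analyzer_lines:
        line = line.strip()
        # Skip blanks, sentence terminators and alternative analyses.
        if not line or line == "EOS" or line.startswith("@"):
            continue
        fields = line.split(" ")
        if len(fields) <= base_field:
            continue
        word = fields[base_field]
        # Words written only in kana (particles, most endings) drop out here.
        if any(is_kanji(ch) for ch in word):
            words[word] = None
    return list(words)
```

Run unique_kanji() on the raw document text for the kanji list, and feed the
analyzer's output lines to kanji_words() for the word list. Taking the base
form rather than the surface form is what removes the inflections.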