Mailing List Archive



Re: [tlug] [OT] Strip Kanji from a document for study purposes



Dave M G wrote:


There may be existing software that does what I'm looking for, but I haven't seen it. If you know of a suitable Linux-based application, please let me know.

What I'd like to do is take a Japanese document and convert it into a list of the kanji included, and a list of words. Ideally repetitions would be removed, as would particles and other grammatical inflections. Hiragana and katakana words could be dropped too.


Try Juman:

http://nlp.kuee.kyoto-u.ac.jp/nl-resource/juman.html

Here's a CGI to try it out:

http://nlp.kuee.kyoto-u.ac.jp/nl-resource/juman-form.html

It doesn't do everything you want out of the box, but it's pretty powerful, and with a bit of scripting and piping you should be able to get what you want. (It has a Perl module, I think.)
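
To give a rough idea of the scripting side, here is a minimal Python sketch of the post-processing step. It does not call Juman at all: it assumes you already have the raw text, plus a list of word strings produced by whatever segmenter you use. The function names are made up for illustration, and "kanji" is approximated as the basic CJK Unified Ideographs block (U+4E00 to U+9FFF), so the iteration mark 々 and rarer extension characters are not counted.

```python
def is_kanji(ch):
    # Assumption: "kanji" means the basic CJK Unified Ideographs block only.
    return "\u4e00" <= ch <= "\u9fff"


def unique_kanji(text):
    """Return each kanji in text once, in order of first appearance."""
    seen = {}
    for ch in text:
        if is_kanji(ch):
            seen.setdefault(ch, None)
    return list(seen)


def kanji_words(tokens):
    """Keep words containing at least one kanji, dropping repeats.

    tokens is assumed to be a list of already-segmented words, e.g.
    taken from a morphological analyzer's output. Pure hiragana and
    katakana words (including most particles) fall out automatically.
    """
    seen = {}
    for tok in tokens:
        if any(is_kanji(c) for c in tok):
            seen.setdefault(tok, None)
    return list(seen)


if __name__ == "__main__":
    print(unique_kanji("日本語の本を読む"))
    print(kanji_words(["日本語", "の", "本", "を", "読む", "本"]))
```

Filtering on "contains a kanji" is a crude stand-in for removing particles by part of speech. If the analyzer reports dictionary forms and parts of speech, feeding those in instead of surface forms would also take care of the inflections.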


