Mailing List Archive



Re: [tlug] [OT] Strip Kanji from a document for study purposes



Hi!

On Tue, 18 Jul 2006 18:54:59 +0900
Dave M G <martin@example.com> wrote:

> What I'd like to do is take a Japanese document and convert it into a 
> list of the kanji included, and a list of words. Ideally repetitions 
> would be removed, as would particles and other grammatical
> inflections. Hiragana and katakana words could be dropped too.
> ...
> Any thoughts or comments on how to achieve this would be appreciated.

You will need to tokenize the Japanese text. kakasi is said to be
able to do this, though I have never used it.
I doubt that any existing software does exactly what you described,
so you will probably need to do some scripting/programming
yourself.
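
As a rough starting point for the scripting part, here is a minimal
Python sketch that uses only the standard library. It is not real
tokenization: it just treats every unbroken run of kanji as a word
candidate and drops everything else, so hiragana, katakana, and
particles disappear, but so do okurigana (e.g. 楽しい is reduced to 楽).
The function names and the character range (the basic CJK Unified
Ideographs block plus the iteration mark 々) are my own assumptions for
illustration; a proper tokenizer such as kakasi would be needed for
accurate word boundaries.

```python
import re

# Assumption: "kanji" means the basic CJK Unified Ideographs block
# (U+4E00..U+9FFF) plus the iteration mark 々 (U+3005).
# Rarer characters in the extension blocks are not covered.
KANJI_RUN = re.compile(r"[\u4e00-\u9fff\u3005]+")
ITERATION_MARK = "\u3005"


def unique_in_order(items):
    """Drop repetitions while keeping first-appearance order."""
    seen = set()
    result = []
    for item in items:
        if item not in seen:
            seen.add(item)
            result.append(item)
    return result


def kanji_runs(text):
    """Return unique runs of consecutive kanji (rough word candidates)."""
    return unique_in_order(KANJI_RUN.findall(text))


def kanji_chars(text):
    """Return the unique individual kanji, skipping the iteration mark."""
    chars = (
        ch
        for run in KANJI_RUN.findall(text)
        for ch in run
        if ch != ITERATION_MARK
    )
    return unique_in_order(chars)


if __name__ == "__main__":
    sample = "私は日本語を勉強します。日本語は楽しい。"
    print("\n".join(kanji_runs(sample)))
    print("".join(kanji_chars(sample)))
```

Both lists keep the order in which items first appear in the document,
which may be handy if you want to study the words in reading order.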

