Mailing List Archive



Re: [tlug] [OT] Strip Kanji from a document for study purposes



Dave wrote:

> What I'd like to do is take a Japanese document and convert it into a 
> list of the kanji included, and a list of words. Ideally repetitions 
> would be removed, as would particles and other grammatical inflections. 
> Hiragana and katakana words could be dropped too.

Here are a few crumbs of ideas. 

   tr ' \t\r' '\n\n\n' < document | grep 'KANJI_REGEX' | sort -u

I don't know how to craft a regex to pass only kanji. 
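One possible way to fill that gap, as a minimal sketch rather than a definitive answer: match on Unicode code point ranges instead of a locale-dependent grep pattern. The assumption here is that "kanji" means the CJK Unified Ideographs block (U+4E00 to U+9FFF) plus Extension A (U+3400 to U+4DBF); rarer characters outside those ranges would be missed. The function names unique_kanji and kanji_runs are made up for illustration.

```python
import re

# Assumption: "kanji" = CJK Unified Ideographs (U+4E00-U+9FFF)
# plus Extension A (U+3400-U+4DBF). Kana and punctuation fall outside.
KANJI = re.compile(r"[\u3400-\u4DBF\u4E00-\u9FFF]")
KANJI_RUN = re.compile(r"[\u3400-\u4DBF\u4E00-\u9FFF]+")


def unique_kanji(text):
    """Return the distinct kanji in text, in order of first appearance."""
    # dict.fromkeys drops repeats while keeping insertion order.
    return list(dict.fromkeys(KANJI.findall(text)))


def kanji_runs(text):
    """Return the distinct runs of consecutive kanji in text.

    A run is only a rough stand-in for a word: it is cut wherever
    kana or punctuation begins, so trailing hiragana is dropped.
    """
    return list(dict.fromkeys(KANJI_RUN.findall(text)))


if __name__ == "__main__":
    sample = "私は日本語を勉強します。日本の漢字は難しい。"
    print("".join(unique_kanji(sample)))
    print("\n".join(kanji_runs(sample)))
```

Because the runs are cut at the first kana, particles and hiragana endings disappear as a side effect, but this is not real word segmentation: compounds written partly in kana are split, and adjacent kanji words are merged into one run.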

Removing particles and other grammatical inflections might be 
a significant project in itself. 


