Mailing List Archive



[tlug] [OT] Strip Kanji from a document for study purposes



TLUG,

(This message includes utf8 encoded Japanese text)

Apologies for being off the topic of Linux, but I'm hoping I can draw upon the undeniable expertise in handling Japanese-encoded documents present here on this list. For the task I describe below, the members of this list may be the foremost authorities.

There may be existing software that does what I'm looking for, but I haven't seen it. If you know of a suitable Linux-based application, please let me know.

What I'd like to do is take a Japanese document and convert it into a list of the kanji included, and a list of words. Ideally repetitions would be removed, as would particles and other grammatical inflections. Hiragana and katakana words could be dropped too.

My ultimate goal would be to create a list that has definitions and readings. But if that's too complex, then the next best thing would be to just have a list of words and individual kanji that I could look up on my own (perhaps with some kind of clever use of regular expressions or something?).
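
A rough sketch of the regular-expression idea, in Python. It only pulls out runs of consecutive kanji and drops repeats, so it knows nothing about word boundaries or inflection; the function name and the character range (the basic CJK Unified Ideographs block plus the 々 iteration mark) are assumptions made for illustration.

```python
import re

# Basic CJK Unified Ideographs block plus the iteration mark 々.
# Rarer kanji outside this block would be missed.
KANJI_RUN = re.compile(r"[\u4e00-\u9fff\u3005]+")

def kanji_runs(text):
    """Return each distinct run of consecutive kanji, in order of first appearance."""
    seen = set()
    runs = []
    for run in KANJI_RUN.findall(text):
        if run not in seen:
            seen.add(run)
            runs.append(run)
    return runs

sample = "これは日本語だ。もう一回「日本語」が書いてある。この文章から、順番で漢字の表を作りたい。出来る、かな?"
print(kanji_runs(sample))
```

On the example text used in this message it yields 日本語, 一回, 書, 文章, 順番, 漢字, 表, 作, 出来. Verbs come out as bare stems (書 rather than 書く), since recovering dictionary forms would need a morphological analyzer rather than a regular expression.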

So, for example, take the following Japanese text:

これは日本語だ。もう一回「日本語」が書いてある。この文章から、順番で漢字の表を作りたい。出来る、かな?

Ideally I'd like to make two documents from it. The first would be a list of the words:
日本語 - (にほんご) - Japanese language
一回 - (いっかい) - One time
書く - (かく) - Write
文章 - (ぶんしょう) - text
順番 - (じゅんばん) - in order
漢字 - (かんじ) - kanji characters
表 - (ひょう) - chart
作る - (つくる) - to make
出来る - (できる) - possible

I can see there might be complexities, for example where 書いてある becomes 書く. However, I'm not expecting perfection. If it output 書いて or some other variant, that wouldn't be the end of the world.

Also, I realize that outputting dictionary definitions and hiragana phonetics might be less clean than what my example shows. But as close to that as possible would be nice.

The second list would be just the kanji:
日 - (に、ひ、にち) - Sun
本 - (ほん、もと) - Book, origin
語 - (ご、かたる) - Language, talk
一 - (いち、いっ) - One
回 - (まわる、かい) - (Number of) times.

... and so on. I won't reproduce them all, as it's clear what I'm after.
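
The bare kanji list (without readings or meanings) can be sketched in a few lines of Python. This is only an illustration: the function name is made up, and it assumes that "kanji" means characters in the basic CJK Unified Ideographs block.

```python
def unique_kanji(text):
    """Return each distinct kanji in text, in order of first appearance."""
    seen = set()
    result = []
    for ch in text:
        # Assumption: only the basic CJK Unified Ideographs block counts as kanji.
        if "\u4e00" <= ch <= "\u9fff" and ch not in seen:
            seen.add(ch)
            result.append(ch)
    return result

sample = "これは日本語だ。もう一回「日本語」が書いてある。この文章から、順番で漢字の表を作りたい。出来る、かな?"
for kanji in unique_kanji(sample):
    print(kanji)
```

Readings and meanings would still have to come from a kanji dictionary file; this step only produces the characters to look up.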

Again, I figure what I'm using as an example would be a little messier in practice, what with all the 'kun' and 'on' readings and multiple meanings.

But, given that programs like rikaichan do such an admirable job of pulling definitions out of text, surely going one step further and trapping that output into some kind of list is doable.
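
One way the "trap it into a list" step might look, sketched in Python. It assumes dictionary data in the plain EDICT line layout (word, optional reading in square brackets, then glosses between slashes). The three sample lines are illustrative stand-ins written for this sketch, not verbatim dictionary entries, and all function names are hypothetical.

```python
import re

# EDICT-style layout: WORD [READING] /gloss/gloss/ (the reading is optional).
EDICT_LINE = re.compile(
    r"^(?P<word>\S+)(?: \[(?P<reading>[^\]]+)\])? /(?P<glosses>.*)/$"
)

def parse_edict_line(line):
    """Split one EDICT-style line into (word, reading, glosses), or None."""
    m = EDICT_LINE.match(line.strip())
    if m is None:
        return None
    glosses = [g for g in m.group("glosses").split("/") if g]
    return m.group("word"), m.group("reading"), glosses

def build_index(lines):
    """Map each headword to the (reading, glosses) of its first entry."""
    index = {}
    for line in lines:
        parsed = parse_edict_line(line)
        if parsed:
            word, reading, glosses = parsed
            index.setdefault(word, (reading, glosses))
    return index

def annotate(words, index):
    """Format each word as 'word - (reading) - glosses' for a study list."""
    rows = []
    for w in words:
        if w in index:
            reading, glosses = index[w]
            rows.append(f"{w} - ({reading}) - {'; '.join(glosses)}")
        else:
            rows.append(f"{w} - (?) - not found")
    return rows

# Illustrative sample lines only; a real run would read a dictionary file.
SAMPLE_LINES = [
    "日本語 [にほんご] /(n) Japanese (language)/",
    "文章 [ぶんしょう] /(n) sentence/text/",
    "漢字 [かんじ] /(n) kanji/Chinese character/",
]

index = build_index(SAMPLE_LINES)
for row in annotate(["日本語", "文章", "書"], index):
    print(row)
```

Words that are not exact headwords, such as a bare verb stem like 書, fall through as "not found", which is where the inflection handling mentioned earlier would have to come in.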

If worst comes to worst, as mentioned before, and what I describe is too ambitious, then somehow extracting a simple list of words or kanji, or even just one of those, would be good.

Any thoughts or comments on how to achieve this would be appreciated.

Thank you. Please contact me off list if this is not of interest to the list as a whole. If the moderators ask me not to discuss this kind of thing here, please accept my apologies in advance and I'll refrain in the future.

--
Dave M G
