
Mailing List Archive
[tlug] [OT] Strip Kanji from a document for study purposes
- Date: Tue, 18 Jul 2006 18:54:59 +0900
- From: Dave M G <martin@example.com>
- Subject: [tlug] [OT] Strip Kanji from a document for study purposes
- User-agent: Thunderbird 1.5.0.4 (X11/20060615)
TLUG,
(This message includes UTF-8 encoded Japanese text)
Apologies for being off the topic of Linux, but I'm hoping I can
draw upon the undeniable expertise in handling Japanese encoded
documents present here on this list. For the task I describe below, the
members of this list may be the foremost authorities.
There may be existing software that does what I'm looking for, but I
haven't seen it. If you know of a suitable Linux-based application,
please let me know.
What I'd like to do is take a Japanese document and convert it into a
list of the kanji included, and a list of words. Ideally repetitions
would be removed, as would particles and other grammatical inflections.
Hiragana and katakana words could be dropped too.
My ultimate goal would be to create a list that has definitions and
readings. But, if that's too complex, then the next best thing would be
to just have a list of words and individual kanji that I could look up
on my own (perhaps with some kind of clever use of regular expressions
or something?)
So, for example, take the following Japanese text:
これは日本語だ。もう一回「日本語」が書いてある。この文章から、順番で漢字
の表を作りたい。出来る、かな?
Ideally I'd like to make two documents from it. The first would be a
list of the words:
日本語 - (にほんご) - Japanese language
一回 - (いっかい) - One time
書く - (かく) - Write
文章 - (ぶんしょう) - text
順番 - (じゅんばん) - in order
漢字 - (かんじ) - kanji characters
表 - (ひょう) - chart
作る - (つくる) - to make
出来る - (できる) - possible
I can see there might be complexities, like, for example, where 書いてある
becomes 書く. However, I'm not expecting perfection. If it output
書いて or some other variant, that wouldn't be the end of the world.
Also, I realize that outputting dictionary definitions and hiragana
phonetics might be less clean than what my example shows. But as close
to that as possible would be nice.
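As a rough sketch of the regular-expression idea, the following pulls out each distinct run of consecutive kanji in order of first appearance. It assumes that "kanji" means the CJK Unified Ideographs block (U+4E00 to U+9FFF) plus the iteration mark 々, and the function name kanji_runs is only an illustration. It does no dictionary lookup and does not restore inflected verbs to dictionary form, so 書いてある yields just 書; getting 書く would need a morphological analyzer.

```python
import re

# Assumption: "kanji" = CJK Unified Ideographs (U+4E00-U+9FFF) plus the
# iteration mark 々. Rarer characters outside this block are ignored.
KANJI_RUN = re.compile(r"[\u4e00-\u9fff々]+")


def kanji_runs(text):
    """Return each distinct run of consecutive kanji, in order of first appearance."""
    seen = set()
    runs = []
    for run in KANJI_RUN.findall(text):
        if run not in seen:
            seen.add(run)
            runs.append(run)
    return runs


# The sample text from the message, joined onto one line.
sample = (
    "これは日本語だ。もう一回「日本語」が書いてある。"
    "この文章から、順番で漢字の表を作りたい。出来る、かな?"
)

if __name__ == "__main__":
    print("\n".join(kanji_runs(sample)))
```

On the sample text this gives 日本語, 一回, 書, 文章, 順番, 漢字, 表, 作 and 出来, one per line. Okurigana are dropped because the pattern matches kanji only.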
The second list would be just the kanji:
日 - (に、ひ、にち) - Sun
本 - (ほん、もと) - Book, origin
語 - (ご、かたる) - Language, talk
一 - (いち、いっ) - One
回 - (まわる、かい) - (Number of) times.
... and so on. I won't reproduce them all, as it's clear what I'm after.
Again, I figure what I'm using as an example would be a little messier
in practice, what with all the 'kun' and 'on' readings and multiple
meanings.
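The bare kanji list, without readings or meanings, needs no dictionary at all. A minimal sketch, again assuming that kanji are the characters in U+4E00 to U+9FFF and using the made-up name unique_kanji:

```python
def unique_kanji(text):
    """Return each kanji in text once, in order of first appearance."""
    seen = set()
    result = []
    for ch in text:
        # Assumption: a kanji is any character in the CJK Unified
        # Ideographs block, U+4E00 through U+9FFF.
        if "\u4e00" <= ch <= "\u9fff" and ch not in seen:
            seen.add(ch)
            result.append(ch)
    return result


# The sample text from the message, joined onto one line.
sample = (
    "これは日本語だ。もう一回「日本語」が書いてある。"
    "この文章から、順番で漢字の表を作りたい。出来る、かな?"
)

if __name__ == "__main__":
    for kanji in unique_kanji(sample):
        print(kanji)
```

Attaching readings and meanings to each character would still need a kanji dictionary file of some kind, which this sketch does not attempt.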
But, given that programs like rikaichan do such an admirable job of
pulling definitions out of text, surely going one step further and
trapping that output into some kind of list is do-able.
If worst comes to worst, as mentioned before, if what I describe is too
ambitious, then somehow extracting a simple list of words or kanji, or even
just one of those, would be good.
Any thoughts or comments on how to achieve this would be appreciated.
Thank you. Please contact me off list if this is not of interest to the
list as a whole. If the moderators would rather I not discuss this
kind of thing here, please accept my apologies in advance and I'll
refrain in the future.
--
Dave M G