
Mailing List Archive
[tlug] [OT] Strip Kanji from a document for study purposes
>>>>> "Dave" == Dave M G <Dave> writes:
Dave> TLUG, (This message includes utf8 encoded Japanese text)
Dave> Apologies for being off the topic of Linux, but I'm
Dave> hoping I can draw upon the undeniable expertise in handling
Dave> Japanese encoded documents present here on this list. For
Dave> the task I describe below, the members of this list may be
Dave> the foremost authorities.
Dave> There may be existing software that does what I'm looking
Dave> for, but I haven't seen it. If you know of a suitable
Dave> Linux-based application, please let me know.
Dave> What I'd like to do is take a Japanese document and convert
Dave> it into a list of the kanji included, and a list of
Dave> words. Ideally repetitions would be removed, as would
Dave> particles and other grammatical inflections. Hiragana and
Dave> katakana words could be dropped too.
Dave> My ultimate goal would be to create a list that has
Dave> definitions and readings. But, if that's too complex, then
Dave> the next best thing would be to just have a list of words
Dave> and individual kanji that I could look up on my own (perhaps
Dave> with some kind of clever use of regular expressions or
Dave> something?)
Dave> So, for example, take the following Japanese text:
Dave> これは日本語だ。もう一回「日本語」が書いてある。この文章から、
Dave> 順番で漢字の表を作りたい。出来る、かな?
Dave> Ideally I'd like to make two documents from it. The first
Dave> would be a list of the words:
Dave> 日本語 - (にほんご) - Japanese language
Dave> 一回 - (いっかい) - One time
Dave> 書く - (かく) - Write
Dave> 文章 - (ぶんしょう) - text
Dave> 順番 - (じゅんばん) - in order
Dave> 漢字 - (かんじ) - kanji characters
Dave> 表 - (ひょう) - chart
Dave> 作る - (つくる) - to make
Dave> 出来る - (できる) - possible
Dave> I can see there might be complexities, like, for example,
Dave> where 書いてある becomes 書く. However, I'm not expecting
Dave> perfection. If it output 書いて or some other variant,
Dave> that wouldn't be the end of the world.
Dave> Also, I realize that outputting dictionary definitions and
Dave> hiragana phonetics might be less clean than what my example
Dave> shows. But as close to that as possible would be nice.
Dave> The second list would be just the kanji:
Dave> 日 - (に、ひ、にち) - Sun
Dave> 本 - (ほん、き、ぎ) - Tree
Dave> 語 - (ご、はなす) - Language, talk
Dave> 一 - (いち、いっ) - One
Dave> 回 - (まわる、かい) - (Number of) times.
Dave> ... and so on. I won't reproduce them all, as it's clear
Dave> what I'm after.
Dave> Again, I figure what I'm using as an example would be a
Dave> little messier in practice, what with all the 'kun' and 'on'
Dave> readings and multiple meanings.
Dave> But, given that programs like rikaichan do such an admirable
Dave> job of pulling definitions out of text, surely going one
Dave> step further and trapping that output into some kind of list
Dave> is do-able.
Dave> If worst comes to worst, as mentioned before, and what I
Dave> describe is too ambitious, then somehow extracting a simple
Dave> list of words or kanji, or even just one of those, would be
Dave> good.
Dave> Any thoughts or comments on how to achieve this would be
Dave> appreciated.
I don't know if it does everything you are looking for, but jgloss
certainly has some of the capabilities you want.
You can find it at http://jgloss.sourceforge.net/.
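For the fallback case mentioned above (just a plain list of kanji, or of
kanji words, with repetitions removed), a few lines of Python may already be
enough. What follows is only a minimal sketch, not a replacement for jgloss.
It assumes the text is available as a Unicode string, it treats only the
basic CJK Unified Ideographs block (U+4E00 to U+9FFF) as kanji, and the
function names are made up for illustration. It does no de-inflection and
adds no readings or definitions, so 書いてある yields 書 rather than 書く.

```python
import re

# Assumption: "kanji" here means the basic CJK Unified Ideographs block only.
# Rarer extension blocks and the iteration mark 々 are ignored in this sketch.
KANJI_RE = re.compile(r"[\u4e00-\u9fff]")
KANJI_RUN_RE = re.compile(r"[\u4e00-\u9fff]+")


def unique_in_order(items):
    """Drop repetitions while keeping the order of first appearance."""
    return list(dict.fromkeys(items))


def kanji_list(text):
    """Return each distinct kanji in the order it first appears."""
    return unique_in_order(KANJI_RE.findall(text))


def kanji_runs(text):
    """Return each distinct run of consecutive kanji.

    This is a rough stand-in for a word list: kana (particles, okurigana)
    are dropped, so verbs come out as bare stems such as 書 or 作.
    """
    return unique_in_order(KANJI_RUN_RE.findall(text))


# The example text from the original message.
SAMPLE = (
    "これは日本語だ。もう一回「日本語」が書いてある。この文章から、"
    "順番で漢字の表を作りたい。出来る、かな?"
)

if __name__ == "__main__":
    print(" ".join(kanji_list(SAMPLE)))
    print(" ".join(kanji_runs(SAMPLE)))
```

On the example text this gives the sixteen distinct kanji starting with
日 本 語 一 回, and the runs 日本語 一回 書 文章 順番 漢字 表 作 出来. Each entry
could then be looked up by hand or in a dictionary file.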
Marcus
--
/--------------------------------------------------------------------\
| Dr. Marcus O.C. Metzler | |
| mocm@example.com | http://www.metzlerbros.de/ |
\--------------------------------------------------------------------/
|>>> Quis custodiet ipsos custodes <<<|