TLUG Mailing List

[tlug] [OT] Strip Kanji from a document for study purposes

Date: Tue, 18 Jul 2006 18:54:59 +0900

From: Dave M G <martin@example.com>

Subject: [tlug] [OT] Strip Kanji from a document for study purposes

User-agent: Thunderbird 1.5.0.4 (X11/20060615)
TLUG,

(This message includes utf8 encoded Japanese text)
Apologies for being off the topic of Linux, but I'm hoping I can draw upon the undeniable expertise in handling Japanese-encoded documents present here on this list. For the task I describe below, the members of this list may be the foremost authorities.
There may be existing software that does what I'm looking for, but I haven't seen it. If you know of a suitable Linux-based application, please let me know.
What I'd like to do is take a Japanese document and convert it into a list of the kanji it contains and a list of its words. Ideally, repetitions would be removed, as would particles and other grammatical inflections. Hiragana and katakana words could be dropped too.
My ultimate goal would be to create a list that has definitions and readings. But if that's too complex, then the next best thing would be to just have a list of words and individual kanji that I could look up on my own (perhaps with some kind of clever use of regular expressions or something?).
So, for example, take the following Japanese text:
これは日本語だ。もう一回「日本語」が書いてある。この文章から、順番で漢字の表を作りたい。出来る、かな？
Ideally I'd like to make two documents from it. The first would be a list of the words:
日本語 - (にほんご) - Japanese language
一回 - (いっかい) - One time
書く - (かく) - Write
文章 - (ぶんしょう) - text
順番 - (じゅんばん) - in order
漢字 - (かんじ) - kanji characters
表 - (ひょう) - chart
作る - (つくる) - to make
出来る - (できる) - possible
I can see there might be complexities, for example where 書いてある becomes 書く. However, I'm not expecting perfection. If it output 書いて or some other variant, that wouldn't be the end of the world.
Also, I realize that outputting dictionary definitions and hiragana phonetics might be less clean than what my example shows. But as close to that as possible would be nice.
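As a rough illustration of the regular-expression fallback, the sketch below collects runs of consecutive kanji and drops repeats. It assumes that "kanji" means the basic CJK Unified Ideographs block (U+4E00 to U+9FFF), and the name `kanji_runs` is made up for this example. It does not recover dictionary forms: 書いてある yields only 書, and getting 書く would need a morphological analyzer, which this sketch does not attempt.

```python
import re

# Assumption: kanji are approximated by the basic CJK Unified Ideographs
# block (U+4E00 to U+9FFF). Kana and punctuation fall outside this range.
KANJI_RUN = re.compile(r"[\u4e00-\u9fff]+")


def kanji_runs(text):
    """Return each run of consecutive kanji once, in order of first appearance.

    This is only a crude stand-in for a word list: okurigana are cut off,
    so verbs such as 書いて come out as the bare stem kanji 書.
    """
    # dict.fromkeys keeps insertion order, so it deduplicates without sorting.
    return list(dict.fromkeys(KANJI_RUN.findall(text)))


if __name__ == "__main__":
    sample = (
        "これは日本語だ。もう一回「日本語」が書いてある。"
        "この文章から、順番で漢字の表を作りたい。出来る、かな？"
    )
    for word in kanji_runs(sample):
        print(word)
```

On the sample sentence this gives 日本語, 一回, 書, 文章, 順番, 漢字, 表, 作 and 出来, which is close to the hand-made list apart from the verb endings.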
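To show the output format in code, here is a minimal sketch that formats words against a small in-memory glossary. The `GLOSSARY` table and the `format_entry` helper are hypothetical and hold only a few entries copied from the example above; a real version would need an actual dictionary file, which is not assumed here.

```python
# Hypothetical glossary: word -> (reading, meaning).
# The entries are copied from the example list in this message.
GLOSSARY = {
    "日本語": ("にほんご", "Japanese language"),
    "一回": ("いっかい", "One time"),
    "文章": ("ぶんしょう", "text"),
}


def format_entry(word, glossary):
    """Format one line as 'word - (reading) - meaning'.

    Words missing from the glossary are marked with '?' so they can be
    looked up by hand later.
    """
    entry = glossary.get(word)
    if entry is None:
        return f"{word} - (?) - ?"
    reading, meaning = entry
    return f"{word} - ({reading}) - {meaning}"


if __name__ == "__main__":
    for word in ["日本語", "一回", "表"]:
        print(format_entry(word, GLOSSARY))
```

Marking unknown words instead of skipping them keeps the list complete, so nothing is silently lost when the glossary has gaps.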
The second list would be just the kanji:
日 - (に、ひ、にち) - Sun
本 - (ほん、もと) - Book, origin
語 - (ご、かたる) - Language, talk
一 - (いち、いっ) - One
回 - (まわる、かい) - (Number of) times.

... and so on. I won't reproduce them all, as it's clear what I'm after.
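The bare kanji list, without readings or meanings, can be sketched in a few lines. This assumes the same approximation of kanji as the basic CJK Unified Ideographs block (U+4E00 to U+9FFF), so rarer characters and the repetition mark 々 are ignored; `unique_kanji` is a name invented for this example.

```python
import re

# Assumption: one kanji is any single character in U+4E00 to U+9FFF.
KANJI = re.compile(r"[\u4e00-\u9fff]")


def unique_kanji(text):
    """Return each kanji in text once, in order of first appearance."""
    seen = set()
    result = []
    for ch in KANJI.findall(text):
        if ch not in seen:
            seen.add(ch)
            result.append(ch)
    return result


if __name__ == "__main__":
    sample = (
        "これは日本語だ。もう一回「日本語」が書いてある。"
        "この文章から、順番で漢字の表を作りたい。出来る、かな？"
    )
    print(" ".join(unique_kanji(sample)))
```

Readings and meanings would still have to come from a kanji dictionary; this step only produces the characters to look up.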
Again, I figure my example would be a little messier in practice, what with all the 'kun' and 'on' readings and multiple meanings.
But, given that programs like rikaichan do such an admirable job of pulling definitions out of text, surely going one step further and trapping that output into some kind of list is doable.
If worst comes to worst, and what I describe is too ambitious, then, as mentioned before, somehow extracting a simple list of words or kanji, or even just one of those, would be good.
Any thoughts or comments on how to achieve this would be appreciated.
Thank you. Please contact me off list if this is not of interest to the list as a whole. If the moderators decide to ask me not to discuss this kind of thing here, please accept my apologies in advance and I'll refrain in the future.
--
Dave M G
Follow-Ups:

Re: [tlug] [OT] Strip Kanji from a document for study purposes
From: Godwin Stewart

Re: [tlug] [OT] Strip Kanji from a document for study purposes
From: Botond Botyanszki

[tlug] [OT] Strip Kanji from a document for study purposes
From: Marcus Metzler

Re: [tlug] [OT] Strip Kanji from a document for study purposes
From: Nikolay Elenkov

Re: [tlug] [OT] Strip Kanji from a document for study purposes
From: Stephen J. Turnbull

Re: [tlug] [OT] Strip Kanji from a document for study purposes
From: Jim
