Mailing List Archive
[tlug] [OT] Strip Kanji from a document for study purposes
- Date: Tue, 18 Jul 2006 18:54:59 +0900
- From: Dave M G <martin@example.com>
- Subject: [tlug] [OT] Strip Kanji from a document for study purposes
- User-agent: Thunderbird 1.5.0.4 (X11/20060615)
TLUG,

(This message includes UTF-8 encoded Japanese text.)

Apologies for being off the topic of Linux, but I'm hoping I can draw upon the undeniable expertise in handling Japanese encoded documents present here on this list. For the task I describe below, the members of this list may be the foremost authorities.

There may be existing software that does what I'm looking for, but I haven't seen it. If you know of a suitable Linux-based application, please let me know.

What I'd like to do is take a Japanese document and convert it into a list of the kanji included, and a list of words. Ideally repetitions would be removed, as would particles and other grammatical inflections. Hiragana and katakana words could be dropped too.

My ultimate goal would be to create a list that has definitions and readings. But, if that's too complex, then the next best thing would be to just have a list of words and individual kanji that I could look up on my own (perhaps with some kind of clever use of regular expressions or something?)

So, for example, take the following Japanese text:

これは日本語だ。もう一回「日本語」が書いてある。この文章から、順番で漢字の表を作りたい。出来る、かな?

Ideally I'd like to make two documents from it. The first would be a list of the words:

日本語 - (にほんご) - Japanese language
一回 - (いっかい) - One time
書く - (かく) - Write
文章 - (ぶんしょう) - text
順番 - (じゅんばん) - in order
漢字 - (かんじ) - kanji characters
表 - (ひょう) - chart
作る - (つくる) - to make
出来る - (できる) - possible

I can see there might be complexities, like, for example, where 書いてある becomes 書く. However, I'm not expecting perfection. If it output 書いて or some other variant, that wouldn't be the end of the world.

Also, I realize that outputting dictionary definitions and hiragana phonetics might be less clean than what my example shows. But as close to that as possible would be nice.

The second list would be just the kanji:

日 - (に、ひ、にち) - Sun
本 - (ほん、き、ぎ) - Tree
語 - (ご、はなす) - Language, talk
一 - (いち、いっ) - One
回 - (まわる、かい) - (Number of) times.
... and so on.

I won't reproduce them all, as it's clear what I'm after.

Again, I figure what I'm using as an example would be a little messier in practice, what with all the 'kun' and 'on' readings, and multiple meanings.

But, given that programs like rikaichan do such an admirable job of pulling definitions out of text, surely going one step further and trapping that output into some kind of list is doable.

If worst comes to worst, as mentioned before, if what I describe is too ambitious, then somehow extracting a simple list of words or kanji, or even just one of those, would be good.

Any thoughts or comments on how to achieve this would be appreciated.

Thank you. Please contact me off list if this is not of interest to the list as a whole. If the moderators ask me not to discuss this kind of thing here, please accept my apologies in advance and I'll refrain in the future.

--
Dave M G
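The simplest fallback the message describes, a plain list of kanji and of kanji-based words with repetitions removed and original order kept, can be sketched with regular expressions alone. This is a minimal sketch, not an existing tool: the function names are hypothetical, the character range U+4E00 to U+9FFF is an assumption that covers common kanji but not rare extension blocks, and treating each unbroken run of kanji as a "word" is only an approximation. It does not recover dictionary forms such as 書く, nor readings or definitions; those would need a dictionary and a morphological analyzer.

```python
import re

# Assumption: common kanji live in the CJK Unified Ideographs block.
KANJI = re.compile(r"[\u4e00-\u9fff]")
# A run of kanji; the iteration mark 々 may follow the first character.
KANJI_RUN = re.compile(r"[\u4e00-\u9fff][\u4e00-\u9fff々]*")


def unique_in_order(items):
    """Drop repetitions while keeping the order of first appearance."""
    seen = set()
    result = []
    for item in items:
        if item not in seen:
            seen.add(item)
            result.append(item)
    return result


def kanji_list(text):
    """Individual kanji in the text, each listed once."""
    return unique_in_order(KANJI.findall(text))


def kanji_runs(text):
    """Unbroken kanji runs (rough word candidates), each listed once."""
    return unique_in_order(KANJI_RUN.findall(text))


# The sample sentence from the message above.
sample = ("これは日本語だ。もう一回「日本語」が書いてある。"
          "この文章から、順番で漢字の表を作りたい。出来る、かな?")

if __name__ == "__main__":
    print(" ".join(kanji_list(sample)))
    print(" ".join(kanji_runs(sample)))
```

On the sample text this yields 16 distinct kanji and 9 runs. Verbs come out as bare stems (書, 作, 出来) because the hiragana endings are dropped, which is the kind of imperfection the message says would be acceptable.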
- Follow-Ups:
- Re: [tlug] [OT] Strip Kanji from a document for study purposes
- From: Godwin Stewart
- Re: [tlug] [OT] Strip Kanji from a document for study purposes
- From: Botond Botyanszki
- [tlug] [OT] Strip Kanji from a document for study purposes
- From: Marcus Metzler
- Re: [tlug] [OT] Strip Kanji from a document for study purposes
- From: Nikolay Elenkov
- Re: [tlug] [OT] Strip Kanji from a document for study purposes
- From: Stephen J. Turnbull
- Re: [tlug] [OT] Strip Kanji from a document for study purposes
- From: Jim