tlug.jp Mailing List Archive
[tlug] [OT] Strip Kanji from a document for study purposes
- Date: Tue, 18 Jul 2006 12:40:03 +0200
- From: Marcus Metzler <mocm@example.com>
- Subject: [tlug] [OT] Strip Kanji from a document for study purposes
- References: <44BCAFF3.6030604@example.com>
>>>>> "Dave" == Dave M G <Dave> writes:

    Dave> TLUG, (This message includes utf8 encoded Japanese text)

    Dave> Apologies for being off the topic of Linux, but I'm
    Dave> hoping I can draw upon the undeniable expertise in handling
    Dave> Japanese encoded documents present here on this list. For
    Dave> the task I describe below, the members of this list may be
    Dave> the foremost authorities.

    Dave> There may be existing software that does what I'm looking
    Dave> for, but I haven't seen it. If you know of a suitable Linux
    Dave> based application, please let me know.

    Dave> What I'd like to do is take a Japanese document and convert
    Dave> it into a list of the kanji included, and a list of
    Dave> words. Ideally repetitions would be removed, as would
    Dave> particles and other grammatical inflections. Hiragana and
    Dave> katakana words could be dropped too.

    Dave> My ultimate goal would be to create a list that has
    Dave> definitions and readings. But, if that's too complex, then
    Dave> the next best thing would be to just have a list of words
    Dave> and individual kanji that I could look up on my own (perhaps
    Dave> with some kind of clever use of regular expressions or
    Dave> something?)

    Dave> So, for example, take the following Japanese text:

    Dave> これは日本語だ。もう一回「日本語」が書いてある。この文章から、
    Dave> 順番で漢字の表を作りたい。出来る、かな?

    Dave> Ideally I'd like to make two documents from it. The first
    Dave> would be a list of the words:

    Dave> 日本語 - (にほんご) - Japanese language
    Dave> 一回 - (いっかい) - One time
    Dave> 書く - (かく) - Write
    Dave> 文章 - (ぶんしょう) - text
    Dave> 順番 - (じゅんばん) - in order
    Dave> 漢字 - (かんじ) - kanji characters
    Dave> 表 - (ひょう) - chart
    Dave> 作る - (つくる) - to make
    Dave> 出来る - (できる) - possible

    Dave> I can see there might be complexities, like, for example,
    Dave> where 書いてある becomes 書く. However, I'm not expecting
    Dave> perfection. If it outputted 書いて or some other variant,
    Dave> that wouldn't be the end of the world.

    Dave> Also, I realize that outputting dictionary definitions and
    Dave> hiragana phonetics might be less clean than what my example
    Dave> shows. But as close to that as possible would be nice.

    Dave> The second list would be just the kanji:

    Dave> 日 - (に、ひ、にち) - Sun
    Dave> 本 - (ほん、き、ぎ) - Tree
    Dave> 語 - (ご、はなす) - Language, talk
    Dave> 一 - (いち、いっ) - One
    Dave> 回 - (まわる、かい) - (Number of) times.

    Dave> ... and so on. I won't reproduce them all, as it's clear
    Dave> what I'm after.

    Dave> Again, I figure what I'm using as an example would be a
    Dave> little messier in practice. What with all the 'kun' and 'on'
    Dave> readings, and multiple meanings.

    Dave> But, given that programs like rikaichan do such an admirable
    Dave> job of pulling definitions out of text, surely going one
    Dave> step further and trapping that output into some kind of list
    Dave> is do-able.

    Dave> If worse comes to worse, as mentioned before, if what I
    Dave> describe is too robust, then somehow extracting a simple
    Dave> list of words or kanji, or even just one of those, would be
    Dave> good.

    Dave> Any thoughts or comments on how to achieve this would be
    Dave> appreciated.

I don't know if it does everything you are looking for, but jgloss
certainly has some of the capabilities you want. You can find it at
http://jgloss.sourceforge.net/.

Marcus

--
/--------------------------------------------------------------------\
| Dr. Marcus O.C. Metzler         |                                  |
| mocm@example.com                | http://www.metzlerbros.de/       |
\--------------------------------------------------------------------/
|>>>                Quis custodiet ipsos custodes                 <<<|
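
For the fallback Dave mentions (just a plain list of kanji, no readings or definitions), a few lines of Python may already be enough. The following is only a rough sketch under stated assumptions: the function names are made up for illustration, "kanji" is taken to mean the basic CJK Unified Ideographs block plus the iteration mark 々, and a "word" is crudely approximated as a run of consecutive kanji. It does not recover dictionary forms such as 書く from 書いてある; that would need a morphological analyzer and a dictionary, which this sketch does not attempt.

```python
import re

# Assumption: "kanji" = basic CJK Unified Ideographs block (U+4E00-U+9FFF)
# plus the iteration mark 々 (U+3005). Rarer extension blocks are ignored.
KANJI_RUN = re.compile(r"[\u4e00-\u9fff\u3005]+")


def unique_in_order(items):
    """Drop repetitions while keeping the order of first appearance."""
    return list(dict.fromkeys(items))


def kanji_runs(text):
    """Distinct runs of consecutive kanji (a crude stand-in for 'words')."""
    return unique_in_order(KANJI_RUN.findall(text))


def single_kanji(text):
    """Distinct individual kanji, in order of first appearance."""
    return unique_in_order(ch for run in KANJI_RUN.findall(text) for ch in run)


if __name__ == "__main__":
    sample = ("これは日本語だ。もう一回「日本語」が書いてある。この文章から、"
              "順番で漢字の表を作りたい。出来る、かな?")
    print("\n".join(kanji_runs(sample)))
    print("\n".join(single_kanji(sample)))
```

Hiragana, katakana and punctuation fall outside the matched range, so particles and inflection endings drop out on their own; the price is that verbs come out as bare stems (書, 作, 出来) rather than 書く, 作る, 出来る.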
- References:
- [tlug] [OT] Strip Kanji from a document for study purposes
- From: Dave M G