Mailing List Archive



[tlug] [OT] Strip Kanji from a document for study purposes



>>>>> "Dave" == Dave M G <Dave> writes:

    Dave> TLUG, (This message includes utf8 encoded Japanese text)

    Dave> Apologies for being off the topic of Linux, but I'm
    Dave> hoping I can draw upon the undeniable expertise in handling
    Dave> Japanese encoded documents present here on this list. For
    Dave> the task I describe below, the members of this list may be
    Dave> the foremost authorities.

    Dave> There may be existing software that does what I'm looking
    Dave> for, but I haven't seen it. If you know of a suitable Linux
    Dave> based application, please let me know.

    Dave> What I'd like to do is take a Japanese document and convert
    Dave> it into a list of the kanji included, and a list of
    Dave> words. Ideally repetitions would be removed, as would
    Dave> particles and other grammatical inflections. Hiragana and
    Dave> katakana words could be dropped too.

    Dave> My ultimate goal would be to create a list that has
    Dave> definitions and readings. But, if that's too complex, then
    Dave> the next best thing would be to just have a list of words
    Dave> and individual kanji that I could look up on my own (perhaps
    Dave> with some kind of clever use of regular expressions or
    Dave> something?)

    Dave> So, for example, take the following Japanese text:

    Dave> これは日本語だ。もう一回「日本語」が書いてある。この文章から、
    Dave> 順番で漢字の表を作りたい。出来る、かな?

    Dave> Ideally I'd like to make two documents from it. The first
    Dave> would be a list of the words:

    Dave> 日本語 - (にほんご) - Japanese language
    Dave> 一回 - (いっかい) - One time
    Dave> 書く - (かく) - Write
    Dave> 文章 - (ぶんしょう) - text
    Dave> 順番 - (じゅんばん) - in order
    Dave> 漢字 - (かんじ) - kanji characters
    Dave> 表 - (ひょう) - chart
    Dave> 作る - (つくる) - to make
    Dave> 出来る - (できる) - possible

    Dave> I can see there might be complexities, like, for example,
    Dave> where 書いてある becomes 書く. However, I'm not expecting
    Dave> perfection. If it output 書いて or some other variant,
    Dave> that wouldn't be the end of the world.

    Dave> Also, I realize that outputting dictionary definitions and
    Dave> hiragana phonetics might be less clean than what my example
    Dave> shows. But as close to that as possible would be nice.

    Dave> The second list would be just the kanji:

    Dave> 日 - (に、ひ、にち) - Sun
    Dave> 本 - (ほん、き、ぎ) - Tree
    Dave> 語 - (ご、はなす) - Language, talk
    Dave> 一 - (いち、いっ) - One
    Dave> 回 - (まわる、かい) - (Number of) times

    Dave> ... and so on. I won't reproduce them all, as it's clear
    Dave> what I'm after.

    Dave> Again, I figure what I'm using as an example would be a
    Dave> little messier in practice, what with all the 'kun' and
    Dave> 'on' readings and multiple meanings.

    Dave> But, given that programs like rikaichan do such an admirable
    Dave> job of pulling definitions out of text, surely going one
    Dave> step further and trapping that output into some kind of list
    Dave> is do-able.

    Dave> If worse comes to worst, as mentioned before, if what I
    Dave> describe is too ambitious, then somehow extracting a simple
    Dave> list of words or kanji, or even just one of those, would be
    Dave> good.

    Dave> Any thoughts or comments on how to achieve this would be
    Dave> appreciated.

I don't know if it does everything you are looking for, but jgloss
certainly has some of the capabilities you want.

You can find it at http://jgloss.sourceforge.net/.
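
For the simpler fallback you describe (a bare list of the kanji, with
repetitions removed), a few lines of Python and a regular expression
may already be enough. What follows is only a rough sketch, assuming
Python 3 and text that has already been decoded from UTF-8. The
function names are made up for illustration, and the character range
covers just the basic CJK Unified Ideographs block plus the iteration
mark 々.

```python
import re

# Assumption: "kanji" here means the basic CJK Unified Ideographs block
# (U+4E00..U+9FFF) plus the iteration mark 々 (U+3005). Rarer extension
# blocks are ignored in this sketch.
KANJI = r"[\u4e00-\u9fff\u3005]"


def unique_in_order(items):
    """Drop repetitions while keeping the order of first appearance."""
    return list(dict.fromkeys(items))


def kanji_list(text):
    """Individual kanji, in order of first appearance."""
    return unique_in_order(re.findall(KANJI, text))


def kanji_runs(text):
    """Consecutive runs of kanji: a crude stand-in for a word list."""
    return unique_in_order(re.findall(KANJI + "+", text))


# The sample text from the original message.
sample = ("これは日本語だ。もう一回「日本語」が書いてある。この文章から、"
          "順番で漢字の表を作りたい。出来る、かな?")

print(" ".join(kanji_list(sample)))
print(" ".join(kanji_runs(sample)))
```

The kanji runs are not real words: 書いてある comes out as 書 rather
than 書く, because the okurigana is simply dropped. Getting dictionary
forms, readings, and definitions would need a morphological analyser
and a dictionary on top of this.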

Marcus

-- 
/--------------------------------------------------------------------\
| Dr. Marcus O.C. Metzler        |                                   |
| mocm@example.com            | http://www.metzlerbros.de/        |
\--------------------------------------------------------------------/
 |>>>             Quis custodiet ipsos custodes                 <<<|

