Mailing List Archive
Re: [tlug] [OT] Strip Kanji from a document for study purposes
- Date: Tue, 18 Jul 2006 20:01:47 +0200
- From: Botond Botyanszki <tlug@example.com>
- Subject: Re: [tlug] [OT] Strip Kanji from a document for study purposes
- References: <44BCAFF3.6030604@example.com> <20060718122803.058f4525.jep200404@example.com>

On Tue, 18 Jul 2006 12:28:03 -0400 Jim <jep200404@example.com> wrote:

> > What I'd like to do is take a Japanese document and convert it into a
> > list of the kanji included, and a list of words. Ideally repetitions
> > would be removed, as would particles and other grammatical inflections.
> > Hiragana and katakana words could be dropped too.
>
> Removing particles and other grammatical inflections might be
> a significant project in itself.

Removing particles and inflections isn't that hard because these are hiragana following the kanji. Tokenizing is the tricky part where a sentence can contain words in kanji which are not delimited by kana. Consider this string for example:

技術的課題

You need to split this into two words before you can feed it to a dictionary.
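
To illustrate the point, here is a rough Python sketch, not a finished tool. It assumes kanji can be approximated by the basic CJK Unified Ideographs range (U+4E00 to U+9FFF), and the helper names kanji_runs and unique_kanji are made up for this example. It drops all kana, which takes care of particles and inflections, and collects the kanji. What it cannot do is split a run of consecutive kanji into words.

```python
import re

# Assumption: "kanji" means the basic CJK Unified Ideographs block only.
KANJI_RUN = re.compile(r"[\u4e00-\u9fff]+")

def kanji_runs(text):
    """Return maximal runs of consecutive kanji, in first-seen order, without repeats."""
    return list(dict.fromkeys(KANJI_RUN.findall(text)))

def unique_kanji(text):
    """Return each distinct kanji in the text, in first-seen order."""
    return list(dict.fromkeys("".join(KANJI_RUN.findall(text))))

if __name__ == "__main__":
    sample = "技術的課題を解決する"
    print(kanji_runs(sample))    # kana (を, する) are gone, kanji runs remain
    print(unique_kanji(sample))  # the individual kanji for study
```

For the sample sentence the kana fall away as expected, but 技術的課題 still comes back as a single run. Splitting it into dictionary words needs a real tokenizer (a morphological analyzer with its own dictionary), which is the hard part.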