tlug.jp Mailing List Archive
Re: [tlug] How to make a current running kanji compound list from the news
- Date: Fri, 22 Jul 2011 11:52:58 +0900
- From: Raymond Wan <rwan.kyoto@example.com>
- Subject: Re: [tlug] How to make a current running kanji compound list from the news
- References: <CA+kCxRZTuMYGzURV2Rm2k4F59oymJ7FsKiaK6bzYqb812hfycQ@example.com>
- User-agent: Mozilla/5.0 (X11; U; Linux x86_64; en-US; rv:1.9.1.16) Gecko/20110702 Icedove/3.0.11
Hi Dave,

On 22/07/11 11:18, Martin G wrote:
> In the course of my studies, I came across mention of this book, which lists 1000 kanji compounds useful for reading the news:
> http://www.amazon.com/dp/0804809194/
> I'm thinking about getting it, even though it's out of print and hasn't been updated in thirty years.

I think there are surely more up to date books. For example:

http://www.amazon.co.jp/dp/4789012824
http://www.amazon.co.jp/dp/4757413300

[Note: Neither of which I've used.]

> I know one can create graphs of trends in search terms on Google, and make lists(?) of most popular search terms. Which is why I have this

Google can create such a list because of people entering words in their search box.

However, since they have also created an index, if they wanted to find out how often a word appears, I suppose they could do that very easily.

> Would there be a way to:
> 1. Select a site, set of sites, or possibly an aggregate site to use as source material.
> 2. Set a start and end time to frame a span of time in which to select news articles.
> 3. Create a list of the most used compounds within that search criteria
> 4. This step might be a doozy - cross reference that list with WWWJDIC to get readings and definitions.
> 5. Output a CSV or text file or something with the compounds, readings, and definitions in three columns.
> ...?

I guess what you're suggesting isn't too hard and honestly, it would be interesting to know what words are popular. I think I can do the above steps for English, but not for Japanese news so I'll let someone else help you.

I'd probably do the above with a cronjob, wget, perl with some kind of hash table... How to cut Japanese text into compounds without breaking up sensible ones is something I don't know how to do and is perhaps the hardest part...

But some alternatives for you might be:

a) Some of the test collections here
http://research.nii.ac.jp/ntcir/permission/perm-en.html
have Japanese newspapers. I don't know how to get permission to access the collections, though -- you'll have to go read the agreement forms.

b) You might want to learn kanji through the Kanji Kentei tests. The order of kanji is slightly different from JLPT and with 10 or so levels, they're broken down into more manageable units. I don't know how old a Japanese child should be before s/he is expected to be able to read a newspaper [I can't remember my childhood], but surely you don't have to do all 10 levels...

Ray
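To make the "hash table" idea a bit more concrete, here is a rough sketch of the counting and CSV steps only (steps 3 and 5 of the list), written in Python rather than perl. It assumes the articles have already been downloaded and stripped to plain text. The names `KANJI_RUN`, `count_compounds` and `to_csv` are made up for this example, and treating every run of two or more kanji as one "compound" is a crude stand-in for real segmentation, not a solution to it. The WWWJDIC lookup for readings and definitions is not attempted, so the output has only two columns.

```python
import csv
import io
import re
from collections import Counter

# Assumption: a "compound" is any run of two or more characters from the
# CJK Unified Ideographs block. This is a crude stand-in for real word
# segmentation, which is the hard part mentioned in the mail.
KANJI_RUN = re.compile(r"[\u4e00-\u9fff]{2,}")


def count_compounds(articles):
    """Tally every kanji run found in an iterable of article strings."""
    counts = Counter()  # the hash table: run -> number of occurrences
    for text in articles:
        counts.update(KANJI_RUN.findall(text))
    return counts


def to_csv(counts, limit=1000):
    """Return the `limit` most frequent runs as CSV text (compound, count)."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["compound", "count"])
    writer.writerows(counts.most_common(limit))
    return out.getvalue()


if __name__ == "__main__":
    # Two made-up sentences standing in for downloaded news text.
    sample = [
        "政府は新しい経済政策を発表した。",
        "経済の回復について政府が説明した。",
    ]
    print(to_csv(count_compounds(sample), limit=5))
```

The sample also shows the weakness: 経済政策 is counted as a single four-character entry rather than as 経済 plus 政策, which is exactly the "breaking up sensible ones" problem in reverse. A proper segmenter would have to replace the regular expression.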