Mailing List Archive


Re: [tlug] How to make a current running kanji compound list from the news

Hi Dave,

On 22/07/11 11:18, Martin G wrote:
> In the course of my studies, I came across mention of this book, which
> lists 1000 kanji compounds useful for reading the news:
>
> I'm thinking about getting it, even though it's out of print and
> hasn't been updated in thirty years.

I think there are surely more up-to-date books. For example:

[Note: I've used neither of them.]

> I know one can create graphs of trends in search terms on Google, and
> make lists(?) of most popular search terms. Which is why I have this

Google can build such a list because people type words into its search box.

However, since it has also built an index, I suppose it could just as easily find out how often a word appears on the pages it has crawled.

> Would there be a way to:
> 1. Select a site, set of sites, or possibly an aggregate site to use
> as source material.
> 2. Set a start and end time to frame a span of time in which to select
> news articles.
> 3. Create a list of the most used compounds within those search criteria.
> 4. This step might be a doozy - cross-reference that list with WWWJDIC
> to get readings and definitions.
> 5. Output a CSV or text file or something with the compounds,
> readings, and definitions in three columns.
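
Steps 2, 3 and 5 above can be roughly sketched in Python. This is only an illustration under several assumptions: the articles are already downloaded as (date, text) pairs, a "compound" is approximated as any run of two or more kanji (a crude heuristic, not real segmentation), and the dictionary is a small hand-made dict standing in for a WWWJDIC lookup. The function names and sample data are made up for the example.

```python
import csv
import io
import re
from collections import Counter
from datetime import date

# Assumption: a "compound" is any run of two or more characters from the
# CJK Unified Ideographs block. This is a crude stand-in for segmentation.
KANJI_RUN = re.compile(r"[\u4e00-\u9fff]{2,}")


def top_compounds(articles, start, end, limit=1000):
    """Count kanji runs in articles dated within [start, end] (step 2 and 3)."""
    counts = Counter()
    for published, text in articles:
        if start <= published <= end:
            counts.update(KANJI_RUN.findall(text))
    return counts.most_common(limit)


def write_csv(ranked, dictionary, out):
    """Write compound, reading, definition rows (step 5).

    Compounds missing from the dictionary get empty reading/definition cells.
    """
    writer = csv.writer(out)
    for compound, _count in ranked:
        reading, definition = dictionary.get(compound, ("", ""))
        writer.writerow([compound, reading, definition])


# Hypothetical sample data, standing in for downloaded news articles.
articles = [
    (date(2011, 7, 20), "政府は経済の回復を発表した。経済は課題だ。"),
    (date(2011, 7, 21), "首相は経済について政府の方針を説明した。"),
    (date(2010, 1, 1), "この記事は期間外なので数えない。"),
]

# Hypothetical in-memory dictionary, standing in for a WWWJDIC lookup (step 4).
dictionary = {
    "経済": ("けいざい", "economy"),
    "政府": ("せいふ", "government"),
}

ranked = top_compounds(articles, date(2011, 7, 1), date(2011, 7, 31), limit=2)
buffer = io.StringIO()
write_csv(ranked, dictionary, buffer)
print(buffer.getvalue())
```

Writing to a real file instead of the in-memory buffer only needs open("compounds.csv", "w", newline="", encoding="utf-8") in place of io.StringIO(). The kanji-run heuristic will lump longer strings together and miss words written partly in kana, so the counts are only a rough guide.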

I guess what you're suggesting isn't too hard and, honestly, it would be interesting to know which words are popular. I think I could do the above steps for English, but not for Japanese news, so I'll let someone else help you.

I'd probably do the above with a cron job, wget, and Perl with some kind of hash table... How to cut Japanese text into compounds without breaking up sensible ones is something I don't know how to do, and it is perhaps the hardest part...
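
One common way to attack the cutting problem with nothing more than a hash table is greedy longest-match against a word list: at each position, take the longest string that appears in the list. The sketch below is in Python rather than Perl and assumes a tiny hand-made word list (in practice it would be loaded from a dictionary file); the function name and max_len parameter are invented for the example. A proper morphological analyzer such as MeCab would likely do better, so treat this as a sketch of the idea only.

```python
from collections import Counter


def longest_match_tokens(text, lexicon, max_len=4):
    """Greedy left-to-right longest match against a set of known words.

    Characters that do not start any known word are skipped.
    """
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first so that a long compound is not
        # broken into its shorter parts.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if candidate in lexicon:
                tokens.append(candidate)
                i += size
                break
        else:
            i += 1  # no known word starts here; move on one character
    return tokens


# Hypothetical word list; a set gives hash-table lookups.
lexicon = {"経済", "対策", "経済対策", "政府"}

text = "政府は経済対策を発表した"
tokens = longest_match_tokens(text, lexicon)
counts = Counter(tokens)  # the "hash table" of word frequencies
print(tokens)
```

Because the longest candidate is tried first, 経済対策 is kept whole here instead of being split into 経済 and 対策. Greedy matching can still pick the wrong boundary in ambiguous text, and it only finds words that are already in the list.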

But some alternatives for you might be:

a) Some of the test collections here have Japanese newspapers. I don't know how to get permission to access the collections, though -- you'll have to go read the agreement forms.

b) You might want to learn kanji through the Kanji Kentei tests. The order of kanji is slightly different from the JLPT, and with 10 or so levels they're broken down into more manageable units. I don't know how old a Japanese child should be before s/he is expected to be able to read a newspaper [I can't remember my childhood], but surely you don't have to do all 10 levels...

