
Mailing List Archive
Re: [tlug] How to make a current running kanji compound list from the news
Hi Dave,
On 22/07/11 11:18, Martin G wrote:
In the course of my studies, I came across mention of this book, which
lists 1000 kanji compounds useful for reading the news:
http://www.amazon.com/dp/0804809194/
I'm thinking about getting it, even though it's out of print and
hasn't been updated in thirty years.
I think there are surely more up-to-date books. For example:
http://www.amazon.co.jp/dp/4789012824
http://www.amazon.co.jp/dp/4757413300
[Note: I haven't used either of them.]
I know one can create graphs of trends in search terms on Google, and
make lists(?) of most popular search terms. Which is why I have this
Google can create such a list because of people entering
words in their search box.
However, since they have also created an index, if they
wanted to find out how often a word appears, I suppose they
could do that very easily.
Would there be a way to:
1. Select a site, set of sites, or possibly an aggregate site to use
as source material.
2. Set a start and end time to frame a span of time in which to select
news articles.
3. Create a list of the most used compounds matching those search criteria.
4. This step might be a doozy - cross reference that list with WWWJDIC
to get readings and definitions.
5. Output a CSV or text file or something with the compounds,
readings, and definitions in three columns.
...?
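
Steps 4 and 5 could look roughly like the Python sketch below. It is only
an illustration: the function name write_compound_csv, the sample counts,
and the small in-memory dictionary are all made up here, and the dictionary
merely stands in for whatever lookup data would really come from WWWJDIC.

```python
import csv
import io


def write_compound_csv(counts, dictionary, out):
    """Write compound, reading, definition rows, most frequent first."""
    writer = csv.writer(out)
    writer.writerow(["compound", "reading", "definition"])
    ranked = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)
    for compound, _ in ranked:
        # Compounds missing from the dictionary keep their row,
        # with the reading and definition columns left empty.
        reading, definition = dictionary.get(compound, ("", ""))
        writer.writerow([compound, reading, definition])


# Hypothetical inputs: invented frequencies and a tiny stand-in dictionary.
counts = {"経済": 12, "政府": 9, "対策": 4}
dictionary = {
    "経済": ("けいざい", "economy"),
    "政府": ("せいふ", "government"),
}

buffer = io.StringIO()
write_compound_csv(counts, dictionary, buffer)
print(buffer.getvalue())
```

Passing an open file instead of the StringIO buffer would write the same
three columns to disk.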
I guess what you're suggesting isn't too hard, and honestly,
it would be interesting to know which words are popular. I
think I could do the above steps for English, but not for
Japanese news, so I'll let someone else help you.
I'd probably do the above with a cron job, wget, and Perl with
some kind of hash table... How to cut Japanese text into
compounds without breaking up sensible ones is something I
don't know how to do, and it is perhaps the hardest part...
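
As a very rough illustration of the counting part (in Python rather than
Perl), the sketch below treats every run of two or more consecutive kanji
as one "compound" and tallies the runs in a Counter, which plays the role
of the hash table. This is a crude heuristic, not real segmentation, and
the sample sentences and the name count_kanji_runs are invented for the
example. A morphological analyzer such as MeCab would probably be needed
to split the text properly.

```python
import re
from collections import Counter

# Runs of two or more characters from the basic CJK Unified Ideographs block.
# Assumption: this range is "good enough"; it ignores the repetition mark
# and rarer extension blocks.
KANJI_RUN = re.compile(r"[\u4e00-\u9fff]{2,}")


def count_kanji_runs(texts):
    """Count every run of 2+ consecutive kanji across an iterable of strings."""
    counts = Counter()
    for text in texts:
        counts.update(KANJI_RUN.findall(text))
    return counts


# Made-up sentences standing in for downloaded news articles.
sample_articles = [
    "政府は経済対策を発表した。",
    "経済の回復には時間がかかる。",
]

counts = count_kanji_runs(sample_articles)
for compound, n in counts.most_common():
    print(compound, n)
```

Note the weakness: the run 経済対策 is counted as a single item and is
never split into 経済 and 対策, so it does not add to the count for 経済.
That is exactly the "cutting" problem.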
But some alternatives for you might be:
a) Some of the test collections here
http://research.nii.ac.jp/ntcir/permission/perm-en.html have
Japanese newspapers. I don't know how to get permission to
access the collections, though -- you'll have to go read the
agreement forms.
b) You might want to learn kanji through the Kanji Kentei
tests. The order of kanji is slightly different from JLPT
and with 10 or so levels, they're broken down into more
manageable units. I don't know how old a Japanese child
should be before s/he is expected to be able to read a
newspaper [I can't remember my childhood], but surely you
don't have to do all 10 levels...
Ray