
Re: [tlug] How to make a current running kanji compound list from the news

Hi Dave.

Firstly, this has probably been done by someone else already. But I admit that even with my interest in these matters I can't find a recent paper or the like with the data. As for it being a powerful learning aid ... well, the data alone isn't going to help. When I do similar projects myself, I tend to acquire ten times as much new programming knowledge as Japanese vocabulary and grammar.

Steps 3, 4, and 5 are quite possible, accepting that you will only get, say, 97% accuracy with natural Japanese language parsing. I do similar things already within the Chrome version of the Furigana Injector browser extension I developed.

Step 3 involves parsing the text into separate morphemes, for which I'd recommend MeCab; in your case you'd then discard the non-compound words from the resulting tokens. I expect the Python wrappers for MeCab would be a good choice, though I use the C API myself, so I can't confirm that they really work.
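Once you have the surface forms out of MeCab, the filtering and counting part is simple. A minimal sketch, assuming the tokens are already a list of strings (the sample tokens below are illustrative, not real MeCab output):

```python
from collections import Counter

def is_kanji(ch):
    # Basic CJK Unified Ideographs block; good enough for a sketch
    return "\u4e00" <= ch <= "\u9fff"

def compound_counts(tokens):
    """Count tokens that look like kanji compounds (two or more kanji)."""
    return Counter(t for t in tokens
                   if len(t) >= 2 and all(is_kanji(ch) for ch in t))

# Surface forms as a tokenizer might emit them for a short headline
tokens = ["政府", "は", "経済", "対策", "を", "発表", "経済"]
print(compound_counts(tokens).most_common(1))  # 経済 appears twice
```

Particles, kana-only words, and single kanji all drop out of the count automatically.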

For step 4, yes, WWWJDIC can be accessed via CGI, or you can load the dictionary files yourself. There are some hassles here regarding what to do if more than a single match is found, or if no exact match is found because some entry keys are concatenations of all the possible writings (e.g. "広い(P); 弘い; 廣い; 宏い"). Also, if there are truly no hits in EDICT, you can then, as a second step, try the 'ALL' dictionary (a.k.a. "combined jpn-eng"), which combines EDICT, ENAMDICT, the LifeSciences dictionary, the Finance dictionary, etc.
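For the concatenated-writings hassle, one approach is to index each entry under every one of its writings when loading the file. A sketch, assuming the raw EDICT line shape of WRITINGS [KANA] /glosses/ with semicolon-separated writings (the function name is mine):

```python
import re

def index_writings(entry, index):
    """Index one EDICT-style entry under each of its headword writings.
    Writings may carry priority markers like (P) that must be stripped."""
    head = entry.split(" [")[0]          # everything before the kana field
    for writing in head.split(";"):
        writing = re.sub(r"\([^)]*\)", "", writing).strip()  # drop (P) etc.
        index[writing] = entry.rstrip()
    return index

index = {}
index_writings("広い(P);弘い;廣い;宏い [ひろい] /(adj-i) spacious/vast/wide/(P)/",
               index)
print(sorted(index))  # all four writings resolve to the same entry
```

Then a compound from step 3 can be looked up directly, whichever variant writing the news article happened to use.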

As for step 5 ... occasionally compounds have multiple readings, so there isn't a one-to-one relationship, e.g. "何奴 【どいつ; どちつ; どやつ】". WWWJDIC concatenates them with semicolons, as you can see.
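For the CSV output, one simple convention is to keep that semicolon joining within the readings column. A sketch (the English gloss here is illustrative, not copied from EDICT):

```python
import csv
import io

def write_study_list(rows, out):
    """Write (compound, readings, definition) rows as three CSV columns,
    joining multiple readings with semicolons as WWWJDIC displays them."""
    writer = csv.writer(out)
    writer.writerow(["compound", "readings", "definition"])
    for compound, readings, definition in rows:
        writer.writerow([compound, ";".join(readings), definition])

buf = io.StringIO()
write_study_list([("何奴", ["どいつ", "どちつ", "どやつ"], "who?")], buf)
print(buf.getvalue())
```

Most spreadsheet programs will import that directly, as long as you keep the file in UTF-8.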

I have no experience with web-spidering, so I can't say anything about Steps 1 and 2.
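For the parts I can speak to, though, steps 3 to 5 string together roughly like this. This is only a skeleton: the whitespace split stands in for MeCab, and the dummy dictionary stands in for a WWWJDIC lookup.

```python
import csv
import io
from collections import Counter

def most_common_compounds(articles, top_n):
    """Step 3: count candidate compounds across article texts.
    Whitespace split is a stand-in; real text needs a parser like MeCab."""
    counts = Counter()
    for text in articles:
        counts.update(t for t in text.split() if len(t) >= 2)
    return [w for w, _ in counts.most_common(top_n)]

def build_study_csv(compounds, lookup, out):
    """Steps 4-5: resolve each compound and emit three CSV columns."""
    writer = csv.writer(out)
    writer.writerow(["compound", "reading", "definition"])
    for c in compounds:
        reading, definition = lookup(c)
        writer.writerow([c, reading, definition])

# Toy run with a dummy dictionary in place of WWWJDIC
dictionary = {"経済": ("けいざい", "economy"),
              "対策": ("たいさく", "countermeasure")}
top = most_common_compounds(["経済 対策 経済"], 2)
buf = io.StringIO()
build_study_csv(top, dictionary.get, buf)
print(buf.getvalue())
```

Swap in MeCab tokenization and a real dictionary index and you'd have the core of steps 2 through 5.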



On Fri, Jul 22, 2011 at 11:18 AM, Martin G <> wrote:

In the course of my studies, I came across mention of this book, which
lists 1000 kanji compounds useful for reading the news:

I'm thinking about getting it, even though it's out of print and
hasn't been updated in thirty years.

However, I got to thinking about it, and wondered if with all the
modern tools and the fact that almost all news is online, surely
(hopefully) there would be a way to scan news sites for the most
common compounds and make a spreadsheet of them.

I know one can create graphs of trends in search terms on Google, and
make lists(?) of most popular search terms. Which is why I have this
vague notion that something similar could be constructed out of
existing tools, if it doesn't exist already (I searched but came up
with nothing, though I may not be describing it right).

Anyway, I think TLUG are the go-to guys for this, sitting on the nexus
of internet, coding, and Japanese knowledge.

Would there be a way to:
1. Select a site, set of sites, or possibly an aggregate site to use
as source material.
2. Set a start and end time to frame a span of time in which to select
news articles.
3. Create a list of the most used compounds within those search criteria.
4. This step might be a doozy - cross reference that list with WWWJDIC
to get readings and definitions.
5. Output a CSV or text file or something with the compounds,
readings, and definitions in three columns.

I had a PHP code thing that would search within one body of text, pull
out words, and create a study list... it's been years since I've
touched it, so I have to look for it, but if it might help I'll see if
I can dig it up.

What do you guys think? Could be a powerful learning aid.

Any advice would be much appreciated.

Dave M G
