Mailing List Archive



Re: [tlug] Search MySQL for Japanese Names



Sorry to be answering late. I was in Belgium at a conference and not in
a position to poke around in my files.

2009/10/20 黒鉄章 <akira@example.com>:
> Absolutely right. Mecab/Chasen dictionaries (IPADIC, Unidic, whichever
> one you plug into them) don't include anywhere near as many name
> readings as ENAMDICT. By design these parsers don't want multiple
> readings for names. They just want the most likely one.

Well, even then the coverage is poor.

> Jim, curious question: how many names in ENAMDICT resolve to just one
> reading? Even an I-would-have-thought-surefire candidate for uniqueness
> such as 田中 (tanaka) resolves to ten different readings in ENAMDICT
> (tanata, tanka, danaka, nunoka, ...). 鈴木 (suzuki) has seven.

Well, turning it round the other way, ~74k kanji-names have 2 or more
readings. I maintain the file in a single-reading format, so the seven
鈴木s are in seven different entries. The version used in WWWJDIC has
them merged together with an attempt to get the more common readings first.
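
To make the single-reading versus merged distinction concrete, here is a minimal sketch in Python. The (kanji, reading) tuple layout and the function name merge_readings are assumptions for illustration only, not the actual ENAMDICT file format or the tooling behind the WWWJDIC version. The sample readings are the ones quoted above. The sketch simply keeps readings in input order; ranking the more common readings first would need frequency information that is not shown here.

```python
# Hypothetical single-reading records: one (kanji, reading) pair per entry.
# This layout is an assumption, not the real ENAMDICT format.
raw_records = [
    ("田中", "tanaka"),
    ("田中", "tanata"),
    ("鈴木", "suzuki"),
    ("田中", "tanka"),
]


def merge_readings(records):
    """Collapse single-reading entries into one entry per kanji form.

    Readings keep the order in which they first appear in the input;
    duplicates are dropped.
    """
    merged = {}
    for kanji, reading in records:
        readings = merged.setdefault(kanji, [])
        if reading not in readings:
            readings.append(reading)
    return merged


merged = merge_readings(raw_records)
for kanji, readings in merged.items():
    print(kanji, ", ".join(readings))
```

With the sample data, four raw entries collapse into two merged entries, which is the same effect that takes the raw file down to the merged count.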

Some stats:

- raw data file: 728k entries
- version with merged entries: 597k entries.

So those ~74k merged entries come from ~205k "raw" entries (the ~131k
entries that disappear in merging plus the 74k that remain), i.e. approx.
2.8 readings per entry for the 74k.
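
The arithmetic behind that figure can be checked with a short Python sketch. The variable names are illustrative, and the counts are the rounded ones quoted in this message, so the result is approximate.

```python
# Rounded counts quoted in this message.
raw_entries = 728_000          # single-reading ("raw") entries
merged_entries = 597_000       # entries after merging by kanji form
multi_reading_kanji = 74_000   # kanji-names with 2 or more readings

# Every merge of k raw entries into one removes k - 1 entries,
# so the drop in entry count is the number of "extra" readings.
extra_readings = raw_entries - merged_entries

# Raw entries behind the multi-reading names: one per merged entry
# plus all of the extra readings.
raw_behind_multi = multi_reading_kanji + extra_readings

avg_readings = raw_behind_multi / multi_reading_kanji

print(extra_readings, raw_behind_multi, round(avg_readings, 1))
```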

Jim

-- 
Jim Breen
Adjunct Snr Research Fellow, Clayton School of IT, Monash University
Treasurer: Hawthorn Rowing Club, VCA Secondary School, Japanese Studies Centre
Graduate student: Language Technology Group, University of Melbourne

