Mailing List Archive
Re: [tlug] Search MySQL for Japanese Names]
- Date: Thu, 29 Oct 2009 18:27:11 +0900
- From: 黒鉄章 <akira@example.com>
- Subject: Re: [tlug] Search MySQL for Japanese Names]
- References: <5634e9210910191749m675cdf8cl3ca73efa0fcbeccb@example.com> <36e8d89d0910191858j2ba89691lb10648d0465fc109@example.com> <5634e9210910270038m4bbb9528hbec50722666a2007@example.com>
Hi Jim. Overseas conference? ii ne~~

>> Jim, curious question: how many names in ENAMDICT resolve to just one
>> reading? Even a I-would-have-thought-surefire candidate for uniqueness
>> such as 田中(tanaka) resolves to ten different readings in ENAMDICT
>> (tanata, tanka, danaka, nunoka, ....). 鈴木(suzuki) has seven.

> So those ~74k merged entries come from ~205 "raw" entries, i.e. approx.
> 2.8 readings per entry for the 74k.

Okay, I got stuck into enamdict with awk and sort. This is the spread I
found out of the 280,677 entries. (Place, organization, full names of
famous individuals, and names that don't start with kanji are stripped
out. I treat given names and surnames as one.)

No. of readings   Count
1                 224,453 (80%)
2                  36,602 (13%)
3                  10,486 (4%)
4                   4,261 (2%)
5                   1,904 (<1%)
6                   1,071 (<1%)
..

(High counts omitted. If anyone's interested, the 'wa'-as-in-wafu kanji
stands out as the most ridiculously overloaded kanji name, with 54
readings.)

That's a way higher percentage of uniquely-reading names than I
expected ^_^; Less than one percent have five or more, so pulling Tanaka
and Suzuki out of a hat as I did at the start was really non-typical
sampling :(

My next interest would be the spread of names in the real population.
Who knows how the results of the above would be weighted then...

>> ..Mecab/Chasen ... By design these parsers don't want multiple
>> readings for names. They just want the most likely one.
>
> Well, even then the coverage is poor.

No controversy there- the Mecab developer, Taku Kudo, said the same to
me last week when I happened to meet him.

Thanks Jim!

Akira
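For anyone who wants to repeat the count, here is a rough Python sketch of the same kind of tally. It is not the awk and sort pipeline used above, which was not posted. It assumes EDICT-style lines of the form `HEADWORD [READING] /(tags) gloss/`, and it assumes the tags `p`, `o` and `h` mark places, organizations and full names; the function name `reading_spread` and the choice to drop any entry carrying one of those tags are illustrative guesses, so the totals may not match the figures in this message.

```python
import re
from collections import Counter, defaultdict

# Assumed EDICT-style line: HEADWORD [READING] /(tags) gloss/
ENTRY = re.compile(r"^(\S+) \[([^\]]+)\] /\(([^)]*)\)")

# Assumed tag meanings: p = place, o = organization, h = full name.
SKIP_TAGS = {"p", "o", "h"}


def is_kanji(ch):
    """True if ch is in the basic CJK Unified Ideographs block."""
    return "\u4e00" <= ch <= "\u9fff"


def reading_spread(lines):
    """Map 'number of distinct readings' -> 'number of headwords'."""
    readings = defaultdict(set)
    for line in lines:
        m = ENTRY.match(line)
        if not m:
            continue  # kana-only headwords carry no [READING] part
        head, reading, tags = m.groups()
        if not is_kanji(head[0]):
            continue  # keep only names that start with kanji
        if SKIP_TAGS & {t.strip() for t in tags.split(",")}:
            continue  # drop places, organizations, full names
        # Surnames and given names are pooled under one headword.
        readings[head].add(reading)
    return Counter(len(r) for r in readings.values())


def print_spread(spread):
    """Print each reading count with its share of all headwords."""
    total = sum(spread.values())
    for n in sorted(spread):
        print(f"{n}\t{spread[n]:,}\t{100 * spread[n] / total:.1f}%")
```

To run it over a local copy, something like `print_spread(reading_spread(open("enamdict", encoding="euc-jp")))` should work, assuming the file is the EUC-JP distribution; adjust the path and encoding to whatever copy you have.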
- Follow-Ups:
- Re: [tlug] Search MySQL for Japanese Names]
- From: Jim Breen
- References:
- Re: [tlug] Search MySQL for Japanese Names]
- From: Jim Breen
- Re: [tlug] Search MySQL for Japanese Names]
- From: 黒鉄章
- Re: [tlug] Search MySQL for Japanese Names]
- From: Jim Breen