
Re: [tlug] Search MySQL for Japanese Names
Hi Jim. Overseas conference? ii ne~~
>> Jim, curious question: how many names in ENAMDICT resolve to just one
>> reading? Even a I-would-have-thought-surefire candidate for uniqueness
>> such as 田中(tanaka) resolves to ten different readings in ENAMDICT
>> (tanata, tanka, danaka, nunoka, ....). 鈴木(suzuki) has seven.
> So those ~74k merged entries come from ~205k "raw" entries, i.e. approx.
> 2.8 readings per entry for the 74k.
Okay, I got stuck into enamdict with awk and sort. This is the spread
I found out of the 280,677 entries. (Place, organization, full names
of famous individuals, and names that don't start with kanji are
stripped out. I treat given names and surnames as one.)
No. of readings   Count
1                 224,453 (80%)
2                  36,602 (13%)
3                  10,486 (4%)
4                   4,261 (2%)
5                   1,904 (<1%)
6                   1,071 (<1%)
..
(high counts omitted. If anyone's interested the 'wa'-as-in-wafu kanji
stands out as the most ridiculously overloaded kanji name with 54
readings.)
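For anyone who would rather not do this in awk, the same tally can be sketched in Python. This is not the script used for the numbers above, just a rough equivalent. It assumes EDICT-style lines of the form `HEADWORD [reading] /gloss (tags)/`, and the set of tags treated as place, organization, and full-name entries is my guess at the ENAMDICT codes, so check it against the file's own documentation. The names `readings_by_name` and `spread` are made up for the example.

```python
import re
from collections import Counter, defaultdict

# Assumed EDICT-style line: HEADWORD [reading] /gloss (tag,tag)/
LINE_RE = re.compile(r"^(\S+)\s+\[([^\]]+)\]\s+/(.*)/\s*$")
TAGS_RE = re.compile(r"\(([a-z,]+)\)")
# Assumed tag codes for place, organization, company, full name,
# station, and product entries. Verify against the ENAMDICT docs.
EXCLUDED = {"p", "o", "c", "h", "st", "pr"}


def starts_with_kanji(word):
    # Checks the CJK Unified Ideographs block only; a simplification.
    return "\u4e00" <= word[0] <= "\u9fff"


def readings_by_name(lines):
    """Map each kanji headword to the set of readings seen for it."""
    readings = defaultdict(set)
    for line in lines:
        m = LINE_RE.match(line)
        if not m:
            continue  # e.g. kana-only entries with no [reading] field
        head, reading, gloss = m.groups()
        tags = set()
        for group in TAGS_RE.findall(gloss):
            tags.update(group.split(","))
        # Drop entries whose tags are all in the excluded set.
        if not starts_with_kanji(head) or (tags and tags <= EXCLUDED):
            continue
        readings[head].add(reading)
    return readings


def spread(readings):
    """Map number of readings to how many headwords have that many."""
    return Counter(len(r) for r in readings.values())
```

Feeding it the open file (decoded to Unicode first) and printing `sorted(spread(readings_by_name(f)).items())` should give a table of the same shape as the one above. The exact counts will depend on how the filtering is done.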
That's a way higher percentage of names with a unique reading than I
expected ^_^; Fewer than two percent have five or more, so pulling
Tanaka and Suzuki out of a hat as I did at the start was really
non-typical sampling :(
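As a cross-check on the tail of the distribution, the shares can be recomputed from the counts in the table. A small sketch using only the figures reported above. The rows omitted from the table are taken to be the remainder of the 280,677 total, which is an assumption on my part. `share_with_at_least` is just a name picked for the example.

```python
# Counts copied from the table in this message.
TOTAL = 280_677
COUNTS = {1: 224_453, 2: 36_602, 3: 10_486, 4: 4_261, 5: 1_904, 6: 1_071}


def share_with_at_least(n):
    """Fraction of names with n or more readings, for n from 1 to 7."""
    # Subtract the names with fewer than n readings from the total, so
    # the omitted high-count rows are counted in the remainder.
    below = sum(count for k, count in COUNTS.items() if k < n)
    return (TOTAL - below) / TOTAL


for n in (2, 5):
    print(f"{n}+ readings: {share_with_at_least(n):.2%}")
```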
My next interest would be spread of names in the real population. Who
knows how the results of the above would be weighted then...
>> ..Mecab/Chasen ... By design these parsers don't want multiple
>> readings for names. They just want the most likely one.
>
> Well, even then the coverage is poor.
No controversy there: the MeCab developer, Taku Kudo, said
the same to me last week when I happened to meet him.
Thanks Jim!
Akira