TLUG Mailing List

Re: [tlug] Search MySQL for Japanese Names]

Date: Thu, 29 Oct 2009 18:27:11 +0900

From: 黒鉄章 <akira@example.com>

Subject: Re: [tlug] Search MySQL for Japanese Names]

References: <5634e9210910191749m675cdf8cl3ca73efa0fcbeccb@example.com> <36e8d89d0910191858j2ba89691lb10648d0465fc109@example.com> <5634e9210910270038m4bbb9528hbec50722666a2007@example.com>
Hi Jim. Overseas conference? Lucky you~~

>> Jim, curious question: how many names in ENAMDICT resolve to just one
>> reading? Even a I-would-have-thought-surefire candidate for uniqueness
>> such as 田中(tanaka) resolves to ten different readings in ENAMDICT
>> (tanata, tanka, danaka, nunoka, ....). 鈴木(suzuki) has seven.

> So those ~74k merged entries come from ~205k "raw" entries, i.e. approx.
> 2.8 readings per entry for the 74k.

Okay, I got stuck into ENAMDICT with awk and sort. This is the spread
I found out of the 280,677 entries. (Places, organizations, full names
of famous individuals, and names that don't start with kanji are
stripped out. I treat given names and surnames as one.)

No. of readings   Count
1                 224,453 (80%)
2                  36,602 (13%)
3                  10,486 (4%)
4                   4,261 (2%)
5                   1,904 (<1%)
6                   1,071 (<1%)
..
(Higher counts omitted. If anyone's interested, the 'wa'-as-in-wafu
kanji (和) stands out as the most ridiculously overloaded name kanji,
with 54 readings.)
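
For anyone who wants to repeat the count without awk, here is a rough
Python sketch of the same idea. It is not the script I actually ran.
It assumes ENAMDICT lines look like "KANJI [reading] /gloss (codes)/",
that s, g, f, m and u are the personal-name type codes, and that an
entry is kept if it carries at least one of them. The function names
are made up for illustration.

```python
import re
from collections import Counter, defaultdict

# Assumed line shape: KANJI [reading] /gloss (codes)/.../
LINE_RE = re.compile(r"^(\S+) \[([^\]]+)\] /(.*)/$")
# Parenthesised, comma-separated lowercase codes inside the gloss, e.g. (p,s)
CODES_RE = re.compile(r"\(([a-z,]+)\)")
# Assumption: surname, given, female, male, unclassified personal name
PERSON_CODES = {"s", "g", "f", "m", "u"}


def starts_with_kanji(text):
    """True if the first character is in the CJK Unified Ideographs block."""
    return "\u4e00" <= text[0] <= "\u9fff"


def readings_per_name(lines):
    """Map each kanji name to the set of distinct readings seen for it."""
    readings = defaultdict(set)
    for line in lines:
        match = LINE_RE.match(line.strip())
        if not match:
            continue  # comment lines and kana-only entries have no [reading]
        name, reading, gloss = match.groups()
        if not starts_with_kanji(name):
            continue
        codes = set()
        for group in CODES_RE.findall(gloss):
            codes.update(group.split(","))
        if codes & PERSON_CODES:
            # Given names and surnames are pooled under the same key
            readings[name].add(reading)
    return readings


def histogram(readings):
    """Count how many names have 1 reading, 2 readings, and so on."""
    return Counter(len(r) for r in readings.values())
```

To run it over the real file you would pass an open file object, e.g.
opened with encoding="euc_jp" if your copy is the EUC-JP one, then print
the histogram sorted by number of readings.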

That's a much higher percentage of names with a single reading than I
expected  ^_^; Fewer than two percent have five or more, so pulling
Tanaka and Suzuki out of a hat as I did at the start was really
atypical sampling :(
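
The percentages can be rechecked from the counts in the table with a
few lines of Python. This is only a sketch, and it assumes the omitted
high-count rows account for everything left over from the 280,677
total.

```python
TOTAL = 280_677
# Counts per number of readings, copied from the table
counts = {1: 224_453, 2: 36_602, 3: 10_486, 4: 4_261, 5: 1_904, 6: 1_071}


def share(n, total=TOTAL):
    """Percentage of the total represented by n entries."""
    return 100.0 * n / total


# Assumption: whatever the table omits has seven or more readings
tail = TOTAL - sum(counts.values())
# Combined count for five or more readings under that assumption
five_plus = counts[5] + counts[6] + tail

for k, n in counts.items():
    print(f"{k:<3}{n:>9,}  {share(n):5.2f}%")
print(f"7+ {tail:>9,}  {share(tail):5.2f}%")
print(f"5+ {five_plus:>9,}  {share(five_plus):5.2f}%")
```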

My next interest would be spread of names in the real population. Who
knows how the results of the above would be weighted then...

>> ..Mecab/Chasen ... By design these parsers don't want multiple
>> readings for names. They just want the most likely one.
>
> Well, even then the coverage is poor.

No controversy there: the MeCab developer, Taku Kudo, said the same
to me last week when I happened to meet him.

Thanks Jim!

Akira