Mailing List Archive



Re: [tlug] Search MySQL for Japanese Names



2009/10/29 黒鉄章 <akira@example.com>:
> Okay, I got stuck into enamdict with awk and sort. This is the spread
> I found out of the 280,677  entries. (Place, organization, full names
> of famous individuals, and names that don't start with kanji are
> stripped out. I treat given names and surnames as one.)
>
> No. of readings vs count
> 1  224,453 (80%)
> 2   36,602 (13%)
> 3   10,486 (4%)
> 4    4,261 (2%)
> 5    1,904 (<1%)
> 6    1,071 (<1%)
> ..
> (high counts omitted. If anyone's interested the 'wa'-as-in-wafu kanji
> stands out as the most ridiculously overloaded kanji name with 54
> readings.)

Bizarre, but that's Japanese for you.
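
For anyone who wants to reproduce that kind of tally without awk, here is a
minimal Python sketch. It assumes each ENAMDICT line looks like
"KANJI [reading] /gloss (type)/". The function names and the sample lines are
made up for illustration, and the filtering of place, organization and
full-name entries described above is left out.

```python
import re
from collections import Counter, defaultdict

# Assumed line shape: HEADWORD [reading] /gloss (type)/
# Lines without a bracketed reading (kana-only entries) do not match
# and are skipped.
LINE_RE = re.compile(r"^(\S+) \[([^\]]+)\] /(.*)/$")


def starts_with_kanji(headword):
    """True if the first character is in the main CJK ideograph block."""
    return "\u4e00" <= headword[0] <= "\u9fff"


def readings_per_name(lines):
    """Map each kanji headword to the set of distinct readings seen."""
    readings = defaultdict(set)
    for line in lines:
        match = LINE_RE.match(line.strip())
        if not match:
            continue
        headword, reading, _gloss = match.groups()
        if starts_with_kanji(headword):
            readings[headword].add(reading)
    return readings


def spread(readings):
    """Count how many headwords have 1, 2, 3, ... readings."""
    return Counter(len(r) for r in readings.values())


# Hypothetical sample lines, not real counts from the file.
SAMPLE = [
    "田中 [たなか] /Tanaka (s)/",
    "和 [かず] /Kazu (g)/",
    "和 [なごみ] /Nagomi (f)/",
    "和 [やまと] /Yamato (g)/",
    "あきら /Akira (g)/",
]

if __name__ == "__main__":
    for n_readings, count in sorted(spread(readings_per_name(SAMPLE)).items()):
        print(n_readings, count)
```

On the real file you would pass the opened file object (decoded from whatever
encoding your copy uses) in place of SAMPLE.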

> That's a way higher percentage of uniquely-reading names than I
> expected  ^_^; Less than one percent have five or more, so pulling
> Tanaka and Suzuki out of a hat as I did at the start was really
> non-typical sampling :(

I've noticed that the more common names are the ones which tend to have
the greater number of readings. Figures, I suppose.

> My next interest would be spread of names in the real population. Who
> knows how the results of the above would be weighted then...

That's hard data to get, too. When I was at Tokyo Gaidai they had access to
a full copy of the NTT directory. It would have been nice to do some
frequency measures on names, and geographical dispersions on
family names, but there was an embargo on any publications
drawing on the data. They said it was because of "privacy".

>>> ..Mecab/Chasen ... By design these parsers don't want multiple
>>> readings for names. They just want the most likely one.
>>
>> Well, even then the coverage is poor.
>
> No controversy there- the Mecab developer, Taku Kudo, said
> the same to me last week when I happened to meet him.

In a year or so I'll be working on a major expansion of the lexicon(s)
used by MeCab et al. I'll probably be starting with NAIST-JDIC.  I'm
less interested in correct POS tagging and more in correctly
identifying compounds. I want 米軍 to be recognized; not come up
as 米 + 軍.
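
To show why lexicon coverage matters here, the following is a toy sketch
using greedy longest-match lookup. This is not how MeCab works (it scores
a lattice of candidates with costs); the segment() function and the tiny
lexicons are hypothetical and only illustrate that a compound can come out
as a single token only if the lexicon contains it.

```python
def segment(text, lexicon):
    """Greedy longest-match segmentation against a set of known words.

    Unknown characters fall back to single-character tokens.
    """
    max_len = max(map(len, lexicon), default=1)
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, shrinking down to one character.
        for n in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + n]
            if n == 1 or candidate in lexicon:
                tokens.append(candidate)
                i += n
                break
    return tokens


# Hypothetical lexicons: one without the compound, one with it added.
base_lexicon = {"米", "軍", "基地"}
expanded_lexicon = base_lexicon | {"米軍"}

if __name__ == "__main__":
    print(segment("米軍基地", base_lexicon))      # ['米', '軍', '基地']
    print(segment("米軍基地", expanded_lexicon))  # ['米軍', '基地']
```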

Cheers

Jim

-- 
Jim Breen
Adjunct Snr Research Fellow, Clayton School of IT, Monash University
Treasurer: Hawthorn Rowing Club, VCA Secondary School, Japanese Studies Centre
Graduate student: Language Technology Group, University of Melbourne

