Mailing List Archive



Re: [tlug] oneliners, Was: Moving on from xterm



On 24 August 2016 at 09:14, NOKUBI Takatsugu <knok@example.com> wrote:
> At Tue, 23 Aug 2016 13:24:15 +1000,
> Jim Breen wrote:
>> I see that Toshinori Sato, who has compiled the "neologd"
>> extensions (there's one for unidic too) has added a lot of
>> expanded terms which are not really morphemes. For example
>> if I put ラテン文字で表記される into it, the unidic and ipadic
>> segmentation is ラテン+文字+で+表記+さ+れる, but if I try
>> it with neologd I get ラテン文字+で+表記+さ+れる. In other
>> words he's added ラテン文字 as a unitary noun. If that's what
>> you want, fine, and his work may well help apps which just
>> want to add furigana to text, but it's getting right away from
>> being a morphological analyzer.
>
> Indeed. mecab-ipadic-neologd has many "long phrase" entries that go
> beyond the morpheme.
>
> However, plain ipadic is also too old. For example, it has an entry for
> "通商産業省" but not for the ministry's current name "経済産業省" (the
> same is true of kakasidict).

Apart from its age, IPADIC also had/has problems with release permissions
dating back to its ICOT source. For that reason the people at NAIST built
a replacement "NAIST DIC". (https://en.osdn.jp/projects/naist-jdic/)

I started to fiddle with it, but was told by Yuji Matsumoto at
NAIST to forget it and concentrate on Unidic, which I have done. I see
NAIST-JDIC hasn't been updated for years.

> unidic doesn't have such entries, so it segments this as 経済+産業+省. How
> to handle "morphemes" is a policy problem.

And Unidic's policy is that a morphological analysis lexicon should
just contain morphemes. That means you won't find 日本語 in
Unidic as it's two morphemes. If you want longer formations there are
several chunkers to handle that problem.
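
To make the morphemes-then-chunker split concrete, here is a toy sketch in
Python. It is not how MeCab or any real chunker works: the segmenter is a
plain greedy longest-match (MeCab does a cost-based lattice search), and the
tiny MORPHEMES and COMPOUNDS sets are made up for illustration, not taken
from Unidic or neologd.

```python
def segment(text, lexicon):
    """Greedy longest-match segmentation (toy stand-in for a real analyzer)."""
    tokens = []
    max_len = max(len(w) for w in lexicon)
    i = 0
    while i < len(text):
        # Try the longest candidate first, then shrink.
        for n in range(min(max_len, len(text) - i), 0, -1):
            if text[i:i + n] in lexicon:
                tokens.append(text[i:i + n])
                i += n
                break
        else:
            # Unknown character: emit it as a single-character token.
            tokens.append(text[i])
            i += 1
    return tokens


def chunk(tokens, compounds):
    """Merge runs of adjacent tokens whose concatenation is a listed compound."""
    out = []
    i = 0
    while i < len(tokens):
        # Look for the longest run starting at i that forms a known compound.
        for j in range(len(tokens), i + 1, -1):
            candidate = "".join(tokens[i:j])
            if candidate in compounds:
                out.append(candidate)
                i = j
                break
        else:
            out.append(tokens[i])
            i += 1
    return out


# Hypothetical, hand-picked entries for illustration only.
MORPHEMES = {"ラテン", "文字", "で", "表記", "さ", "れる", "日本", "語"}
COMPOUNDS = {"ラテン文字", "日本語"}

tokens = segment("ラテン文字で表記される", MORPHEMES)
print("+".join(tokens))                    # ラテン+文字+で+表記+さ+れる
print("+".join(chunk(tokens, COMPOUNDS)))  # ラテン文字+で+表記+さ+れる
```

The point is only that the lexicon can stay morpheme-sized while a separate
pass decides which longer units an application wants.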

> On the other hand, Toshinori Sato said that mecab-ipadic-neologd
> performs better than plain ipadic on text classification tasks.
> It's a really hard problem...

What's a "text classification task"? For getting the right yomikata (aka furigana)
on a proper name, longer sequences can be useful, but there's a lot of
text analysis where the stuff Sato has added would cause quite some grief.
His addition of "中居正広のミになる図書館" as an entry is a hoot.

> By the way, kakasi is a simple Kanji-Kana converter, so it can only
> handle kanji-sequence words; a dictionary entry can't contain non-kanji
> characters such as hiragana or katakana. For example, "赤ら顔" is in ipadic
> and unidic, but kakasi can't handle it (it is split into 赤+ら+顔).

One of the several reasons why I have never bothered with it.
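
For what it's worth, the kanji-only restriction described above can be
illustrated with a few lines of Python. This is only a sketch of the idea,
not kakasi's actual code: the character range (CJK Unified Ideographs plus
the iteration mark 々) and the function names are my own assumptions.

```python
import re

# Assumption for illustration: "kanji" means U+4E00..U+9FFF plus 々 (U+3005).
KANJI = r"\u4e00-\u9fff\u3005"
KANJI_ONLY = re.compile(rf"[{KANJI}]+")
RUNS = re.compile(rf"[{KANJI}]+|[^{KANJI}]+")


def is_kanji_only(word):
    """True if every character of the word falls in the assumed kanji range."""
    return KANJI_ONLY.fullmatch(word) is not None


def kanji_runs(text):
    """Split text into alternating kanji and non-kanji runs."""
    return RUNS.findall(text)


print(is_kanji_only("経済産業省"))  # True
print(is_kanji_only("赤ら顔"))      # False: contains the hiragana ら
print(kanji_runs("赤ら顔"))         # ['赤', 'ら', '顔']
```

A dictionary limited to entries that pass such a check cannot hold 赤ら顔 as
one word, which is consistent with the 赤+ら+顔 split reported above.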

Cheers

Jim




-- 
Jim Breen
Adjunct Snr Research Fellow, Japanese Studies Centre, Monash University

