At Tue, 23 Aug 2016 13:24:15 +1000,
Jim Breen wrote:
> I see that Toshinori Sato, who has compiled the "neologd"
> extensions (there's one for unidic too) has added a lot of
> expanded terms which are not really morphemes. For example
> if I put ラテン文字で表記される into it, the unidic and ipadic
> segmentation is ラテン+文字+で+表記+さ+れる, but if I try
> it with neologd I get ラテン文字+で+表記+さ+れる. In other
> words he's added ラテン文字 as a unitary noun. If that's what
> you want, fine, and his work may well help apps which just
> want to add furigana to text, but it's getting right away from
> being a morphological analyzer.

Indeed. mecab-ipadic-neologd has many "long phrase" entries, it over

However, plain ipadic is also too old. For example, it has an entry
for "通商産業省" but not for the current name "経済産業省". unidic
doesn't have such entries either, so it interprets the name as
経済+産業+省. How to handle a "morpheme" is a policy problem.
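
To make the granularity question concrete, here is a toy sketch. It is
not how MeCab works (MeCab picks the lowest-cost path through a
lattice); it is a greedy longest-match segmenter, and the two small
lexicons are made up for illustration. It only shows that adding
compound entries to the dictionary changes where the boundaries fall.

```python
def segment(text, lexicon):
    """Greedy longest-match segmentation over a set of known words.

    Characters that start no known word become single-character tokens.
    """
    max_len = max(map(len, lexicon), default=1)
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, then shrink down to one character.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in lexicon:
                tokens.append(piece)
                i += size
                break
    return tokens


# Hypothetical mini-lexicons, not taken from any real dictionary.
SHORT_UNITS = {"ラテン", "文字", "で", "表記", "さ", "れる", "経済", "産業", "省"}
LONG_UNITS = SHORT_UNITS | {"ラテン文字", "経済産業省"}

for text in ("ラテン文字で表記される", "経済産業省"):
    print("+".join(segment(text, SHORT_UNITS)))
    print("+".join(segment(text, LONG_UNITS)))
```

With the short-unit lexicon the first sentence comes out as
ラテン+文字+で+表記+さ+れる, and with the compound entries added it
comes out as ラテン文字+で+表記+さ+れる.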

On the other hand, Toshinori Sato said that mecab-ipadic-neologd
performs better than plain ipadic on text classification tasks.
It's a really hard problem...

By the way, kakasi is a simple Kanji-Kana converter, so it can only
handle kanji-sequence words; a dictionary entry can't contain
non-kanji letters such as hiragana or katakana. For example, "赤ら顔"
is in ipadic and unidic, but kakasi can't handle it (it is split into
赤+ら+顔).
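
The kanji-only restriction can be sketched roughly as follows. This is
a toy model, not kakasi's actual code: the entry table is hypothetical,
and "kanji" here means only the basic CJK ideograph range. It shows why
an entry containing kana never gets a chance to match.

```python
import re

# Basic CJK ideograph range only; a simplification for this sketch.
KANJI = "\u4e00-\u9fff"
RUNS = re.compile(f"[{KANJI}]+|[^{KANJI}]+")
KANJI_ONLY = re.compile(f"[{KANJI}]+")


def usable_entries(entries):
    """Keep only the entries whose headword consists solely of kanji."""
    return {word: kana for word, kana in entries.items()
            if KANJI_ONLY.fullmatch(word)}


def split_runs(text):
    """Split text into alternating kanji and non-kanji runs."""
    return RUNS.findall(text)


# Hypothetical entry table for illustration.
ENTRIES = {"赤": "あか", "顔": "かお", "赤ら顔": "あからがお"}

print(sorted(usable_entries(ENTRIES)))  # the 赤ら顔 entry is dropped
print("+".join(split_runs("赤ら顔")))    # 赤+ら+顔
```

Because lookup only ever sees the kanji runs 赤 and 顔, the reading of
the whole word 赤ら顔 is out of reach in this model.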

