TLUG Mailing List

Mailing List Archive

Re: [tlug] oneliners, Was: Moving on from xterm

Date: Wed, 24 Aug 2016 08:14:33 +0900

From: NOKUBI Takatsugu <knok@example.com>

Subject: Re: [tlug] oneliners, Was: Moving on from xterm

References: <20160819111442.GA30780@quadratic.cynic.net> <9f9cc5f579c92c3ddf7f29865d5862c2@jp.sometwo.net> <20160822114101.GA3944@fluxcoil.net> <87h9ace7zm.wl-knok@daionet.gr.jp> <CABHGxq4gBx39m0+TPZe3LLYPFetAvoc1wfZj0_0YGz3+w2A=1w@mail.gmail.com>

User-agent: Wanderlust/2.15.9 (Almost Unreal) SEMI-EPG/1.14.7 (Harue) FLIM/1.14.9 (Gojō) APEL/10.8 EasyPG/1.0.0 Emacs/24.4 (x86_64-pc-linux-gnu) MULE/6.0 (HANACHIRUSATO)
At Tue, 23 Aug 2016 13:24:15 +1000,
Jim Breen wrote:
> I see that Toshinori Sato, who has compiled the "neologd"
> extensions (there's one for unidic too) has added a lot of
> expanded terms which are not really morphemes. For example
> if I put ラテン文字で表記される into it, the unidic and ipadic
> segmentation is ラテン+文字+で+表記+さ+れる, but if I try
> it with neologd I get ラテン文字+で+表記+さ+れる. In other
> words he's added ラテン文字 as a unitary noun. If that's what
> you want, fine, and his work may well help apps which just
> want to add furigana to text, but it's getting right away from
> being a morphological analyzer.

Indeed. mecab-ipadic-neologd has many "long phrase" entries that go
beyond a single morpheme.

However, plain ipadic is also too old. For example, it has an entry
for "通商産業省" but not for the ministry's current name, "経済産業省"
(the same is true of kakasidict).
unidic doesn't have such entries, so it segments the name as
経済+産業+省. How to handle a "morpheme" is a question of policy.
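
To make the dictionary-policy point concrete, here is a toy sketch. It
is not how MeCab works (MeCab picks the lowest-cost path through a
lattice); it is only a greedy longest-match segmenter, and both
dictionaries are hypothetical stand-ins, not real ipadic, unidic or
neologd data. It shows how adding compound entries changes the output.

```python
def segment(text, dictionary):
    """Greedy longest-match segmentation (toy model, not MeCab's algorithm).

    Characters that match no dictionary entry are emitted one at a time.
    """
    max_len = max(map(len, dictionary), default=1)
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, then shrink down to one character.
        for length in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + length]
            if length == 1 or candidate in dictionary:
                tokens.append(candidate)
                i += length
                break
    return tokens


# Hypothetical toy dictionaries: BASE has only short units,
# EXTENDED adds "long phrase" entries in the neologd style.
BASE = {"ラテン", "文字", "で", "表記", "さ", "れる", "経済", "産業", "省"}
EXTENDED = BASE | {"ラテン文字", "経済産業省"}

for name, dictionary in (("base", BASE), ("extended", EXTENDED)):
    for text in ("ラテン文字で表記される", "経済産業省"):
        print(name, "+".join(segment(text, dictionary)))
```

With the base dictionary the compounds come out in pieces; with the
extended one they come out as single tokens. Neither is wrong by
itself, it depends on what the application wants.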

On the other hand, Toshinori Sato said that mecab-ipadic-neologd
performs better than plain ipadic on text classification tasks.
It's a really hard problem...

By the way, kakasi is a simple Kanji-Kana converter, so it can only
handle words that are pure kanji sequences; a dictionary entry can't
contain non-kanji characters such as hiragana or katakana. For
example, "赤ら顔" is in ipadic and unidic, but kakasi can't handle it
(it is split into 赤+ら+顔).
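
The effect can be imitated with a small toy sketch. This is not
kakasi's implementation, only an illustration under the assumption
that lookup happens on runs of kanji; the READINGS table and the
to_kana function are made up for the example.

```python
import re

# Split text into runs of kanji (CJK Unified Ideographs) and runs of anything else.
RUNS = re.compile(r"[\u4e00-\u9fff]+|[^\u4e00-\u9fff]+")

# Hypothetical toy dictionary: keys may contain kanji only.
READINGS = {"赤": "あか", "顔": "かお"}


def to_kana(text):
    """Look up each kanji run separately; pass everything else through."""
    pieces = RUNS.findall(text)
    kana = "".join(READINGS.get(piece, piece) for piece in pieces)
    return pieces, kana


pieces, kana = to_kana("赤ら顔")
print("+".join(pieces))  # 赤+ら+顔
print(kana)              # あからかお
```

Because the hiragana ら breaks the kanji run, an entry for the whole
word 赤ら顔 could never be matched, so the word-level reading
あからがお is out of reach in this model.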