tlug.jp Mailing List Archive
Re: [tlug] oneliners, Was: Moving on from xterm
- Date: Wed, 24 Aug 2016 12:11:58 +1000
- From: Jim Breen <jimbreen@example.com>
- Subject: Re: [tlug] oneliners, Was: Moving on from xterm
- References: <20160819111442.GA30780@quadratic.cynic.net> <9f9cc5f579c92c3ddf7f29865d5862c2@jp.sometwo.net> <20160822114101.GA3944@fluxcoil.net> <87h9ace7zm.wl-knok@daionet.gr.jp> <CABHGxq4gBx39m0+TPZe3LLYPFetAvoc1wfZj0_0YGz3+w2A=1w@mail.gmail.com> <87fupvdqna.wl-knok@daionet.gr.jp>
On 24 August 2016 at 09:14, NOKUBI Takatsugu <knok@example.com> wrote:
> At Tue, 23 Aug 2016 13:24:15 +1000,
> Jim Breen wrote:
>> I see that Toshinori Sato, who has compiled the "neologd"
>> extensions (there's one for unidic too), has added a lot of
>> expanded terms which are not really morphemes. For example,
>> if I put ラテン文字で表記される into it, the unidic and ipadic
>> segmentation is ラテン+文字+で+表記+さ+れる, but if I try
>> it with neologd I get ラテン文字+で+表記+さ+れる. In other
>> words he's added ラテン文字 as a unitary noun. If that's what
>> you want, fine, and his work may well help apps which just
>> want to add furigana to text, but it's getting right away from
>> being a morphological analyzer.
>
> Indeed. mecab-ipadic-neologd has many "long phrase" entries that
> go beyond single morphemes.
>
> However, plain ipadic is also too old. For example, it has an
> entry for "通商産業省" but not for the ministry's current name,
> "経済産業省" (the same is true of kakasidict).

Apart from its age, IPADIC also had/has problems with release
permissions dating back to its ICOT source. For that reason the
people at NAIST built a replacement, "NAIST-JDIC"
(https://en.osdn.jp/projects/naist-jdic/). I started to look at
and fiddle with it, but was told by Yuji Matsumoto at NAIST to
forget it and concentrate on Unidic, which I have done. I see
NAIST-JDIC hasn't been updated for years.

> unidic doesn't have such entries, so it analyzes the name as
> 経済+産業+省. How to handle "morphemes" is a policy question.

And Unidic's policy is that a morphological analysis lexicon
should contain only morphemes. That means you won't find 日本語
in Unidic, as it's two morphemes. If you want longer formations,
there are several chunkers to handle that problem.

> On the other hand, Toshinori Sato said that mecab-ipadic-neologd
> performs better than plain ipadic on a text classification task.
> It's a really hard problem...

What is meant by "text classification task"?
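As a rough illustration of the segmentation difference described at
the top of this message, here is a toy sketch. It is only an
assumption-laden stand-in: the segment() function and both
mini-lexicons are hypothetical and hand-made, and the greedy
longest-match strategy is a simplification. MeCab itself chooses a
path through a lattice using costs, so this shows only how a single
extra lexicon entry can change the output, not how MeCab works.

```python
# Toy illustration only: greedy longest-match segmentation over a tiny,
# hand-made lexicon. This is NOT MeCab's algorithm (MeCab uses a
# cost-based lattice search); it only shows the effect of one extra entry.

def segment(text, lexicon):
    """Split text greedily, always taking the longest matching lexicon entry."""
    tokens = []
    i = 0
    max_len = max(len(word) for word in lexicon)
    while i < len(text):
        # Try the longest candidate first, then shrink.
        for n in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + n]
            if piece in lexicon:
                tokens.append(piece)
                i += n
                break
        else:
            # No entry matches here: emit the single character as-is.
            tokens.append(text[i])
            i += 1
    return tokens

# Hypothetical mini-lexicons standing in for the real dictionaries.
MORPHEME_LEXICON = {"ラテン", "文字", "で", "表記", "さ", "れる"}
EXPANDED_LEXICON = MORPHEME_LEXICON | {"ラテン文字"}

sentence = "ラテン文字で表記される"
morpheme_result = segment(sentence, MORPHEME_LEXICON)
expanded_result = segment(sentence, EXPANDED_LEXICON)

print("+".join(morpheme_result))   # morpheme-only lexicon
print("+".join(expanded_result))   # lexicon with the added compound
```

With the morpheme-only lexicon the compound comes out as two tokens,
and with the added entry it comes out as one, which is the same
contrast as between the unidic/ipadic and neologd results quoted
above.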
For getting the right yomikata (aka furigana) on a proper name,
longer sequences can be useful, but there's a lot of text analysis
where the stuff Sato has added would cause quite some grief. His
addition of "中居正広のミになる図書館" as an entry is a hoot.

> By the way, kakasi is a simple Kanji-Kana converter, so it can
> only handle kanji-sequence words; a dictionary entry can't contain
> non-kanji characters such as hiragana or katakana. For example,
> "赤ら顔" is in ipadic and unidic, but kakasi can't handle it
> (it is split into 赤+ら+顔).

One of the several reasons why I have never bothered with it.

Cheers

Jim

--
Jim Breen
Adjunct Snr Research Fellow, Japanese Studies Centre, Monash University