Mailing List Archive



[tlug] Re: Japanese morphological analyzers (Was: Places where to apply to for a technical internship?)



On Jun 20, 2014, at 2:58 PM, Jim Breen <jimbreen@example.com> wrote:

> Looks interesting, but the biggest problem for me is they are
> using the old IPADIC morpheme dictionary. Until they get past
> saying that Unidic support is "experimental", I'll stick to MeCab.
> 
> Looks like they are using CRFs, as does MeCab. I guess Kuromoji
> is the way to go if you want Java. MeCab is C/C++. I'd like to do
> a side-by-side comparison some day, but they need to support
> Unidic first (the people who built IPADIC at NAIST advise you to
> use Unidic…)

Pardon my dredging up this old thread, but I thought this might be worth passing along.  

Apparently the UniDic license prohibits redistribution, so it probably won’t be used with Kuromoji/Lucene/Solr:

https://issues.apache.org/jira/browse/LUCENE-4056

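The license also prohibits commercial use without the permission of the copyright holders: 営利を目的として,UniDic ver.1.3.12 を利用する場合は,事前に著作権者と協議すること。 ("If you use UniDic ver. 1.3.12 for commercial purposes, consult the copyright holders in advance.")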

I’m curious about how others use open-source Japanese morphological analyzers with open-source databases.

From what I have read, the possibilities include Kuromoji with Solr/Lucene and MeCab with PostgreSQL (via textsearch_ja: http://textsearch-ja.projects.pgfoundry.org/textsearch_ja.html).

Is there some widely preferred combination that I haven’t found yet?  I know the big boys like Google and Yahoo use Basis Technology’s Rosette, but that’s a bit rich for my blood.

Drew


