Mailing List Archive



Re: [tlug] Open source license (wikipedia)



On Fri, May 4, 2018 at 5:11 AM, Darren Cook <darren@example.com> wrote:
> As some background, we've been hacking on a Japanese tokenizer,
> https://github.com/rakuten-nlp/rakutenma, which is under MIT license.
> But they trained on BCCWJ, and Rakuten would have paid the extra 400,000
> yen to allow releasing that model (as "research results for commercial
> use"): http://pj.ninjal.ac.jp/corpus_center/bccwj/en/fee.html


I'm not a very authoritative source and I'm not familiar with that tokenizer...

But I guess the model and the software that uses the model are two
different things.  Even if you built a model with their program using
Wikipedia data, you still need their software to make use of the
model.

Is the file format of the model very general?  Or is it in binary form?

I've always wondered: what if a book is copyrighted but lacks an
index?  Then someone comes along and builds an index for that book,
making it more easily accessible to readers.  Has this person violated
copyright?  Or what if the index is very poorly made (we've seen
that before...some publishers are awful at deciding which keywords are
important...), and someone comes along and makes a better one?


> Google are able to sell n-gram data, with their own usage restrictions,
> that they have trawled from the Internet,


In a way, they probably justify this because they are already doing it
with their search engine.  They are constructing an index of
everyone's web pages (i.e., an index that didn't exist before).  They
may not be selling it to us, but they are certainly making money from
this information through advertising.

Anyway, sorry that I didn't help at all!  I do find what you're asking
interesting and would like to hear what others think.

Ray

