[tlug] Open source license (wikipedia)



Semi-hypothetical question: If I take a bunch of text from Wikipedia
and compute an MD5 hash of it, can I release that hash under a CC0
or CC-BY license? Or am I legally obligated to release it under the
CC-BY-SA license?
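
For concreteness, the hash case could look like the minimal Python
sketch below. The helper name md5_of_text and the sample string are
made up for illustration; the sample is not actual Wikipedia text.

```python
import hashlib


def md5_of_text(text: str) -> str:
    """Return the hex MD5 digest of `text`, encoded as UTF-8."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()


# Placeholder input standing in for a large body of article text.
sample = "Tokyo is the capital of Japan."
digest = md5_of_text(sample)

# The digest is always 32 hex characters, however long the input is,
# so the original text cannot be read back out of it.
print(len(digest), digest)
```

Whatever the size of the input, the output is a fixed 128 bits, which
is the sense of "one-way" the question relies on.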

My real question, of course, is: can I train a machine learning model on
that text data and release it under a more liberal license, assuming
the model is effectively a one-way hash and cannot reproduce the
original data?

My hunch is yes, it is allowed; but I'd love a pointer to an
authoritative source.

As some background, we've been hacking on a Japanese tokenizer,
https://github.com/rakuten-nlp/rakutenma, which is under the MIT license.
But its model was trained on BCCWJ, and Rakuten would presumably have
paid the extra 400,000 yen to be allowed to release that model (as
"research results for commercial use"):
http://pj.ninjal.ac.jp/corpus_center/bccwj/en/fee.html

If/when we do release our version, I think it'd be better to have it
come with a model built from open data, such as Wikipedia. And, if at
all possible, I want to stick with MIT/CC0/CC-BY licenses.

Google are able to sell n-gram data that they have trawled from the
Internet, with their own usage restrictions
(https://research.googleblog.com/2006/08/all-our-n-gram-are-belong-to-you.html),
which implies to me that you can release a statistical analysis of data
under any license you choose. But maybe it is more complicated than that?
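
By "statistical analysis" I mean something like the n-gram counts in
this rough Python sketch. It is only an illustration: char_ngrams is a
hypothetical helper, the input string is a placeholder, and this is not
how Google or rakutenma actually build their data.

```python
from collections import Counter


def char_ngrams(text: str, n: int) -> Counter:
    """Count every overlapping character n-gram in `text`."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


# Placeholder input standing in for a large corpus.
corpus = "abab"
bigrams = char_ngrams(corpus, 2)

# Only the n-grams and their frequencies are kept, not the running text.
for gram, count in bigrams.most_common():
    print(gram, count)
```

A table like this keeps short fragments and their counts but not the
running text. How much of the source could be reconstructed from it
depends on n and on the size of the corpus.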

Thanks,

Darren
