
Mailing List Archive
[tlug] Open source license (wikipedia)
- Date: Thu, 3 May 2018 22:11:15 +0100
- From: Darren Cook <darren@example.com>
- Subject: [tlug] Open source license (wikipedia)
- User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:52.0) Gecko/20100101 Thunderbird/52.7.0
Semi-hypothetical question: If I take a bunch of text from Wikipedia
and make an md5 hash from it, can I release that hash under a CC0
or CC-BY license? Or am I legally obligated to release it under the
CC-BY-SA license?
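To make the hash case concrete, here is a minimal Python sketch. The
sample text is a placeholder standing in for an excerpt copied from
Wikipedia, and md5_hex is a hypothetical helper name, not something
from any project mentioned here. It only shows that the digest is a
fixed 32 hex characters whatever the input length; it says nothing
about the licensing question itself.

```python
import hashlib

# Placeholder standing in for text copied from Wikipedia (assumption).
text = "Tokyo is the capital of Japan."


def md5_hex(s: str) -> str:
    """Return the MD5 digest of s (UTF-8 encoded) as 32 hex characters."""
    return hashlib.md5(s.encode("utf-8")).hexdigest()


digest = md5_hex(text)
# The digest length is the same no matter how much text went in.
print(len(digest), digest)
print(len(md5_hex(text * 1000)))
```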
My real question, of course, is: can I train a machine learning model on
that text data and release it under a more liberal license, assuming
the model is effectively a one-way hash and cannot reproduce the
original data?
My hunch is yes, it is allowed; but I'd love a pointer to an
authoritative source.
As some background, we've been hacking on a Japanese tokenizer,
https://github.com/rakuten-nlp/rakutenma, which is under the MIT license.
But they trained on BCCWJ, and Rakuten presumably paid the extra 400,000
yen to allow releasing that model (as "research results for commercial
use"): http://pj.ninjal.ac.jp/corpus_center/bccwj/en/fee.html
If/when we do release our version, I think it'd be better to have it
come with a model built from open data, such as Wikipedia. And, if at
all possible, I want to stick with MIT/CC0/CC-BY licenses.
Google are able to sell n-gram data that they have trawled from the
Internet, with their own usage restrictions
(https://research.googleblog.com/2006/08/all-our-n-gram-are-belong-to-you.html),
which implies to me that you can release a statistical analysis of data
under any license you choose. But maybe it is more complicated than that?
Thanks,
Darren