Ethics of selling a classifier trained on open data

What are your thoughts on the ethics of productizing a classifier that was trained on an open dataset? Should all products built on open datasets be themselves open source?


I have been wondering about the same question.

One key issue to me is, what is a deep learning classifier? Is it data? Is it software? Both/neither?

In the licensing space, software and data are treated separately, but they seem to form parallel universes of a sort. For example, the copyleft license ODbL is similar to the GPL, while the Creative Commons Attribution license (CC BY) is closer to permissive software licenses like MIT or Apache 2.0. Perhaps the two families of licenses will converge in the future.

In the meantime, until such questions are more formally resolved, I think the ethical way of operating is that a downstream product of data (such as a classifier trained on it) should have a license that follows the same spirit as the license under which the data was released.


I’m not sure ethics needs to come into it. The license accompanying the open dataset should be specific enough to tell you whether your intended use and intended license are permitted. If it isn’t, you either err on the side of caution and avoid any use you are not sure you have the right to, or contact the dataset provider/administrator to clarify. I would treat the license on the data the same way I would if I were using the data in a standard relational database.

It is probably a much more ambiguous case with ‘real world’ data, like training a classifier on images from Google Images.

I can see a case in future, if not already happening, where dataset providers might place “trap streets” in their data to find copyright violators in a world of data and AI classifiers.
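To make the “trap streets” idea concrete, here is a minimal sketch of how a provider might plant fake records and later look for them in a suspect copy of the data. Everything in it is hypothetical: the record fields (`id`, `label`), the function names, and the use of an HMAC over a private key to generate the fakes are assumptions for illustration, not something any dataset provider is known to do. It also only detects copied data; spotting the traps through a trained classifier would need a different test.

```python
import hashlib
import hmac


def make_canary(secret: bytes, index: int) -> dict:
    """Derive one fake record deterministically from a private key (hypothetical scheme)."""
    digest = hmac.new(secret, f"canary-{index}".encode(), hashlib.sha256).hexdigest()
    return {
        "id": f"img_{digest[:12]}",                        # looks like an ordinary record id
        "label": f"class_{int(digest[12:14], 16) % 10}",   # pseudo-random label out of 10
    }


def plant_canaries(records: list, secret: bytes, count: int) -> list:
    """Return a copy of the dataset with `count` fake records appended."""
    return list(records) + [make_canary(secret, i) for i in range(count)]


def count_canaries(records: list, secret: bytes, count: int) -> int:
    """Count how many records in a suspect dataset match the provider's fakes."""
    expected = {
        (c["id"], c["label"])
        for c in (make_canary(secret, i) for i in range(count))
    }
    return sum((r.get("id"), r.get("label")) in expected for r in records)


# Usage: the provider releases `released` and keeps `secret` private.
original = [{"id": "img_real_1", "label": "class_3"}]
secret = b"provider-private-key"
released = plant_canaries(original, secret, count=5)
hits = count_canaries(released, secret, count=5)
```

Because the fakes are derived from the key, the provider does not need to store them; anyone without the key cannot tell them apart from real records by id alone.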


I think the question that should be answered here is: is a machine learning model that is trained on a particular dataset a derivative work of that dataset? If yes, then the rights holders of the dataset get to determine whether you can or cannot sell such a classifier.

And if they’re not considered derivative works, could a license on a dataset legally prevent you from making machine learning models using that dataset?

I recently read an article saying that people are starting to test the legality of these kinds of things. It seems like the lawyers have discovered deep learning too. :wink:

(BTW, I’m currently selling a specific implementation of machine learning models (using Metal on iOS), which include pre-trained weights made from ImageNet. But in my case the thing I’m selling is not the trained weights but the code that is needed to run them.)


I think this is a very interesting question with interesting implications for the future, especially if one follows the mantra of building a core model and fine-tuning it to a task. The two main scenarios I see are:
a) Open core + proprietary task data (likely for a company building tools for internal use, e.g. sales prediction, NLP customer satisfaction etc.)
b) Open core + open task data (likely for consumer products, e.g. )
There is of course also the closed-core option, but building your own ImageNet or wiki103 from closed data seems a bit silly.

I feel like an MIT-type license would work well for core data (do whatever you want, though it would be kind of cool to keep it open). No idea how hard it would be to enforce a GPL-style license for the core data.