Legal to machine-learn text from web?

Is it legal to machine-learn texts from websites that Google finds, and from resources like Common Crawl? My intent is to expose bias/discrimination; there is no commercial intent. I'm an indie researcher, not affiliated with any institution or company. My librarian's advice is that I need written permission, but the exercise is bound to involve too many sources to seek permission from each.

Does it make any difference which country I reside in, or which country my study focuses on?

I've found lots of arguments and discussions on the web but no definitive answer. Some suggest that copyright is infringed if the material is used beyond the author's original expression, since the original meaning is transformed once machine-learned.

Any pointers will be greatly appreciated, as this is a deciding factor in whether I can go ahead. Cheers.


From an academic perspective, this is a fascinating and complex question. It is also largely unresolved. In the U.S., at least, we do not have a sufficiently large body of case law applying copyright law to machine learning uses for anyone to delineate a bright line between what is legal and what is not. (Consider, too, that other countries' legal systems also apply when you are sampling text from across the web.)

If you have an academic interest in this area, you will find insightful legal journal articles examining the various facets of the issue. For example:

Artificial Intelligence’s Fair Use Crisis

(Bear in mind, this is an article written for a legal academic audience, so it may not be an easy read for everyone.)

From a practical perspective, you should ask yourself what will happen if you sample a huge volume of texts from the internet to train a model for your project. If you’re not using this for text generation, you aren’t running any risk of reproducing the original works. Would the copyright owners know if their material has been digested into a mathematical model? And seeing that you have a non-commercial use, would they be likely to pursue legal action even if they knew?


I think it is generally legal in the US, though I'm not sure about other countries. I previously worked at a major hedge fund that scraped at scale; they were very risk-averse and wouldn't have done it if it were illegal. Along the same lines, OpenAI's GPT-2 (as I understand it) was trained on a very large web-scraped dataset.

Hmm, I do find text generation very inviting. In last year's Part 1, Jeremy showed how a machine could write abstracts just like that. That was pretty cool.

The most instructive resource on dos and don'ts I've found so far is:

but that covers the UK and Europe, and I'm no longer there.

In the paper, Semantics derived automatically from language corpora contain human-like biases, published in Science, the authors used Common Crawl.

Common Crawl is such a vast sea of data that it is unlikely anyone could identify the source unless it is explicitly named. Depending on the work and the publication, though, authors could be expected to cite the source for the credibility/reproducibility of the work.
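Incidentally, each Common Crawl capture does carry its source: the CDX index server returns one JSON object per line with fields such as `url`, `filename`, `offset`, and `length`, so provenance is recoverable per record. A minimal sketch of pulling that out (the sample record below is hypothetical; only the field names follow the index format):

```python
import json

def record_provenance(index_line: str) -> dict:
    """Extract the source URL and WARC location from one Common Crawl
    CDX index record (a JSON object serialized on a single line)."""
    rec = json.loads(index_line)
    return {
        "url": rec["url"],            # original page address
        "warc": rec["filename"],      # WARC file holding the capture
        "offset": int(rec["offset"]), # byte offset of the record
        "length": int(rec["length"]), # compressed record length
    }

# Hypothetical index line, shaped like a CDX index response
sample = json.dumps({
    "url": "https://example.com/article",
    "filename": "crawl-data/CC-MAIN-2021-04/segments/example.warc.gz",
    "offset": "1234",
    "length": "5678",
})
print(record_provenance(sample)["url"])
```

So if a publication requires citing sources, the crawl data itself makes that possible; whether anyone traces a model back to a particular page is a separate question.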

As far as I understand, Common Crawl and Google search give us similar data (within matching time frames). It may be an unjustified notion, but Common Crawl feels like a safer umbrella to hide under when it comes to legality.

Would copyright owners be interested in pursuing legal action? Well, in the extreme case (not that I plan to), say I showed that one newspaper is more biased/discriminatory than the rest, they might.