How Common Crawl provides paywalled news articles for AI training

The nonprofit Common Crawl Foundation is supplying AI companies with copyrighted news articles scraped from behind paywalls, enabling firms like OpenAI and Google to train their large language models on high-quality journalism. The organization publicly states that it collects only freely available content; Alex Reisner, reporting for The Atlantic, shows that this claim is false.

According to the investigation, Common Crawl’s web scraper captures the full text of articles from major publishers before the websites’ paywall code can hide the content. This massive internet archive, which is crucial for the development of modern generative AI, contains millions of paywalled articles from sources such as The New York Times, The Wall Street Journal, and The Economist.
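
The mechanism Reisner describes is easiest to see with a small sketch: many paywalls are enforced client-side, meaning the server sends the full article in its initial HTML response and JavaScript hides it only after the page loads in a browser. A crawler that fetches and stores that raw HTML without ever executing scripts therefore keeps the complete text. The Python sketch below illustrates the general idea; the URL and CSS selector are hypothetical placeholders, and this is not Common Crawl’s actual crawler.

```python
# Minimal sketch: a plain HTTP fetch sees text that a client-side paywall
# would hide in a browser. URL and selector are hypothetical examples.

import requests
from bs4 import BeautifulSoup

ARTICLE_URL = "https://example.com/some-paywalled-article"  # hypothetical

# Fetch the raw HTML exactly as the server sends it. No JavaScript runs here,
# so a paywall script that would normally blank out the article in a browser
# never gets a chance to execute.
response = requests.get(ARTICLE_URL, timeout=10)
response.raise_for_status()

soup = BeautifulSoup(response.text, "html.parser")

# If the paywall is enforced client-side, the full article body is already
# present in the static markup and can simply be read out.
paragraphs = [p.get_text(strip=True) for p in soup.select("article p")]
print("\n\n".join(paragraphs))
```

Rendering the same page in a browser (or a headless browser) would execute the paywall script and hide the text; a crawler that skips script execution, as large-scale crawlers typically do, ends up storing the article intact.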

Furthermore, the foundation appears to deceive publishers who ask for their content to be removed. While Common Crawl assures publishers that it complies with such requests, Reisner’s research shows that the content remains in its archives; technical analysis of the archive’s file system suggests no data has been deleted in years. A search tool on the nonprofit’s website also seems designed to hide the fact that content from publishers who have submitted takedown requests is still present.

Common Crawl’s executive director, Rich Skrenta, told The Atlantic that publishers “shouldn’t have put your content on the internet if you didn’t want it to be on the internet.”
