Major news publishers block Internet Archive over AI scraping fears

Major news publishers are restricting the Internet Archive’s access to their content, worried that AI companies might use the nonprofit’s digital library as a backdoor to scrape their articles for training data.

Andrew Deck reports for Nieman Lab that outlets including The Guardian and The New York Times have taken steps to limit how the Internet Archive can crawl and store their content.

The Guardian has excluded itself from the Internet Archive’s APIs and filtered its article pages from the Wayback Machine’s URL interface. Robert Hahn, the publisher’s head of business affairs and licensing, says access logs revealed frequent crawling by the Internet Archive. He is particularly concerned about the organization’s APIs, which he describes as “an obvious place” for AI businesses “to plug their own machines into and suck out the IP.”

The New York Times has gone further, actively blocking the Internet Archive’s crawlers and adding one of them to its robots.txt file. “The Wayback Machine provides unfettered access to Times content, including by AI companies, without authorization,” a Times spokesperson says.
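For readers unfamiliar with the mechanism: robots.txt is a plain-text file served from a site’s root that tells compliant crawlers which paths they may fetch. The sketch below shows the kind of entry a publisher might add; archive.org_bot is one user-agent string the Internet Archive’s crawlers are known to use, though which token a given publisher actually targets is not something the reporting specifies.

```
# Block the Internet Archive's crawler from the entire site.
User-agent: archive.org_bot
Disallow: /

# Leave all other crawlers unrestricted.
User-agent: *
Disallow:
```

Note that robots.txt is purely advisory: it only stops crawlers that choose to honor it, which is why some publishers pair it with active blocking.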

The Internet Archive operates crawlers to preserve the web, making snapshots accessible through the Wayback Machine. The nonprofit maintains a repository of over one trillion webpage snapshots. However, evidence shows these archives have been used for AI training. A 2023 Washington Post analysis of Google’s C4 dataset found the Wayback Machine ranked 187th among the 15 million domains the dataset draws on; C4 itself has been used to build Google’s T5 model and Meta’s Llama models.

Reddit blocked the Internet Archive last August, citing instances in which AI companies violated its platform policies by scraping Reddit data from the Wayback Machine. The Financial Times blocks bots from OpenAI, Anthropic, Perplexity, and the Internet Archive from accessing its paywalled content.

An analysis of 1,167 news websites shows 241 sites from nine countries explicitly disallow at least one Internet Archive crawling bot. Most of these sites (87%) are owned by USA Today Co., which added Internet Archive bots to its robots.txt files in 2025. In an October earnings call, CEO Mike Reed said the company blocked 75 million AI bots in September alone, with 70 million from OpenAI.
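Checks like the one behind this analysis can be reproduced with standard tooling. Here is a minimal sketch using Python’s built-in urllib.robotparser; the domain and the archive.org_bot user-agent token are illustrative assumptions, not details taken from the study.

```python
from urllib.robotparser import RobotFileParser

# Does this site's robots.txt disallow an Internet Archive crawler?
# "example.com" and "archive.org_bot" are placeholders; substitute
# the site and user-agent token you want to test.
rp = RobotFileParser("https://example.com/robots.txt")
rp.read()  # fetch and parse the live robots.txt

if rp.can_fetch("archive.org_bot", "https://example.com/article"):
    print("archive.org_bot is allowed")
else:
    print("archive.org_bot is disallowed")
```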

“If publishers limit libraries, like the Internet Archive, then the public will have less access to the historical record,” says Internet Archive founder Brewster Kahle.

Michael Nelson, a computer scientist at Old Dominion University, describes the situation as collateral damage. “Common Crawl and Internet Archive are widely considered to be the ‘good guys’ and are used by ‘the bad guys’ like OpenAI,” he says. “In everyone’s aversion to not be controlled by LLMs, I think the good guys are collateral damage.”

The Internet Archive does not currently disallow any specific crawlers through its own robots.txt file, including those of major AI companies. Kahle has, however, pointed to steps the Archive takes to restrict bulk access to its collections, including rate limiting and filtering mechanisms.
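Kahle did not describe those systems in detail. As a generic illustration of the rate-limiting half, and assuming nothing about the Archive’s actual implementation, a token bucket is the textbook way to cap how quickly any single client can pull content in bulk:

```python
import time

class TokenBucket:
    """Generic token-bucket rate limiter; a sketch, not the
    Internet Archive's actual code."""

    def __init__(self, rate: float, capacity: int):
        self.rate = rate          # tokens refilled per second
        self.capacity = capacity  # maximum burst size
        self.tokens = float(capacity)
        self.updated = time.monotonic()

    def allow(self, cost: int = 1) -> bool:
        """Return True if a request costing `cost` tokens may proceed."""
        now = time.monotonic()
        # Refill for the time elapsed since the last check, capped at capacity.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.updated) * self.rate)
        self.updated = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False

# Example: one client gets at most 5 requests/second, with bursts of 10.
limiter = TokenBucket(rate=5, capacity=10)
allowed = sum(limiter.allow() for _ in range(20))
print(f"{allowed} of 20 burst requests allowed")  # roughly the first 10
```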
