Hollywood scripts and subtitles confirmed in AI training data

A new investigation by Alex Reisner for The Atlantic reveals that major tech companies have used dialogue from more than 53,000 movies and 85,000 TV episodes to train their AI systems. The data, sourced from OpenSubtitles.org, includes dialogue from acclaimed series like “The Simpsons,” “The Wire,” and “Breaking Bad,” as well as Oscar-nominated films from 1950 to 2016. Companies including Apple, Anthropic, Meta, and Nvidia have used this subtitle dataset without permission from the original writers. The dataset, part of a larger collection called The Pile, has been circulating among AI developers since 2020. Tech companies argue that this use qualifies as “fair use,” but multiple lawsuits from writers and artists challenge that position, and courts have yet to rule on the matter.
