Robin Sloan has published an essay examining the ethics of training large language models on the internet’s collected works, addressing fundamental questions about the morality and implications of this practice. The essay asks whether it is ethically acceptable to use humanity’s collective written works, which Sloan terms “Everything”, as training data for AI systems.
The analysis focuses on the central role of training data in AI development, arguing that approximately 50-90% of a language model’s capabilities stem directly from this collective knowledge base. Sloan emphasizes that there is currently no viable alternative to training AI on the internet’s vast repository of human-generated content.
The author presents a nuanced perspective, suggesting that the ethical acceptability depends largely on the application. According to Sloan, if AI systems primarily generate content that displaces human creators, the practice is problematic. However, if these systems enable significant scientific advances or provide broader societal benefits, the use of collective human knowledge becomes more justifiable.
The essay makes important distinctions between different AI applications. While Sloan expresses concerns about image generation models that merely produce more images, he views foundation models like Claude more favorably due to their diverse applications beyond content generation. He particularly endorses AI’s role in code generation and translation services.
A key argument presented is that the value of training data lies not just in its volume but in its authenticity – content created for genuine purposes rather than specifically for AI training. Sloan suggests that even well-funded attempts to create alternative training datasets would likely fall short of matching the richness of naturally occurring internet content.
The analysis acknowledges the tension between potential benefits and risks: AI companies might promise future scientific breakthroughs while simultaneously disrupting creative industries, yet the possibility of significant technological advances cannot be dismissed.
Sloan concludes by proposing a practical framework: AI applications that provide substantial public benefits can justify their use of collective human knowledge, while those that merely replicate existing content do not. This position acknowledges both the unprecedented nature of using humanity’s collective works and the potential for meaningful technological progress.