Harvard University has launched an AI training dataset containing nearly one million public domain books. According to Kate Knibbs, writing for Wired, the project is funded by Microsoft and OpenAI. Harvard's Institutional Data Initiative leads the effort, which aims to broaden access to high-quality training data for AI development.
The collection, roughly five times the size of the Books3 dataset, includes works by Shakespeare, Dickens, and Dante, alongside specialized texts in various languages. Greg Leppert, the Initiative’s executive director, says the project aims to “level the playing field” for smaller AI developers and researchers. The Initiative has asked Google to help distribute the dataset publicly, though release details are still being finalized. The Initiative is also working with the Boston Public Library to scan millions of public domain newspaper articles.