Trip Venturella, a writer and MFA graduate, has built a small language model called Mr. Chatterbox, trained entirely on Victorian-era literature from the British Library. The model draws on 28,035 books published between 1837 and 1899, totaling roughly 2.93 billion tokens of training data.
Venturella used Andrej Karpathy’s nanochat framework and Claude Code, an AI coding assistant, to complete the project without a formal technical background. He rented GPU computing time on Vast.ai and trained the model from scratch rather than adapting an existing one. The finished model has 340 million parameters and takes up 2.05 gigabytes of storage.
Getting the model to hold a conversation proved far harder than the initial training. Venturella went through eight versions before arriving at a usable result, experimenting with dialogue extracted from novels, Oscar Wilde plays, and synthetic conversations generated by Claude Haiku and GPT-4o-mini. The synthetic data helped the model respond to modern questions, but also introduced some recognizably modern AI speech patterns.
Developer Simon Willison tested Mr. Chatterbox and described its responses as closer to a Markov chain than a modern language model. He noted that, according to established scaling research, a model of this size would likely need at least twice the training data it received to perform well as a conversational partner.
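Willison's back-of-the-envelope claim can be sanity-checked with a short sketch. This assumes the "established research" he alludes to is the Chinchilla-style scaling heuristic of roughly 20 training tokens per model parameter; the 20× figure is an assumption, not something stated in the article.

```python
# Hedged sanity check: does the 340M-parameter model fall short of the
# commonly cited ~20 tokens-per-parameter training heuristic?
params = 340e6        # Mr. Chatterbox's parameter count
tokens_used = 2.93e9  # tokens in the Victorian training corpus

tokens_needed = 20 * params           # rule-of-thumb target (assumption)
shortfall = tokens_needed / tokens_used

print(f"tokens needed: {tokens_needed / 1e9:.1f}B")  # 6.8B
print(f"shortfall vs. actual corpus: {shortfall:.1f}x")  # 2.3x
```

Under that assumption, the corpus would need to be a bit more than double its current size, consistent with Willison's "at least twice" estimate.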
The total project cost Venturella approximately $497. The model is available to try on Hugging Face.
Sources: Trip Venturella on Estragon, Simon Willison’s Weblog