AI models silently corrupt documents in multi-step workflows, study finds

A new study from Microsoft Research reveals that large language models (LLMs) silently corrupt documents during extended, multi-step workflows, often in ways that are nearly impossible for humans to detect. Ben Dickson reports for VentureBeat that even the best-performing AI models degrade an average of 25% of document content across these workflows.

The research team developed a benchmark called DELEGATE-52 to measure this problem. It simulates automated workflows across 52 professional domains, including financial accounting, software engineering, and music notation. The benchmark tests 19 different AI models from major developers including OpenAI, Anthropic, and Google.

The method works by giving models a document editing task, then asking them to reverse it in a separate session. Because the model has no memory of the first session, any differences between the original and the restored document reveal corruption. Across 20 consecutive editing interactions, documents suffered an average degradation of 50% across all models tested.
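The round-trip idea can be illustrated with a minimal sketch. This is not the benchmark's code: the `edit` and `undo` callables are hypothetical stand-ins for two separate model sessions, and the similarity score from Python's standard `difflib` is an assumed substitute for whatever metric the study actually uses.

```python
from difflib import SequenceMatcher
from typing import Callable


def degradation(original: str, restored: str) -> float:
    """Return 1 minus difflib's similarity ratio (an assumed stand-in metric)."""
    matcher = SequenceMatcher(None, original, restored, autojunk=False)
    return 1.0 - matcher.ratio()


def round_trip(document: str,
               edit: Callable[[str], str],
               undo: Callable[[str], str]) -> float:
    """Apply an edit, then its inverse, and score how far the result drifted."""
    return degradation(document, undo(edit(document)))


# Toy stand-ins for model sessions (hypothetical, for illustration only).
def rename_forward(text: str) -> str:
    return text.replace("Revenue", "Sales")


def rename_back(text: str) -> str:
    return text.replace("Sales", "Revenue")


def rename_back_lossy(text: str) -> str:
    # Simulates a silent corruption: one figure is altered during the undo step.
    return rename_back(text).replace("120", "12")


doc = "Revenue Q1: 100\nRevenue Q2: 120\n"
faithful_score = round_trip(doc, rename_forward, rename_back)
lossy_score = round_trip(doc, rename_forward, rename_back_lossy)
print(f"faithful: {faithful_score:.3f}, lossy: {lossy_score:.3f}")
```

A faithful round trip scores zero, while the lossy one scores slightly above zero even though the document still looks almost identical, which is the kind of silent change the study describes.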

Critically, the damage does not accumulate gradually. Around 80% of total degradation comes from sudden, catastrophic failures where a model loses at least 10% of document content in a single interaction. Frontier models do not avoid these failures entirely. They simply delay them to later rounds.
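To make the distinction between gradual and catastrophic loss concrete, here is a hedged sketch that splits total degradation by the size of each single-round drop. The per-round scores are invented for illustration, and reading "at least 10% of document content" as a drop of 0.10 or more in a preservation score is an assumption, not the study's stated definition.

```python
from typing import Sequence


def catastrophic_share(scores: Sequence[float], threshold: float = 0.10) -> float:
    """Fraction of total loss that comes from single-round drops >= threshold.

    `scores` holds a preservation score per round (1.0 means fully intact).
    """
    # Per-round loss; rounds where the score recovers count as zero loss.
    drops = [max(0.0, prev - cur) for prev, cur in zip(scores, scores[1:])]
    total = sum(drops)
    if total == 0:
        return 0.0
    return sum(d for d in drops if d >= threshold) / total


# Invented trajectory: slow drift punctuated by two sharp failures.
example_scores = [1.00, 0.99, 0.98, 0.80, 0.79, 0.60]
share = catastrophic_share(example_scores)
print(f"share of loss from catastrophic rounds: {share:.3f}")
```

In this made-up trajectory, two of the five rounds account for most of the loss, which mirrors the pattern the researchers report.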

How the failures manifest also depends on the model. Weaker models tend to delete content outright. More advanced models, by contrast, rewrite content in subtly distorted ways. The text is still there but has been quietly altered, which makes the errors far harder for a human reviewer to catch.

Philippe Laban, Senior Researcher at Microsoft Research and co-author of the study, noted that giving models access to agentic tools, such as code execution and file access, actually worsens performance by an additional 6%. “Models lack the capability to write effective programs on the fly that can manipulate files across diverse domains without mistakes,” he explained. When programmatic approaches fail, models fall back on reading and rewriting entire files, which introduces more errors.

The only domain where most models performed reliably was Python programming. The overall top performer, Gemini 3.1 Pro, was ready for delegated work in just 11 of the 52 domains tested.

For organizations using retrieval-augmented generation (RAG), the findings carry an additional warning. Noisy or irrelevant documents in the context window compound degradation significantly over long workflows. A 1% drop after two interactions can grow to between 2% and 8% over an extended simulation.

Laban recommends that developers build applications around short, transparent tasks rather than long autonomous workflows. He also advises using tightly scoped, domain-specific tools instead of generic ones to keep AI agents on track.

Despite the concerning findings, Laban sees reason for optimism. Within the GPT model family alone, scores on similar tasks rose from below 20% to around 70% in 18 months. Still, he cautions that even as foundation models improve, organizations will always need custom, domain-specific tooling to ensure reliability.
