Frontier AI models don’t just delete document content: they rewrite it, making errors nearly impossible to detect.



As large language models become more capable, users are tempted to delegate knowledge tasks where the models process documents on their behalf and provide the final results. But to what extent can you trust a model to stay true to the content of your documents when the work iterates over multiple rounds?

A new study conducted by Microsoft researchers shows that large language models silently corrupt the documents they work on by introducing errors. The researchers developed a benchmark that simulates multi-step autonomous workflows across 52 professional domains, using a method that automatically measures how much content degrades over time.

Their findings show that even state-of-the-art models corrupt an average of 25% of document content at the end of these workflows. And providing models with realistic distractor tools or documents actually worsens their performance.

This serves as a warning that while there is increasing pressure to automate knowledge work, current language models are not fully reliable for these tasks.

The mechanics of delegated work

The Microsoft study focuses on “delegated work,” an emerging paradigm in which users allow LLMs to complete knowledge tasks on their behalf by analyzing and modifying documents.

A prominent example of this paradigm is vibe coding, where a user delegates software development and code editing to an AI. But delegated workflows go far beyond programming and span other domains. In accounting, for example, a user could provide a dense ledger and tell the model to divide the document into separate files organized by specific expense categories.

Because users may lack the time or specialized expertise to manually review every modification the AI makes, delegation often depends on trust. Users expect the model to faithfully complete tasks without introducing uncontrolled errors, unauthorized deletions, or hallucinations into the documents.

To measure the extent to which AI systems can be trusted in iterative, extended delegated workflows, the researchers developed the DELEGATE-52 benchmark. It comprises 310 work environments spanning 52 diverse professional domains, including financial accounting, software engineering, crystallography, and music notation.

Each environment is built around an initial real-world text document of 2,000 to 5,000 tokens, along with five to ten complex, non-trivial editing tasks.

Grading a complex, multi-step editing process often requires expensive human review. DELEGATE-52 avoids this by using a “round-trip relay” simulation method that evaluates responses without requiring human-annotated reference solutions. The approach is inspired by the back-translation technique used in machine translation testing, where an AI model is asked to translate a document from one language to another and back again, to see how faithfully it reproduces the original.

Accordingly, each editing task in DELEGATE-52 is designed to be completely reversible, pairing a forward instruction with its precise inverse. For example, an instruction to split the general ledger into separate files by expense category is combined with an instruction to merge all category files back into a single general ledger.
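In rough terms, a single round trip can be scored with a few lines of code. The Python sketch below is purely illustrative: call_model and similarity are stand-ins I've invented for this example, not the paper's published harness.

```python
# A minimal sketch of the round-trip relay idea, not the paper's actual
# harness. call_model stands in for any LLM API; similarity is a crude
# stand-in for the benchmark's parsed-representation comparison.
from difflib import SequenceMatcher

def call_model(instruction: str, document: str) -> str:
    """Hypothetical LLM call. Each invocation is a fresh session, so the
    model sees no history from earlier steps."""
    raise NotImplementedError  # plug in a real model client here

def similarity(a: str, b: str) -> float:
    """Character-level similarity in [0, 1]; 1.0 means identical."""
    return SequenceMatcher(None, a, b).ratio()

def round_trip_score(original: str, forward: str, inverse: str) -> float:
    edited = call_model(forward, original)    # e.g., "split the ledger by category"
    restored = call_model(inverse, edited)    # e.g., "merge the files back into one ledger"
    return similarity(original, restored)     # 1.0 would mean a lossless round trip
```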

In comments provided to VentureBeat, Philippe Laban, a principal researcher at Microsoft Research and co-author of the paper, clarified that this is not simply a test of whether an AI can hit "undo." While you can't force a human worker to instantly "forget" a task they have just completed, this back-and-forth evaluation is especially well suited to AI: by starting a new conversation session, the researchers force the model to attempt the reverse task completely independently.

The models in their experiments "do not know whether a task is a step forward or backward and are unaware of the overall design of the experiment," Laban explained. "They are simply attempting each task as thoroughly as they can at each step."

These round-trip tasks are chained together in a continuous stream to simulate long-term workflows spanning 20 consecutive interactions. To make the environment more realistic, the benchmark introduces distractor files into the context of each task: between 8,000 and 12,000 tokens of documents related to the topic but completely irrelevant to the task at hand. The distractors measure whether the AI can maintain focus or whether it gets confused and pulls in incorrect data.
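Chaining those round trips with distractor noise yields the full 20-step simulation. This sketch reuses the call_model and similarity stand-ins from above and is, again, an illustration rather than the benchmark's actual code.

```python
import random

def simulate_relay(document: str,
                   task_pairs: list[tuple[str, str]],
                   distractors: list[str],
                   steps: int = 20) -> list[float]:
    """Chain forward/inverse task pairs into one long workflow and record
    how similar the document remains to the original after every step."""
    original, scores = document, []
    for step in range(steps):
        # Advance to a new pair every two steps: forward, then its inverse.
        forward, inverse = task_pairs[(step // 2) % len(task_pairs)]
        instruction = forward if step % 2 == 0 else inverse
        # Inject topical-but-irrelevant distractor files into the context.
        noise = "\n".join(random.sample(distractors, k=min(2, len(distractors))))
        document = call_model(instruction, noise + "\n" + document)
        scores.append(similarity(original, document))
    return scores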

Testing frontier models in the relay

To understand how different architectures and scales handle delegated work, the researchers tested 19 different language models from OpenAI, Anthropic, Google, Mistral, xAI, and Moonshot. The main experiment subjected these models to a simulation of 20 consecutive editing interactions.

Across all models, documents suffered an average degradation of 50% by the end of the simulation. Even the best frontier models in the experiment, specifically Gemini 3.1 Pro, Claude 4.6 Opus, and GPT 5.4, corrupted an average of 25% of document content.

Out of the 52 professional domains, Python was the only one in which a majority of models reached a "ready" score of 98% or higher. Models excel at programmatic tasks but struggle greatly in natural-language and specialized domains such as fiction, income statements, or recipes. The overall top model, Gemini 3.1 Pro, was deemed ready for delegated work in only 11 of the 52 domains.

Interestingly, the corruption was not a death by a thousand cuts in which models slowly accumulate small errors. Instead, about 80% of the total degradation comes from rare but massive critical failures: single interactions in which a model suddenly deletes at least 10% of the document's content. Frontier models do not necessarily avoid small errors better; they simply postpone these catastrophic failures to later rounds.
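Because most of the damage arrives in these single catastrophic steps, a simple per-step guard can catch them before they propagate. The sketch below is a minimal illustration, not part of the benchmark; token counts are a crude proxy for the content-level measurement the researchers use.

```python
def is_critical_failure(before: str, after: str, threshold: float = 0.10) -> bool:
    """Flag a single interaction that wipes out a large share of content.
    The 10% default mirrors the study's definition of a critical failure,
    but word-count shrinkage is only a rough stand-in for content loss."""
    n_before = len(before.split())
    lost = max(0, n_before - len(after.split()))
    return lost / max(1, n_before) >= threshold
```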

Another important observation is that when weaker models fail, their degradation is mainly caused by content removal. When frontier models fail, however, they actively corrupt existing content. The text is still there, but it has been subtly distorted or hallucinated, making the error much harder for a human supervisor to detect.

Notably, giving the models an agent harness with generic tools for code execution and file read/write access actually worsened their performance, adding an average of 6% more degradation. Laban explained that the failure stems from relying on generic rather than domain-specific tools.

"The models lack the ability to write effective programs on the fly that can manipulate files in various domains without errors." he pointed out. "When they can’t do something programmatically, they resort to reading and rewriting entire files, which is less efficient and more error-prone." The solution for developers is to create narrow-scope tools (such as specific functions for calculating or moving entries within .ledger files) to keep agents on track.

Degradation also increases as documents grow in size or as more distractor files are added to the workspace. For enterprise teams investing heavily in retrieval-augmented generation (RAG), these distractor documents serve as a direct warning about the compounding cost of a messy context. While a noisy context window may cause only a minimal 1% performance drop after two interactions, that degradation compounds into a 2-8% drop over a long simulation.
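To see why, consider a toy compounding model (an assumption for illustration, not the paper's math): if each noisy interaction independently costs a small fraction p of fidelity, the retained share after n steps is (1 - p)^n, so per-step losses of only a fraction of a percent land in that 2-8% range over a 20-step workflow.

```python
# Toy model: fidelity retained after n independent noisy steps is (1 - p)**n.
# The p values here are illustrative assumptions, not figures from the study.
for p in (0.001, 0.004):
    print(f"p={p:.1%}: lost after 20 steps = {1 - (1 - p) ** 20:.1%}")
# p=0.1%: lost after 20 steps = 2.0%
# p=0.4%: lost after 20 steps = 7.7%
```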

"For the recovery community: RAG pipelines should be evaluated using multi-step workflows, not just single-turn recovery benchmarks." Laban said. "Single-lap measurements systematically underestimate the damage of inaccurate recovery."

A reality check for the autonomous enterprise

The findings from the DELEGATE-52 benchmark offer a critical reality check on the current hype surrounding fully autonomous AI agents.

The benchmark's design also implies a practical constraint: because models can maintain a clean record for several steps before a sudden catastrophic failure, incremental human review is necessary rather than a single final check. Laban recommends building AI applications around short, transparent tasks rather than complex, long-running agents, preserving the leverage of delegation while keeping each step reviewable.

For organizations looking to securely deploy autonomous agents today, the DELEGATE-52 methodology provides a practical model for testing internal data pipelines. Laban explained that "…an enterprise team that wants to adopt this framework needs to create three components: (a) a set of reversible editing tasks representative of their workflows, (b) a parser that converts their domain documents into a structured representation, and (c) a similarity function that compares two parsed representations." Teams don't even need to build parsers from scratch: the Microsoft researchers reused existing parsing libraries for 30 of the 52 domains tested.
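In code, those three components reduce to a small interface. The sketch below uses made-up names to show the shape of such a harness; teams would plug in their own domain parser and similarity function.

```python
# A minimal sketch of the three components Laban lists, with invented names.
from dataclasses import dataclass
from typing import Any, Callable

@dataclass
class ReversibleTask:
    forward: str   # e.g., "split the ledger into per-category files"
    inverse: str   # e.g., "merge the category files back into one ledger"

@dataclass
class DomainHarness:
    tasks: list[ReversibleTask]              # (a) reversible editing tasks
    parse: Callable[[str], Any]              # (b) document -> structured form
    similarity: Callable[[Any, Any], float]  # (c) compare two parsed forms

    def score(self, original: str, restored: str) -> float:
        """Round-trip fidelity: 1.0 means the restored document parses
        identically to the original."""
        return self.similarity(self.parse(original), self.parse(restored))
```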

Laban is optimistic about the pace of improvement. "The progress is real and rapid. If we look at just the GPT family, the models go from scoring less than 20% to around 70% in 18 months," Laban said. "If that trajectory continues, models will soon be able to reach saturated scores on DELEGATE-52."

However, Laban cautioned that DELEGATE-52 is deliberately small compared to massive enterprise environments. Even once models inevitably saturate this benchmark, the endless long tail of unique business data and workflows means that organizations will always need to invest in custom, domain-specific tools to keep their autonomous agents reliable.


