Microsoft DELEGATE-52: a flaw in every AI agent | Batista Consulting

I. What Microsoft Research just published

On April 17, 2026, three researchers at Microsoft Research published a benchmark called DELEGATE-52. They tested 19 large language models on long delegated workflows across 52 professional domains. The finding matters for every founder shipping code with AI agents. Frontier models (Gemini 3.1 Pro, Claude 4.6 Opus, GPT 5.4) corrupt 25% of document content on average over 20 interactions. The average across all 19 tested models is 50%. Catastrophic corruption (retention of 80% or below) occurs in more than 80% of model-domain combinations.

The corruption is silent. It does not produce errors or visible failures. The document still parses. The code still compiles. The pricing page still renders. The corruption shows up as facts that have shifted and numerical values that have drifted, in ways the founder did not ask for.

This piece unpacks the three findings from the paper that matter most for any startup using AI agents to write production code. It names the products affected. It explains why short trials cannot detect the failure. And it explains what a technical audit catches that the founder cannot.

II. How the benchmark actually works

DELEGATE-52 measures corruption using a method called round-trip backtranslation. The model is given a document and an editing instruction. Then the model is given the edited document and the inverse instruction. A perfect model returns the original document, byte for byte. Anything different is corruption introduced by the model.
This method matters because it removes the most common excuse for AI coding errors: the prompt was unclear. The round-trip applies the model's own edit and its own reversal. Any difference between input and output is the model's behavior, not the user's wording.

The 52 domains in the benchmark span coding (Python, SQL, YAML, Dockerfile, regex, shell), structured records (accounting ledgers, calendars, music notation, subtitles, recipes, EDIFACT), and scientific formats (crystallography, genealogy, chemistry). The 310 work environments are real documents from each domain, not synthetic examples. The 19 tested models cover the current top tier. The list includes Claude 4.6 Opus and Sonnet, GPT 5.4 and GPT 5, Gemini 3.1 Pro and Flash, Kimi K2.5, and Llama 4 Maverick.

The headline finding from section I is not the most damaging result in the paper. The most damaging results are three specific findings about how the corruption behaves. Those are what the next three sections cover.

III. The Python exception

The DELEGATE-52 results show one domain where most models pass: Python. 17 of 19 tested models achieve lossless manipulation in the Python domain. Every other domain ranges from poor to catastrophic.

This matters because Python is the domain founders use to judge AI coding tools. A founder tries Cursor on a Python project. The output looks correct. The tests pass. The founder concludes Cursor is reliable.

Then the founder uses the same tool to edit a Dockerfile, a JSON config, a Makefile, a database migration, an env file, a YAML CI pipeline, a SQL schema. Every one of these is a separate domain in the DELEGATE-52 results. Every one of these is where the corruption happens.

A startup codebase is maybe 30% Python (or TypeScript, where the same pattern holds for the JavaScript family). The other 70% is the surface area where the same agent, the same model, the same prompt produces silent corruption. The founder's confidence is calibrated on the 30%. The damage accumulates in the 70%.

The most common failure surface is not code in the strict sense. It is structured records embedded in code: pricing fields, currency conversions, API keys, environment variables, schema definitions, configuration values. These are the values where a one-character drift produces a real-world consequence the founder will feel. The model handles the surrounding code correctly. The corruption is in how the model reads the embedded value's semantic meaning. That is not a coding error. That is a domain error.

IV. The agentic harness paradox

Cursor, Claude Code, Cline, and Devin are agentic harnesses. They wrap a model (Claude, GPT, Gemini) in a tool layer: file read, file write, code execution, search. The marketing position is that giving the model tools reduces errors because the model can verify its own work.

DELEGATE-52 tested this directly. The Microsoft team ran the same models with and without a basic agentic harness. The harness had file reading, file writing, and code execution tools. The result: models performed 6% worse on average when given tools.

The mechanism is documented in the paper. Models with tools invoke 8 to 12 tool calls per task. Each tool call consumes context tokens. The total context consumed is 2 to 5 times higher than the no-tool version. Long context degrades model performance. The agent's own tool use is what corrupts the document.

This is the opposite of what every agentic IDE promises. The promise is precision. The reality is that the agent's working memory fills up with tool call output. The precision from "the agent can read the file" gets eaten by the agent's own context bloat. By the time the agent writes the file back, the original instruction has drifted out of the attention window.

A note on fairness. The trend in the paper does suggest the gap will close. Better models rely more on code execution (10% for GPT 4.1, 45% for GPT 5.4). The Microsoft team explicitly states that the harness they tested is basic, not state-of-the-art. A more sophisticated harness might recover the lost performance. But that is the future. Today, in May 2026, every founder using Cursor or Claude Code in a long session is operating in the regime the paper measured. The 6% degradation is what they get.

V. The two-interaction trap

DELEGATE-52 includes a finding that should change how every founder evaluates AI tooling. After 2 interactions, GPT 5 and Kimi K2.5 are nearly identical: 91.5% retention versus 91.1%. After 20 interactions, they have diverged sharply: 48.3% versus 64.1%. A 0.4 point gap at interaction 2 becomes a 15.8 point gap at interaction 20.
The evaluation horizon of a founder testing an AI coding tool is one afternoon. Maybe two. The corruption window in the paper is 20+ interactions over a working week. The founder's evaluation method is structurally blind to the failure mode they need to detect.

This is the part of the paper that should be the most uncomfortable for the AI coding category. The tools are sold on demos. Demos are short. Short evaluations do not predict long-horizon behavior. The founder who tried Cursor for a day and concluded it was safe has no information about what Cursor does to a codebase over six weeks.

A concrete example of how this plays out. Consider a SaaS that charges €49 a month. The founder asks an agent to migrate the pricing module from JavaScript to TypeScript. The agent normalizes the price field from a string to a number. In doing so, it reads the value as cents instead of euros. The pricing page still displays €49. The payment provider's dashboard still shows €49 plans. The actual amount charged to every new customer for three weeks is €0.49. No error fires. No test fails. The corruption is one file, one field, 97% of new revenue gone. This is the failure mode DELEGATE-52 measures. It is the kind of damage that does not show up until the founder reads their monthly P&L.

VI. What a technical audit catches

The DELEGATE-52 method is itself a usable framework for a codebase review. The round-trip principle (apply an edit, then its inverse, see what changed) maps to a practical audit pattern. Batista Consulting runs technical audits for vibe-coded startups. The audit for a codebase that has been edited heavily by AI agents focuses on four things:

01Config files versus git history. Identify which config values changed without a corresponding human commit. Cross-check against the founder's memory of intentional changes.
02Schema migrations. Diff the current schema against the migrations that produced it. Confirm every column, index, and constraint is intentional.
03 Agent-touched file inventory. List every file modified by an agentic session. Flag the ones never reviewed by a second human.
04Numerical fields in business logic. Trace every price, fee, percentage, currency conversion, and tax calculation. These are the highest-leverage failure surfaces.

The audit is delivered as a written report. The report names every file an agent has touched, every silent change found in those files, every business-logic field at risk, and every config value that no longer matches the founder's intent. It is the artifact a founder can forward to their CTO, their board, or their next engineering hire.

Over 80% of model-domain combinations in the DELEGATE-52 paper showed catastrophic corruption by interaction 20. The question is not whether the silent corruption is happening in a vibe-coded codebase. The question is where.

If you have shipped code with an AI agent in the last six months and have not had a second pair of eyes review the changes, book a technical audit.