Why is AI-built software breaking at scale? | Batista Consulting

A pattern has emerged from the data of the past eighteen months, and it is no longer reasonable to ignore.

Multiple independent studies, conducted across different methodologies, sample sizes, and code generation tools, have arrived at the same conclusion: software produced primarily by AI coding tools is shipping with measurably higher rates of security vulnerabilities, structural defects, and long-term maintenance burden than software produced by human engineers. The gap is not marginal. It is consistent across studies. And the volume of code now being produced this way is large enough that the problem has stopped being an emerging concern and started being a structural one.

This is the quality crisis. It is happening now. It is documented. And the businesses absorbing the cost are mostly unaware of what they carry.

I. What the studies converge on

There is no single study that establishes the pattern. There are at least eight, conducted between 2024 and 2026, that point in the same direction.

CodeRabbit's December 2025 analysis compared 470 GitHub pull requests, 320 AI-co-authored against 150 human-only, using statistical confidence intervals. AI-generated code introduced security vulnerabilities at 1.88 times the rate of human-written code. Production incidents per pull request increased 23.5% between December 2025 and early 2026.

AppSec Santa's 2026 study tested 534 code samples across six major models against the OWASP Top 10. The vulnerability rate ranged from 19.1% on the safest model to 29.2% on the worst. No model produced consistently secure code.

Veracode's October 2025 GenAI Code Security Report tested over 100 LLMs on coding tasks that forced a choice between secure and insecure implementations. The models chose the insecure option 45% of the time across all languages, and over 70% of the time in Java.

A 2026 ArXiv study analyzed 6,275 public GitHub repositories containing 304,362 verified AI-authored commits. Unresolved technical debt climbed from a few hundred issues in early 2025 to over 110,000 surviving issues by February 2026.

Sherlock Forensics, a security firm conducting structured assessments between January and April 2026, reported that the vast majority of AI-generated codebases contained vulnerabilities considered unacceptable in any production environment.

The numbers are not identical across studies because the methodologies differ. The direction is identical across all of them.

II. Why the pattern exists

The cause is not a flaw in any specific tool. The cause is structural.

AI coding assistants are optimized to produce code that runs. They are scored, internally and externally, on whether the generated output compiles, executes, and matches the prompt's stated intent. They are not scored, with the same rigor or weight, on whether the output is secure, maintainable, or sound under real-world conditions. The trade-off is visible in the published research. Apiiro's analysis of AI-assisted development at Fortune 50 enterprises found that simple syntax errors fell 76% with AI assistance and logic bugs fell 60%. At the same time, architectural security flaws rose by 153%.

The tools became better at the things benchmarks measure. They became worse at the things benchmarks do not measure. The result is code that passes review faster and ships sooner, carrying defects that are invisible until the application is in production with real users.

This is not a moral failing of the tools or of the people using them. It is a predictable outcome of optimization pressure. The same pressure shaped early industrial production. A factory floor optimized for output speed produces more output and more uniform output, and the defects that result are systemic rather than random. The defects show up in the field, not on the line.

Software is now in the same phase. The line is fast. The field is accumulating problems at a measurable rate.

III. The specific defect patterns

Three categories of failure show up repeatedly in the audit data.

Authentication and authorization gaps. AI-generated code reproduces the patterns it was trained on, which include large quantities of tutorial code, sample applications, and demo projects. Tutorial code typically simplifies authentication for clarity. Production code cannot afford that simplification. The result is code that looks correct, runs correctly, and does not enforce the access controls a real application requires. The Moltbook incident, in which a vibe-coded social network exposed 1.5 million API tokens because Row Level Security was not enabled on its Supabase database, is the most-cited recent example. It is not the only one.
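
A minimal sketch of this gap, in Python. The Note record, the in-memory NOTES store, and both handler functions are hypothetical; they stand in for whatever framework and database an application uses and are not taken from any incident described here. The first handler has the tutorial shape and returns any record whose ID exists. The second adds the ownership check that tutorial code tends to omit.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Note:
    note_id: int
    owner_id: int
    body: str


# Hypothetical in-memory store standing in for a real database table.
NOTES = {
    1: Note(note_id=1, owner_id=100, body="alice's private note"),
    2: Note(note_id=2, owner_id=200, body="bob's private note"),
}


def get_note_tutorial_style(note_id: int) -> Note:
    """Tutorial shape: looks correct and runs, but any caller can read any note."""
    return NOTES[note_id]


def get_note_with_authorization(note_id: int, requesting_user_id: int) -> Note:
    """Same lookup, plus the ownership check a production handler needs."""
    note = NOTES.get(note_id)
    # Treat "missing" and "not yours" identically so that callers cannot
    # probe which IDs exist.
    if note is None or note.owner_id != requesting_user_id:
        raise PermissionError("note not found or not accessible")
    return note
```

Both functions return the same result when the owner asks for their own note, which is why the first version survives a quick manual test. Only a request made as a different user exposes the difference. Row Level Security enforces the same kind of rule at the database layer rather than in each handler.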

Hardcoded secrets and unpinned dependencies. Invicti's 2025 analysis of more than 20,000 vibe-coded web applications found that AI models consistently produce hardcoded API keys, often reusing the same placeholder values across different generated projects. These values, sometimes presented as examples in the model's training data, end up in production code paths because the developer prompting the model did not catch the substitution. Dependency versions, similarly, are frequently left unpinned, which means that an updated package with a known vulnerability can enter the application without any code change on the developer's side.
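
Both problems can be caught by simple automated checks. The Python sketch below is illustrative only: the regular expression, the function names, and the assumption that dependencies live in a plain requirements file with one package per line are all ours, and real secret scanners and dependency auditors use far larger rule sets.

```python
import re

# Assumed pattern: a variable whose name suggests a credential, assigned a
# quoted literal of 8 or more characters. This is a deliberately narrow rule.
SECRET_PATTERN = re.compile(
    r"""(?i)\b\w*(api_?key|secret|token|password)\w*\s*=\s*["'][^"']{8,}["']"""
)


def find_hardcoded_secrets(source: str) -> list[int]:
    """Return 1-based line numbers that look like hardcoded credentials."""
    return [
        number
        for number, line in enumerate(source.splitlines(), start=1)
        if SECRET_PATTERN.search(line)
    ]


def find_unpinned_requirements(requirements: str) -> list[str]:
    """Return requirement lines that do not pin an exact version with '=='."""
    unpinned = []
    for raw_line in requirements.splitlines():
        line = raw_line.split("#", 1)[0].strip()  # drop trailing comments
        if line and "==" not in line:
            unpinned.append(line)
    return unpinned
```

A credential read from the environment, such as os.environ["SERVICE_TOKEN"], does not match the pattern, while a quoted literal assigned to a name like API_KEY does. A requirement such as flask>=2.0 is reported as unpinned because any later release satisfies it.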

Architectural failures at scale. Code that works at small scale often does not work at large scale. AI tools, in their current form, are not yet reliable at the architectural decisions that determine whether an application survives growth: database design that scales, service boundaries that contain failure, caching strategies that hold up under load. Applications built primarily through prompts tend to follow a recognizable trajectory. They work for the first few hundred users. They begin to show stress around a few thousand. They break, often catastrophically, somewhere between five and twenty thousand. The pattern is consistent enough that some auditors refer to it as the six-month wall.

IV. What the volume looks like

The defect rate matters less than the defect volume.

Sonar's 2025 developer survey reported that 42% of all production code is now AI-generated or AI-assisted, with developers projecting the share to exceed 50% by 2027. GitHub's internal data, made public in early 2026, suggests that 46% of all code committed to public repositories now passes through an AI assistant at some stage of its production.

If the vulnerability rate is roughly 25%, in line with the 19.1% to 29.2% range reported above, and the volume is approaching half of all code, the implication is direct. By June 2025, AI-generated code was adding more than 10,000 new security findings per month across studied repositories. That is a tenfold increase from December 2024. The trajectory is steeper than the rate at which security teams can grow.
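
To put the tenfold figure in monthly terms, the short Python sketch below converts it into an implied compound growth rate. It assumes steady month-over-month compounding across the six months from December 2024 to June 2025, which the underlying data may not follow; the function name is illustrative.

```python
def implied_monthly_growth(total_factor: float, months: int) -> float:
    """Constant month-over-month multiplier that compounds to total_factor."""
    if total_factor <= 0 or months <= 0:
        raise ValueError("total_factor and months must be positive")
    return total_factor ** (1 / months)


# Tenfold growth over the six months from December 2024 to June 2025.
factor = implied_monthly_growth(total_factor=10, months=6)
print(f"Implied growth: about {factor - 1:.0%} per month")
```

Under that assumption, findings grew by roughly 47% each month, a pace no hiring plan for a security team is likely to match.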

The 2026 Trend Micro TrendAI report identified 6,086 AI-related CVEs between 2018 and 2025, with 2,130 disclosed in 2025 alone. Year-over-year growth was 34.6%. Agentic AI CVEs, a category that did not exist before 2024, grew 255.4% in a single year.
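
Two further numbers follow from these counts by simple arithmetic. The Python sketch below derives them; it assumes the 34.6% growth figure compares total AI-related CVEs in 2025 with the total in 2024, and the variable names are illustrative.

```python
TOTAL_2018_2025 = 6_086   # AI-related CVEs, 2018 through 2025
COUNT_2025 = 2_130        # disclosed in 2025
GROWTH_2025 = 0.346       # year-over-year growth, as reported

# Assumption: the growth rate compares the 2025 total with the 2024 total.
implied_2024 = COUNT_2025 / (1 + GROWTH_2025)
share_2025 = COUNT_2025 / TOTAL_2018_2025

print(f"Implied 2024 count: about {implied_2024:.0f}")
print(f"2025 share of all AI-related CVEs since 2018: {share_2025:.1%}")
```

On these figures, 2025 alone accounts for about 35% of every AI-related CVE recorded over the eight-year window.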

These are not warnings. They are observed counts.

V. The economic logic that produced the crisis

The crisis was not unforeseen. It was a predictable consequence of a specific economic pattern.

AI coding tools became commercially viable in early 2023. They became commercially dominant in late 2024. Through 2025, the value proposition was speed: a startup could now produce in a weekend what previously required a small engineering team to build over months. The cost reduction was real, the speed gain was real, and the businesses that adopted the tools earliest captured significant short-term advantage.

The hidden trade-off was not visible until enough of the resulting code reached production at scale. Production stress reveals what testing does not. By the time the first cohort of seriously vibe-coded applications had crossed the threshold of meaningful user load, the trade-off was visible, and at that point the cost of remediation was significantly higher than the cost of writing the code differently from the start.

This is the structural shape of every technical debt crisis. The cost is paid later, by people who often did not make the original decision, working on systems that have already absorbed the constraints of the original design. The cost compounds because the existing code shapes the next layer of code, and AI tools, asked to extend an AI-generated codebase, reproduce the patterns they find there.

A codebase that begins with weak input validation, hardcoded credentials, and unpinned dependencies will, with high probability, accumulate more of the same as it grows, because the model generating new code is reading the existing code as context.

VI. What this means for businesses

Three groups are absorbing this cost in different ways.

Startups that built on AI tools. Most of the documented incidents to date have happened to startups: Moltbook, Tea, the SaaStr database, the Lovable applications referenced in the Wall Street Journal in mid-2025. These are companies whose entire product was generated through prompts and shipped without an engineering review layer between the AI and production. The pattern is consistent. The business runs for a period of months. Something breaks in a way that is visible to users or to a researcher. The repair cost, when measured against the cost of a structured technical audit before launch, is typically an order of magnitude larger.

Enterprises that introduced AI tools into existing teams. Established companies are in a different position. Their engineering teams use AI tools to accelerate work, but the work still passes through code review, security scanning, and architectural oversight. The defect rate is lower than the headline numbers suggest, because the human review layer is catching most of what AI generates incorrectly. The cost here is different: review burden has increased, security teams are absorbing more findings per developer, and the gap between what AI produces and what production accepts is being absorbed by humans whose time was already constrained.

Mid-sized businesses adopting AI tools without engineering capacity. This is the segment carrying the largest unmeasured risk. Mid-sized companies with limited internal engineering, often relying on a single developer or a small contracted team, have begun to use AI tools to extend their internal applications. The pattern matches the startup pattern: speed gains, no review layer, accumulated risk. The difference is that these are not products being sold to investors. They are operational systems that the business depends on day to day. The breach surface is internal, the data is often customer or financial data, and the organizational capacity to detect a problem before it becomes serious is low.

VII. What changes the trajectory

The studies converge on the same conclusion as the audits. The defect rate does not fall through better tools alone. It falls through structured review applied to the output.

CodeRabbit's analysis, the AppSec Santa study, the Sherlock Forensics report, and the Apiiro Fortune 50 data all show that the defect rate of AI-generated code drops sharply when it passes through static analysis, behavioral testing, and human review with a security focus. The gap between an unreviewed AI-generated codebase and a reviewed one is larger than the gap between human-generated and AI-generated code.

The technology is not the problem. The missing review layer is.

For startups, this means an audit before launch and a quarterly cadence of smaller reviews afterward. For enterprises, it means treating AI-generated code with the same skepticism as code from a contractor who has just been onboarded. For mid-sized businesses, it means accepting that a tool which removes the need for an engineer to write the code does not remove the need for an engineer to read it.

The cost of that review is consistently lower than the cost of the incidents it prevents. IBM's 2025 Cost of a Data Breach report puts the average breach cost for companies with fewer than 500 employees at 3.31 million USD. A widely repeated figure holds that sixty percent of small businesses that suffer a major cyberattack close within six months. A structured audit of an early-stage application, including remediation of the highest-severity findings, runs between 5,000 and 30,000 EUR. The math is not subtle.
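
The comparison can be made explicit as a break-even calculation: an audit pays for itself in expected-value terms once it reduces the probability of a breach by more than the audit cost divided by the breach cost. The Python sketch below uses the figures quoted above. The exchange rate is a placeholder assumption rather than a quoted rate, and the function name is illustrative.

```python
def break_even_risk_reduction(audit_cost: float, breach_cost: float) -> float:
    """Smallest reduction in breach probability at which the audit pays for itself."""
    if audit_cost < 0 or breach_cost <= 0:
        raise ValueError("audit_cost must be non-negative and breach_cost positive")
    return audit_cost / breach_cost


BREACH_COST_USD = 3_310_000       # IBM figure quoted above
AUDIT_COST_EUR_HIGH = 30_000      # top of the audit range quoted above
ASSUMED_USD_PER_EUR = 1.0         # placeholder assumption, not a quoted rate

threshold = break_even_risk_reduction(
    AUDIT_COST_EUR_HIGH * ASSUMED_USD_PER_EUR, BREACH_COST_USD
)
print(f"Audit breaks even if it cuts breach probability by {threshold:.2%}")
```

Even at the top of the audit price range, and with the currencies treated as equal, the break-even point is a reduction in breach probability of less than one percentage point.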

VIII. The question for 2026

The question is not whether AI coding tools will continue to grow. They will. The question is whether the businesses building on them will start treating the output the same way they treat any other code that touches production.

There is no version of this story where AI-generated code becomes self-securing. There is no version where the model generating the next pull request also catches the security flaw it just introduced. The review layer has to come from outside the model. It has to come from a human, or from a tool operated by a human, or from a process that a human is accountable for.

The businesses that figure this out early absorb the speed gains without absorbing the defect rate. The businesses that do not will ship faster for a period and then ship breaches.

The data has been published. The pattern is clear. The decision is whether to act on it.

Batista Consulting runs technical audits for AI-first and vibe-coded products across Europe. If your application was built primarily through AI coding tools, an audit is the first step toward knowing what is actually inside it. Get in touch.

Sources

[1] CodeRabbit, "AI-Generated Code Security Analysis," December 2025.

[2] AppSec Santa, "OWASP Top 10 Across Six Major LLMs," 2026.

[3] Veracode, "GenAI Code Security Report," October 2025.

[4] ArXiv, "Technical Debt Accumulation in AI-Authored Repositories," February 2026.

[5] Sherlock Forensics, "AI Code Security Report 2026," April 2026.

[6] Apiiro, "AI-Assisted Development at Fortune 50 Enterprises," 2026.

[7] Invicti, "Hardcoded Secrets in 20,000 Vibe-Coded Applications," 2025.

[8] Sonar, "Developer Survey 2025."

[9] Trend Micro, "TrendAI Report: AI-Related CVEs 2018-2025," 2026.

[10] IBM, "Cost of a Data Breach Report," 2025.