5 min read

AI Writes the Code. Who Tests It?

There's an assumption quietly spreading through software development teams: AI tools are now so good that testing matters less. Copilot writes the code, the AI generates the tests, the pipeline goes green - what's left to worry about?

Quite a lot, as it turns out.

AI-assisted development is a genuine productivity breakthrough. But it doesn't eliminate quality risk. It redistributes it - into new places, new shapes, and often out of sight entirely. Understanding where the risk goes is the starting point for understanding why independent testing has never been more important.

 

The Quality Echo Chamber

When an AI generates both production code and the tests for that code, something structurally troubling happens: the system validates its own logic against itself. Tests pass. But they may not be testing the right things.

In traditional development, even imperfect teams have one natural quality check built in: the person writing tests is usually not the same person who wrote the code. That creates friction - and friction finds bugs. In a fully AI-driven pipeline, that separation collapses entirely.

This isn't a hypothetical concern. Research from Qodo found that missing context is the most frequently cited quality problem in AI-generated code, reported by 65% of developers during test generation and code review.

The issue isn't that AI code fails obvious checks. It's that the gaps between what was intended and what was built become invisible to the system that built it.
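A minimal, hypothetical sketch of that echo chamber (the function, its names, and the business rule are invented for illustration): the AI implements a discount function, then generates a test that asserts whatever the implementation already does - so a violated business rule ships with a green pipeline.

```python
# Hypothetical illustration: an AI-generated function plus an
# AI-generated test derived from the same flawed logic.

def apply_discount(price: float, percent: float) -> float:
    """Apply a percentage discount. The intended business rule was
    to reject discounts above 50% - but the prompt never said so,
    and the model silently clamps instead of raising an error."""
    percent = min(percent, 100.0)  # silently accepts 80% off
    return round(price * (1 - percent / 100), 2)

def test_apply_discount():
    # The generated test asserts what the code *does*,
    # not what the business *intended* (reject > 50%):
    assert apply_discount(100.0, 80.0) == 20.0

test_apply_discount()  # passes - pipeline green, business rule broken
```

The test is internally consistent with the code, which is exactly the problem: consistency with itself is all a self-generated test can check.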

An external testing team has no shared history with the codebase. No anchoring to the AI's assumptions. No conflict of interest between shipping speed and quality. That's not a limitation - it's the core value.

 

The Code That Looks Right But Isn't

AI-generated code has a distinctive defect profile that makes traditional review processes less effective. The code is syntactically clean, well-formatted, and often passes basic automated checks. Problems emerge at the semantic level: incorrect business logic, missing security controls, subtle edge cases that the prompt didn't specify and the AI didn't anticipate.

The Cloud Security Alliance describes the problem precisely: the most dangerous AI-generated flaws don't look like flaws at all. They appear in the cracks between logic, business context, and edge cases - and they're harder to spot precisely because the code looks correct.

Veracode's 2025 GenAI Code Security Report shows the scale of the problem: across an analysis of over 100 language models on 80 real-world coding tasks, 45% of all generated code samples contained OWASP Top 10 vulnerabilities - and that rate has remained flat despite improvements in the underlying models. A typical example: AI routinely generates API endpoints without input validation - not because the code is "wrong" in any obvious sense, but because the prompt didn't ask for it. The absence of a security control is not a syntax error. It won't trigger a linter. It requires a tester who knows what the system is supposed to do - not just what it currently does.
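The missing-validation pattern can be sketched in a few lines (framework-free and hypothetical - the handler names and payload fields are invented, not taken from Veracode's report): the first handler is what a bare prompt tends to produce, the second adds the control the prompt never asked for.

```python
# Hypothetical sketch of the defect pattern: an endpoint handler
# generated without input validation, next to the version a
# security-aware reviewer would demand.

def transfer_unvalidated(payload: dict) -> dict:
    # What "write a transfer endpoint" tends to yield: syntactically
    # clean, no validation - a negative amount sails straight through.
    return {"to": payload["to"], "amount": payload["amount"], "status": "ok"}

def transfer_validated(payload: dict) -> dict:
    # The controls the prompt never asked for:
    to = payload.get("to")
    amount = payload.get("amount")
    if not isinstance(to, str) or not to:
        return {"status": "error", "reason": "missing recipient"}
    if not isinstance(amount, (int, float)) or amount <= 0:
        return {"status": "error", "reason": "invalid amount"}
    return {"to": to, "amount": amount, "status": "ok"}
```

Both versions compile, both pass a linter, and both look equally "done" in a pull request - the difference only surfaces when someone asks what the system is supposed to reject.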

The data on scale is stark: CodeRabbit's analysis of 470 real-world GitHub pull requests found that AI-generated code contains roughly 1.7 times more issues than human-written code. Not just more issues - more severe ones. Critical defects appear at 1.4 times the rate. Logic and correctness problems are 75% more frequent. Performance defects are nearly 8 times more common.

Organizations deploying AI-generated code at speed, without adapted testing processes, are systematically increasing their production defect rate. They typically discover this when the cost of failure is already high.

 

The Debt You Don't See Accumulating

There's a third risk, less visible than the others and potentially the most consequential over time.

AI produces code faster than teams can build the understanding needed to safely change, test, or debug it. Researchers have named this dynamic: Cognitive Debt is the erosion of shared mental models across a team; Intent Debt is the absence of documented rationale and constraints.

A paper published in March 2026 by Margaret-Anne Storey at the University of Victoria captures the key insight: generative AI does not remove the challenges of software engineering - it redistributes them.

Technical debt - the kind you can see in code quality metrics - may actually decrease with AI assistance. But the debt that accumulates in people's heads, and in the missing documentation of why decisions were made, grows quietly and continuously. Teams often don't realize how much understanding they've lost until something breaks unexpectedly. Related research by Shaw and Nave (2026) describes this as "Cognitive Surrender" - the uncritical adoption of AI outputs that inflates confidence even when the AI is wrong, making errors invisible until they surface in production.

For an external testing partner, this dynamic has an important implication: an external team doesn't accumulate Cognitive Debt about a codebase. Every engagement starts from an independent, unanchored perspective. In a world where internal understanding erodes over time, that becomes a structurally durable advantage - not just a project-level convenience.

 

Regulation Is Catching Up

For organizations in regulated sectors, there is now a fourth and formally binding dimension to consider.

The EU AI Act reaches full application on August 2, 2026. For high-risk systems - spanning critical infrastructure, financial services, healthcare, education, and employment - it mandates verifiable, documented, independent quality assurance. Not just quality. Demonstrable quality, with an audit trail, performed by a party that can credibly claim independence from the development process. Penalties reach up to €35 million or 7% of global annual revenue.

Internal teams are structurally ill-positioned to provide this. The regulation doesn't just create compliance overhead - it creates a formal requirement for exactly the kind of external, neutral testing that has always been the core of independent quality assurance.

For organizations in scope, this is no longer optional. For others, it's an early signal: the trajectory of software regulation is moving clearly toward documented, auditable quality processes.

 

What This Means in Practice

The developers themselves know something is wrong. The Stack Overflow Developer Survey 2025, covering nearly 50,000 developers across 177 countries, found that 46% distrust AI tool accuracy. Only 3% report high trust in AI-generated output. Yet the same developers are shipping AI-generated code at unprecedented volume - because review capacity doesn't scale with generation speed.

The DORA State of AI-Assisted Software Development Report (2025) adds a further dimension: AI adoption now has a measurably negative relationship with software delivery stability, and organizations with fragmented quality processes experience AI accelerating their technical debt, not reducing it.

This is the classic pattern that creates professional services demand: a recognized problem that cannot be solved internally. The bottleneck isn't awareness. It's capacity and independence.

Testing AI-generated software requires more than applying existing test frameworks to a new kind of code. It requires understanding the specific defect profile of AI-generated code - where the risks concentrate, how semantic defects differ from traditional ones, and how to evaluate whether a system does what users and the business actually need, not just what the AI interpreted from the prompt.

 

The real question is not whether AI can write code. It's whether your organization can verify that the software is actually fit for purpose. Independent testing helps make that visible - before defects, compliance gaps, or hidden quality risks reach production.

-- Florian Fieber

 

The Position That Matters

There's a simple way to frame what's happening: as development is increasingly automated, testing becomes the last reliable quality gate - the only systematic check that code actually does what it is supposed to do.

That's not a pessimistic view of AI. It's a realistic one. AI-assisted development is a genuine productivity breakthrough - and like every productivity breakthrough in software history, it raises the importance of the quality processes that sit alongside it.

Independent testing partners - with no conflict of interest, no accumulated blind spots, and credible neutrality for compliance purposes - are structurally positioned to fill the gap that AI development creates.

The question "AI writes the code - who tests it?" has a clear answer. It shouldn't be the same system that wrote it.

 


 

If you want to check whether your QA strategy is already up to the task of addressing the risks of AI-generated software, it's worth getting an independent external perspective now.

 


 

Sources

Qodo, State of AI Code Quality (2025) - qodo.ai/reports/state-of-ai-code-quality

Cloud Security Alliance, Understanding Security Risks in AI-Generated Code (July 2025) - cloudsecurityalliance.org

Veracode, GenAI Code Security Report 2025 - veracode.com

CodeRabbit, State of AI vs Human Code Generation (December 2025) - coderabbit.ai

Storey, M-A., From Technical Debt to Cognitive and Intent Debt (March 2026) - arxiv.org/abs/2603.22106

Shaw & Nave, Thinking Fast, Slow, and Artificial (2026) - ssrn.com/abstract=6097646

EU AI Act, Regulation 2024/1689 - digital-strategy.ec.europa.eu

Stack Overflow, Developer Survey 2025 - survey.stackoverflow.co/2025

DORA, State of AI-Assisted Software Development (September 2025) - cloud.google.com

 
