What AI coding benchmarks still miss about software quality

Most AI coding benchmarks still ask the question: did the agent produce code that passes the current tests?

This is a useful question, but it is too narrow. Software development is iterative. Requirements change and edge cases appear. Old design decisions become constraints on new work. Code that passes today can still make the next change slower, more expensive, and riskier.

The gap matters more as AI raises the volume of code change. When generation gets cheap, the real question shifts from ‘can the agent produce a working patch?’ to ‘what kind of codebase does repeated agent use create over time?’

Andrian Budantsov

A recent paper, SlopCodeBench: Benchmarking How Coding Agents Degrade Over Long-Horizon Iterative Tasks (Orlanski et al.), gets closer to that question than most benchmark work. Instead of scoring one-shot solutions, it makes agents extend their own prior code across 20 problems and 93 checkpoints.

Each checkpoint changes the specification. The agent does not start fresh and is not given an internal design to follow. It has to live with earlier choices.

This setup is closer to real development than most benchmark suites, because real teams inherit yesterday’s shortcuts.

files need to be touched for every feature. The software still works, but becomes more difficult to change.

The code-search problem in the benchmark illustrates this. At first, the system only needs to find Python code using exact text or regular expressions. Later, it has to handle more languages, match on code structure (AST matching), and even apply automatic fixes.

If the initial design is too rigid and bakes in early assumptions, it may pass the first tests but will struggle to absorb the later, more complex requirements.
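To make the design point concrete, here is a minimal sketch in Python of a search core that treats each matching strategy as a pluggable function. It is not the benchmark's reference design; the names `exact`, `regex`, `python_call`, and `search` are hypothetical, and the AST matcher only recognises calls to a bare function name. The point is that adding structural matching later means adding a matcher, not rewriting the search loop.

```python
import ast
import re
from typing import Callable, Iterator

# A matcher takes source text and yields 1-based line numbers that match.
# (Hypothetical interface, chosen for illustration only.)
Matcher = Callable[[str], Iterator[int]]


def exact(text: str) -> Matcher:
    """Match lines containing the literal text."""
    def run(source: str) -> Iterator[int]:
        for lineno, line in enumerate(source.splitlines(), start=1):
            if text in line:
                yield lineno
    return run


def regex(pattern: str) -> Matcher:
    """Match lines where the regular expression is found."""
    compiled = re.compile(pattern)

    def run(source: str) -> Iterator[int]:
        for lineno, line in enumerate(source.splitlines(), start=1):
            if compiled.search(line):
                yield lineno
    return run


def python_call(func_name: str) -> Matcher:
    """Structural match: calls to a bare function name, found via the AST."""
    def run(source: str) -> Iterator[int]:
        for node in ast.walk(ast.parse(source)):
            if (isinstance(node, ast.Call)
                    and isinstance(node.func, ast.Name)
                    and node.func.id == func_name):
                yield node.lineno
    return run


def search(source: str, matcher: Matcher) -> list[int]:
    """Run any matcher and return sorted, de-duplicated line numbers."""
    return sorted(set(matcher(source)))


sample = "import os\nprint(os.getcwd())\nx = 'print me'\n"
print(search(sample, exact("print")))        # text match hits the string literal too
print(search(sample, python_call("print")))  # structural match hits only the call
```

A design that hard-codes line-by-line text matching into `search` would pass the early checkpoints equally well, which is exactly why a pass/fail score cannot distinguish the two.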

The results are clear. None of the evaluated agents solved any problem end to end. The best strict solve rate was 17.2 percent, and by the final checkpoint strict solve rates fell to 0.5 percent. Across trajectories, verbosity rose in 89.8 percent of runs and structural erosion in 80 percent.

The comparison with human-maintained code is even more useful. Against 48 maintained Python repositories, agent-generated code was 2.2 times more verbose and more structurally eroded.

When the authors tracked 20 of those repositories over time, the human code was comparatively flat while the agent code kept worsening with each iteration.

A passing suite tells you the latest version satisfied known checks. It does not tell you whether the code is becoming more fragile or more expensive to extend.
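One way to look past pass/fail is to record a few structural numbers at every checkpoint and watch the trend. The sketch below is an assumption-laden illustration, not the paper's metric definitions: `crude_metrics` is a hypothetical helper that counts non-blank lines and the largest number of branching constructs found inside any single function, using Python's standard `ast` module.

```python
import ast

# Node types treated as "branching" for this rough proxy (an assumption,
# not a standard complexity definition).
BRANCH_TYPES = (ast.If, ast.For, ast.While, ast.Try, ast.BoolOp)


def crude_metrics(source: str) -> dict:
    """Return rough size and concentration-of-complexity proxies."""
    tree = ast.parse(source)
    loc = sum(1 for line in source.splitlines() if line.strip())
    worst = 0
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            # Count branching nodes anywhere inside this function.
            branches = sum(isinstance(n, BRANCH_TYPES) for n in ast.walk(node))
            worst = max(worst, branches)
    return {"loc": loc, "max_branches_in_one_function": worst}


checkpoint_1 = "def f(x):\n    return x + 1\n"
checkpoint_2 = (
    "def f(x):\n"
    "    if x > 0:\n"
    "        return x + 1\n"
    "    elif x < 0:\n"
    "        return x - 1\n"
    "    return 0\n"
)

# Both versions could pass their tests; the trend is what matters.
for name, src in [("checkpoint 1", checkpoint_1), ("checkpoint 2", checkpoint_2)]:
    print(name, crude_metrics(src))
```

Numbers like these are blunt, and a rise is not automatically bad. They are useful mainly as a trend line that a green pipeline does not provide.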

AI tools to write and maintain tests, especially functional UI automation in tools like Playwright. That work follows the same pattern as the paper: the product changes, the test has to change, the next feature adds another branch, another selector, another exception, another helper.

The paper is about coding broadly, not automation test suites specifically, but the mechanism carries over. A test suite can also become verbose and structurally weak under repeated AI-assisted edits.

A degraded test suite is harder to notice than degraded product code. The pipeline can still be green and the suite can still look larger on paper. Coverage can appear to improve.

Meanwhile, the core asset may be degrading: brittle selectors, weak checks, copied test steps, oversized helper functions, and UI tests that are hard to fix and easy to doubt. Flakiness is obvious, but tests that verify very little or run very slowly can go unnoticed for a long time.

For QA leaders, that shifts the job. Quality assurance cannot stop at validating the latest output against today’s requirements. It also has to watch whether repeated change is damaging both the product and the test system that is supposed to protect it.

ID, access rights, money, or rules.

The same rule applies to tests. Review how AI-generated test code changes after several product iterations. Watch for suites that grow faster than their signal and UI tests that absorb behavior better covered at lower levels.

Also be aware of ‘self-healing’ maintenance that subtly lowers assertion strength. A larger suite doesn’t automatically mean better control.

Quality needs to move upstream. By the time a feature reaches final validation, some of the damage may already be baked into the path the system took to get there.

QA needs a voice earlier in the loop: in design constraints, review standards, regression strategy, and the definition of acceptable change quality for both product code and test code.

Ultimately, passing tests still matters, but as AI increases the volume of code change, the more useful question is whether each successful change leaves the codebase safer to extend or more dangerous to touch.

This article was produced as part of TechRadar Pro Perspectives, our channel to feature the best and brightest minds in the technology industry today.

The views expressed here are those of the author and are not necessarily those of TechRadarPro or Future plc. If you are interested in contributing find out more here: https://www.techradar.com/pro/perspectives-how-to-submit
