Testing in the Age of AI: The Contract-Driven Loop

For a decade, QA engineers built their job security on a single fact: developers refused to write UI tests. Playwright was the moat. If you knew the tool, wrote the fixtures, and maintained the browser matrix, you had a job most developers didn’t want to do. It was job security by other people’s inertia.

That moat dried up this year. Not because Playwright got easier. Because AI development tools closed the willingness gap. Claude Code, Cursor, and Aider can all generate Playwright suites in the same session as the feature they’re testing. Developers who wouldn’t spend an hour writing a test now get a test in ten seconds by asking. The reason UI testing lived outside the dev team is gone.

The industry is having two arguments about what happens next. One says AI is coming for QA jobs. The other says AI is a productivity multiplier that makes quality easy. Both are half right. Neither describes what actually changed, or what to do about it.

Here’s the third view. AI didn’t eliminate testing. It eliminated the reason testing lived outside the dev team. Testing was always the developer’s job. Scale created the specialization; scarcity of tooling kept it there; developer reluctance made it durable. AI took away the last leg of the stool. What comes after is the interesting question.

That’s the argument. The rest of this piece is the evidence, and a description of what the new setup looks like for the four groups reading this: manual testers wondering if they should retrain, SDETs wondering if their role is redundant, developers wondering how to test what an agent just wrote, and engineering leaders wondering how to staff for any of it.

The historical accident

A quick history to ground the argument.

In the early 2000s, most development teams did their own testing. Not always well, and not always consistently, but the developer who wrote the feature was expected to verify it. Then two things happened at once. Software got bigger. And the cost of a defect in production climbed sharply because more software started running in production continuously instead of shipping quarterly.

Manual regression testing couldn’t scale to catch defects across a growing surface. Automated testing existed but was expensive to set up. Someone had to build the harness, someone had to maintain the fixtures, and someone had to run the suite before every release. That someone became the QA team.

By 2010, the split was industry-standard. Developers wrote features and unit tests. A separate QA team owned integration, regression, exploratory, and release-gating work. SDETs, a hybrid role that appeared around the same time, bridged the two by building the automation frameworks that let the QA team scale.

None of this was because developers couldn’t test. It was because a specialist could test more efficiently at scale than a distributed set of developers each doing their own thing. That was true when integration tests took an hour to write, a build took thirty minutes, and cross-browser meant paying someone to actually click through the release matrix.

It stopped being true somewhere between 2018 and 2024. Playwright and Cypress replaced Selenium’s setup overhead with npm install. Cloud CI collapsed build wait times. Cross-browser platforms turned the release matrix into a config file. The scaling case for a separate QA team weakened every year. Most orgs didn’t restructure because organizational inertia is stronger than architecture. AI is what makes the restructure impossible to avoid.

What AI actually changed

Working with Claude Code, Cursor, or Aider differs from writing code by hand in one important way: the human’s attention moves upstream. You don’t sit down and think about the next line. You think about the next behavior. What the system should do, what “done” looks like, what could go wrong. Then you delegate the implementation. Then you check the output.

That last step, checking the output, is testing. Not the ceremony of “write a test case, get it approved, execute it, log the result.” The actual act of verifying that what got built matches what was intended. The exact thing a good tester has always done, now happening thirty times a day inside a developer’s own workflow.

Two consequences follow.

One. The developers who thought they hated writing tests were mostly hating a specific implementation of testing: slow, manual, disconnected from the change they just made, gated by another team’s release process. Cheap contract-checkers that run in seconds and are tied directly to the code you just delegated to an agent are a different thing entirely. Most developers using AI tools already write more tests than they did before, not fewer. They don’t call them tests. They call them checks, contracts, guards. Same idea.

Two. The rigor didn’t disappear. It moved. You can see it in what a good AI-driven pull request looks like now: a precise spec at the top of the change, a set of concrete examples showing before/after behavior, a test file that exercises the new contract, a small demo or artifact confirming the change works end-to-end. That’s more rigorous than the median 2020 PR, not less. It’s just structured differently.

If you’re worried that AI-driven development produces sloppy, untested code, that’s not the failure mode. The failure mode is that AI-driven development produces confidently wrong code when the spec was ambiguous. The response to that failure mode is more testing, more contract-checking, and more upfront thinking about what “done” means. Which is where the SDET playbook has been for a decade.

Contracts, actions, artifacts

The operational unit of AI-augmented development, if I had to name it, is a contract-driven loop with three pieces.

A contract is a specification the agent can be graded against. Not a wall of prose. A precise, ideally executable, statement of what the system should do: inputs, outputs, invariants, edge cases. The clearer the contract, the less the agent can guess wrong. Contracts are the new source of engineering leverage. A team that writes them well ships faster and safer than one that doesn’t, regardless of who writes the actual code.
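
To make “ideally executable” concrete, here is a minimal sketch of a contract written as data plus a grader. Everything in it is an assumption made up for illustration: the apply_discount function, the rounding rule, and the check_contract helper are hypothetical and not taken from any real codebase. The contract names the inputs and outputs, one invariant, and the edge cases, and any implementation an agent hands back can be graded against it.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class Example:
    price_cents: int
    percent: int
    expected_cents: int


# Inputs and outputs, including the edge cases a vague spec would leave open.
EXAMPLES = [
    Example(1000, 10, 900),   # ordinary case
    Example(1000, 0, 1000),   # no discount
    Example(1000, 100, 0),    # full discount
    Example(999, 50, 500),    # rounding rule: the discount itself rounds down
]

INVALID_PERCENTS = [-1, 101]  # the contract says these must be rejected


def check_contract(fn: Callable[[int, int], int]) -> List[str]:
    """Return human-readable violations; an empty list means the contract holds."""
    violations = []
    for ex in EXAMPLES:
        got = fn(ex.price_cents, ex.percent)
        if got != ex.expected_cents:
            violations.append(
                f"apply_discount({ex.price_cents}, {ex.percent}) "
                f"returned {got}, expected {ex.expected_cents}"
            )
        # Invariant: a discount never raises the price or drives it below zero.
        if not 0 <= got <= ex.price_cents:
            violations.append(f"result {got} is outside 0..{ex.price_cents}")
    for percent in INVALID_PERCENTS:
        try:
            fn(1000, percent)
        except ValueError:
            continue
        violations.append(f"percent={percent} should raise ValueError")
    return violations


def apply_discount(price_cents: int, percent: int) -> int:
    """One implementation an agent might hand back; graded by check_contract."""
    if not 0 <= percent <= 100:
        raise ValueError("percent must be between 0 and 100")
    return price_cents - (price_cents * percent) // 100
```

Calling check_contract(apply_discount) returns an empty list here. The useful part is the 999-at-50-percent example: it forces a decision about rounding that a prose spec would probably have left to the agent’s guess.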

Actions are the scripts and tests that check the contract automatically. Unit tests. Integration tests. Property-based tests. Contract tests where systems meet. Type checks. Linters. End-to-end flows. Anything that fails fast when reality drifts from the spec. Actions are cheap to run and expensive to author well, which is the point. The team that has good actions runs them dozens of times per change. The team that doesn’t ships bugs.
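
As one illustration of an action, here is a hand-rolled property check that uses only the Python standard library. It is a sketch, not a recommendation to skip a real property-based testing library. The normalize_whitespace function and the properties chosen for it are hypothetical examples. The check generates a few hundred awkward inputs from a fixed seed and fails fast with the first input that breaks a property.

```python
import random
from typing import Callable, Optional


def normalize_whitespace(text: str) -> str:
    """Collapse every run of whitespace to one space and trim both ends."""
    return " ".join(text.split())


def random_text(rng: random.Random, max_len: int = 30) -> str:
    """Generate short strings that are heavy on awkward whitespace."""
    alphabet = "ab \t\n"
    length = rng.randint(0, max_len)
    return "".join(rng.choice(alphabet) for _ in range(length))


def check_properties(
    fn: Callable[[str], str], trials: int = 500, seed: int = 0
) -> Optional[str]:
    """Return the first violated property as a message, or None if all hold."""
    rng = random.Random(seed)  # fixed seed so a failure can be reproduced
    for _ in range(trials):
        sample = random_text(rng)
        out = fn(sample)
        if fn(out) != out:
            return f"not idempotent for input {sample!r}"
        if out != out.strip():
            return f"leading or trailing whitespace for input {sample!r}"
        if "  " in out or "\t" in out or "\n" in out:
            return f"uncollapsed whitespace for input {sample!r}"
        if out.replace(" ", "") != "".join(sample.split()):
            return f"non-whitespace characters changed for input {sample!r}"
    return None
```

The last property is the one that matters most. Without it, an implementation that returns an empty string for every input would pass the other three.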

Artifacts are the verifiable outputs a human can review at a glance: screenshots, coverage reports, benchmark numbers, sample outputs, error traces, diff visualizations. Artifacts turn “the tests passed” from a boolean into something a human can actually evaluate. This is where you catch the confidently-wrong outputs that pass the contract but violate the intent.
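
A small sketch of what turning a boolean into an artifact can look like. The CheckResult fields and the report layout are assumptions chosen for illustration, not a standard format. The idea is that each check reports a sample input and the output it actually saw, even when it passes, so a reviewer can spot a result that satisfies the check but not the intent.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class CheckResult:
    name: str
    passed: bool
    sample_input: str   # what the check fed to the system
    sample_output: str  # what actually came back
    seconds: float


def render_artifact(results: List[CheckResult]) -> str:
    """Build a plain-text report a reviewer can scan in a few seconds."""
    passed = sum(1 for r in results if r.passed)
    lines = [f"{passed}/{len(results)} checks passed"]
    for r in results:
        status = "PASS" if r.passed else "FAIL"
        lines.append(f"[{status}] {r.name} ({r.seconds:.2f}s)")
        # Show observed behavior for passing checks too, not only failures.
        lines.append(f"    input:  {r.sample_input}")
        lines.append(f"    output: {r.sample_output}")
    return "\n".join(lines)
```

In practice the same role is played by screenshots, traces, or coverage diffs. The plain-text version just shows the principle: the reviewer sees behavior, not only a verdict.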

The loop: author a contract, delegate to the agent, run actions to check, review artifacts to trust, iterate. Not weekly. Not per PR. Per change, every time, in seconds. When this loop runs fast, AI-driven development is the fastest way to ship software that has ever existed. When it doesn’t (when contracts are vague, actions are slow, or artifacts are hard to inspect), AI-driven development is the fastest way to ship bugs that has ever existed.

Making the loop fast is the highest-leverage engineering work in an AI-augmented team. It is, in one clean sentence, the SDET’s job.

AI proficiency is not vibe coding

There’s a genre of internet post that argues AI-augmented development means nobody reads the code, nobody writes tests, nobody thinks about correctness. You just “vibe code” your way through and let the model figure it out. That’s a fantasy. It’s the AI-era version of the guy who tried to convince you that TDD was optional in 2015 because “modern languages are pretty safe.”

Developers who are actually productive with AI tools (the ones shipping real software into real production, not the ones posting demo videos) are more rigorous about specs and validation than they were before. They have to be. When you delegate the implementation to an agent, the specification becomes the entire remaining source of correctness. If the spec is wrong or ambiguous, the code is worse than useless: it looks right, passes the tests you thought to write, and fails in a way nobody catches until a customer does.

The discipline of testing is more important in an AI-augmented workflow, not less. It just runs at a different altitude. You’re not staring at a for-loop wondering if the off-by-one is right. You’re staring at a contract wondering if it captures every case that matters. The judgment call moved. The discipline didn’t.

This is the framing that most of the “will AI replace testers” content misses. AI didn’t remove the judgment. It amplified it. The bottleneck used to be the typing. Now the bottleneck is the specification.

The role shift

Here’s what changes for each of the four groups.

Manual testers. The biggest opportunity in a decade for a career manual tester is right now, and it does not require becoming a full-stack developer. It requires becoming the person who authors the contracts and the artifacts. Everything you already know about failure modes, edge cases, user paths that developers don’t think about, and the specific ways a spec goes wrong under real-world use is more valuable in an AI-driven workflow, not less. The agent can write the test file. The agent cannot decide what “done” means for the user, and it definitely can’t tell you which edge case is going to bite you in production because it hides in the awkward shape of a fifteen-year-old data model. That’s you.

The reskilling path is not “learn to code.” It’s “learn to write specifications the way a lawyer writes a contract” and “learn to configure and run the tools that check them.” The first is a mindset shift. The second is a Saturday afternoon with npx playwright codegen, some YAML, and a coffee.

SDETs. Your role gets more important, not less. The tooling that makes contract-driven loops fast is your work. Fast, cheap, isolated, easy-to-run test harnesses. Coverage and mutation testing frameworks that give devs signal without ceremony. Contract-testing infrastructure between services. Fixture generation. Test-data management. CI cycle-time reduction. Debugging visualization. Every one of these is a lever that multiplies developer output when it works and grinds it to a halt when it doesn’t.

The SDET job used to be “build the automation the QA team runs.” It’s now “build the automation the development team runs, dozens of times a day, on every change.” Same job. Bigger audience. Higher leverage.

Also: SDETs who have historically owned the test strategy (what to test, at what altitude, with what fidelity) should expect that responsibility to stay firmly in their court and to grow. Nobody else has the depth of pattern-recognition for what falls through the cracks. Developers who are shipping fast need someone who has thought carefully about what “shipped correctly” means. That’s the SDET.

Developers. Whether or not your org still has a QA team, you own testing now. Not “you should own testing.” You own it. If your instinct when an agent produces code is “I’ll ask it to add a test,” you’re missing the point. The test file isn’t the test. The contract is. The generated test is one artifact that reflects the contract. If you can’t state the contract in a couple of sentences, no test file the agent generates is going to catch what you didn’t specify. Learn to write the spec first. Then let the agent implement, generate the test, and produce the artifact. Then review the artifact.

The failure mode to watch for: getting comfortable with checking “did the test pass” and not “does the test check the thing that matters.” The second question is testing. The first is theater.
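
One way to tell the two questions apart is to break the code on purpose and see whether the test notices. That is the idea behind mutation testing, shown here as a hand-rolled sketch with hypothetical names (is_adult, weak_test, strong_test) rather than a real mutation-testing tool. A test that still passes against the broken version was only ever answering the first question.

```python
from typing import Callable


def is_adult(age: int) -> bool:
    """The intended behavior: 18 counts as adult."""
    return age >= 18


def is_adult_mutant(age: int) -> bool:
    """A deliberately broken copy: the boundary is off by one."""
    return age > 18


def weak_test(fn: Callable[[int], bool]) -> bool:
    """Passes, but never touches the boundary where the bug lives."""
    return fn(30) is True and fn(5) is False


def strong_test(fn: Callable[[int], bool]) -> bool:
    """Adds the boundary cases the contract actually cares about."""
    return weak_test(fn) and fn(18) is True and fn(17) is False


def catches_mutant(
    test: Callable[[Callable[[int], bool]], bool],
    original: Callable[[int], bool],
    mutant: Callable[[int], bool],
) -> bool:
    """A test earns trust if it passes on the original and fails on the mutant."""
    return test(original) and not test(mutant)
```

Here weak_test is green against both versions, so it catches nothing. Only strong_test fails on the mutant, which is the evidence that it checks the thing that matters.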

Engineering leaders. Staffing decisions for QA and SDET in an AI-augmented org look different than they did five years ago. You probably don’t need as many manual testers, because most of the manual regression work is being replaced by contract-driven loops. You need more SDET-equivalent engineers, because the leverage of good tooling is now enormous. And the manual testers you keep should be repositioned as the contract-and-artifact specialists: the people who define what “done” means and design the checks that verify it. Don’t fire them. Retrain them. The alternative is losing institutional knowledge about your own product’s failure modes at exactly the moment when that knowledge became more valuable than it has ever been.

What good tooling looks like

The tooling debate keeps coming back to the same four criteria. Any check the team runs frequently should be:

Easy to configure. If setting up the test environment takes longer than writing the test, the test doesn’t get written. Config-as-code, sensible defaults, low ceremony.
Easy to run locally. The developer changing the code should be able to run the check on their machine, in their language, in seconds. Every extra step between “I changed something” and “I know if it worked” is a step the developer will skip.
Easy to integrate in CI. The same command that runs locally runs in CI. No parallel infrastructure. No CI-only flakiness. No “works on my machine.”
Produces useful results in under 30 seconds. A red X and a stack trace nobody wants to read is not a useful result. A summary a human can act on (what failed, why it failed, how to reproduce it, and where in the change to look) is a useful result.
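
The fourth criterion is the easiest to sketch in code. The wrapper below is a hypothetical example, not a feature of any particular test runner: run_check, the Failure fields, and the file path in the usage example are all invented for illustration. It times a check against the 30-second budget and, on failure, prints the four things a human needs in order to act.

```python
import time
from dataclasses import dataclass
from typing import Callable, Optional

BUDGET_SECONDS = 30.0  # the "under 30 seconds" criterion from the list above


@dataclass
class Failure:
    what: str       # what failed
    why: str        # why it failed
    reproduce: str  # how to reproduce it
    where: str      # where in the change to look


def run_check(
    name: str,
    check: Callable[[], Optional[Failure]],
    budget: float = BUDGET_SECONDS,
) -> str:
    """Run one check and return a summary a human can act on."""
    start = time.perf_counter()
    failure = check()  # None means the check passed
    elapsed = time.perf_counter() - start

    status = "FAIL" if failure else "PASS"
    lines = [f"{name}: {status} in {elapsed:.2f}s"]
    if elapsed > budget:
        lines.append(f"  over budget: took longer than {budget:.0f}s")
    if failure:
        lines.append(f"  what:      {failure.what}")
        lines.append(f"  why:       {failure.why}")
        lines.append(f"  reproduce: {failure.reproduce}")
        lines.append(f"  where:     {failure.where}")
    return "\n".join(lines)


def failing_check() -> Optional[Failure]:
    """Hypothetical usage: a check that reports a structured failure."""
    return Failure(
        what="checkout total is wrong for a 50% discount on 999 cents",
        why="expected 500, got 499",
        reproduce="pytest tests/test_checkout.py::test_total",  # made-up path
        where="apply_discount in the pricing module",
    )
```

Calling run_check("checkout total", failing_check) returns a status line followed by the four labeled fields. A passing check produces a single line, which keeps the common case quiet.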

The tools that have won in the last three years all hit these criteria. Playwright is winning over Selenium partly because it’s faster and partly because its trace viewer is the best artifact producer in the category. Codecov’s per-diff coverage report is a better artifact than a raw coverage percentage. GitHub Actions won over Jenkins because it took less ceremony to set up. The pattern repeats everywhere.

If you’re evaluating a piece of test tooling and it fails any of the four criteria, don’t buy it. Build the wrapper. Or wait for the replacement. Every year the barrier to building good tooling drops. The teams that build it themselves are quietly the ones shipping fastest.

For coverage specifically, the Test Coverage Calculator and the companion Test Coverage Explained piece go deeper on how to use coverage as a diagnostic instead of a target. For defect density, same pattern with the Defect Density Calculator and the Defect Density explainer.

Bottom line

Testing isn’t dying. It’s coming home. The people who write the code, with or without an agent’s help, should be closer to the tests than they’ve ever been. The specialists who scaled testing for the old world should now build the tooling that makes distributed testing sustainable. And the specialists who owned test strategy should be leading the conversation about what “correctness” means in a world where the agent can write both the code and the test.

The next few pieces in this cluster go deeper on specific angles. Will AI Replace QA Testers? takes the tester’s view of the question and answers it honestly. A follow-up piece will dig into what an SDET actually does in 2026 and why the role is more valuable, not less. A third will be a practical starting point for developers who haven’t tested seriously before and need a reasonable place to begin. Each stands alone; together they map out the contract-driven loop from the perspective of everyone doing the work.

The right response to AI in testing isn’t fear, and it isn’t optimism. It’s discipline about the specification, investment in the tooling, and honesty about what you’re actually checking.

That’s it. That’s the whole change.