Test Coverage Explained: What 80% Actually Means (And When It Doesn't)

Every engineering interview eventually gets to it. “What test coverage do you target?” The answer someone wants to hear is 80%. The honest answer is that the number matters less than what the number measures, and the tests behind an 80% figure can be worthless or excellent.

Here’s what coverage actually measures, where 80% came from, and how to use it as a diagnostic rather than a career-defining gate.

What coverage actually measures

“Test coverage” is not one metric. It’s four, listed here from cheapest and weakest to most expensive and strongest.

Type	What it checks	Signal strength
Function coverage	Was each function called at least once?	Weak; a smoke test passes it
Line coverage	Was each line of code executed?	Cheap and easy to game
Branch coverage	Was each `if`, `else`, and `switch` path taken?	Meaningfully stronger than line
MC/DC (modified condition/decision coverage)	Did each condition in a decision independently affect the outcome?	Strongest; required or recommended by aviation and automotive safety standards

When someone says “we have 80% coverage,” ask which of these they mean. The vast majority mean line coverage, which sits near the weak end of the list and is the easiest to inflate. Try the free Test Coverage Calculator with your own numbers to see how the three easy tiers move independently.
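To make the gap between line and branch coverage concrete, here is a minimal sketch. The function and test names are hypothetical and chosen only for illustration. The first test alone executes every line of the function, yet one of the two outcomes of the `if` is never taken.

```python
def price_after_discount(price_cents: int, is_member: bool) -> int:
    """Return the price in cents, with 10% off for members (hypothetical example)."""
    total = price_cents
    if is_member:  # two outcomes: condition true, condition false
        total = price_cents - price_cents // 10
    return total


def test_member_gets_discount() -> None:
    # Executes every line of price_after_discount, so line coverage is complete.
    # The "is_member is False" outcome is never taken, so a branch report
    # would still show a gap.
    assert price_after_discount(1000, True) == 900


def test_non_member_pays_full_price() -> None:
    # Takes the remaining outcome of the `if`.
    assert price_after_discount(1000, False) == 1000
```

With only the first test, a line-coverage report looks finished. The second test adds no new lines to the report; it only shows up under branch coverage.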

Where 80% came from

Nowhere authoritative. The most-cited public benchmarks come from Google’s engineering practices, which describe three tiers rather than a single number:

Coverage	Google’s label
60%	Acceptable
75%	Commendable
90%	Exemplary

Notice 80% isn’t on the list. It’s a round number between two of Google’s tiers, and it caught on because it looks like a Goldilocks target: high enough to be defensible in a code review, low enough to be achievable in most codebases. The same Google Testing Blog post also notes that per-commit coverage goals of 99% are reasonable, which is a much more useful target than any aggregate number.

Martin Fowler summarizes the real problem in one sentence:

“Test coverage is a useful tool for finding untested parts of a codebase. Test coverage is of little use as a numeric statement of how good your tests are.”

That distinction between diagnostic and target is the whole game. Coverage tells you where to look. It does not tell you whether what’s there is good.

The two failure modes

Teams fail with coverage in opposite directions.

Chasing the number. A team is told to hit 80%. Engineers add tests that call functions without asserting their behavior. Coverage climbs; escape rate does not. The tests are technically present. They catch nothing. Every code review now includes a “please add a coverage test” comment that produces more of the same.

Ignoring the number. A team disables coverage reporting because “it doesn’t measure quality.” Six months later, an incident hits a rarely-taken branch that no test has ever executed. Turns out coverage would have flagged the gap, even if it couldn’t have vouched for the tests themselves.

The way out of both is to use coverage the way Fowler describes: as a report on where you have no signal, not as a stamp on where you do.

How to use coverage well

Three shifts in how you look at the number make it useful again.

1. Measure changed code, not the whole codebase

Aggregate coverage is a slow-moving lagging indicator. Per-commit or per-PR coverage (does this change include tests for the new lines?) is a leading indicator you can act on. Codecov, Coveralls, and SonarQube all support per-diff gates. Set the threshold high (90% or 95%) on new code, and leave legacy code alone.

This is how you make coverage rise over time without a mandate to backfill everything.
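The arithmetic behind a per-diff gate is simple enough to sketch. This is not how Codecov, Coveralls, or SonarQube implement it; the function names, file paths, and the 90% default below are assumptions for illustration.

```python
from typing import Dict, Set


def patch_coverage(changed: Dict[str, Set[int]],
                   executed: Dict[str, Set[int]]) -> float:
    """Fraction of changed lines that the test run executed.

    Both arguments map a file path to a set of line numbers.
    """
    total = sum(len(lines) for lines in changed.values())
    if total == 0:
        return 1.0  # nothing changed; treat the gate as satisfied
    covered = sum(len(lines & executed.get(path, set()))
                  for path, lines in changed.items())
    return covered / total


def passes_gate(changed: Dict[str, Set[int]],
                executed: Dict[str, Set[int]],
                threshold: float = 0.90) -> bool:
    """True if the change meets the (hypothetical) per-diff threshold."""
    return patch_coverage(changed, executed) >= threshold


# Hypothetical pull request: five changed lines, four of them executed.
changed = {"billing/refund.py": {10, 11, 12, 13}, "web/banner.py": {5}}
executed = {"billing/refund.py": {1, 2, 10, 11, 12}, "web/banner.py": {5}}
```

Here `patch_coverage(changed, executed)` is 4 out of 5, so the change fails a 90% gate regardless of what the aggregate number for the repository is.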

2. Weight by criticality

Not all uncovered code is equal. A one-line helper with no branches is a rounding error. An uncovered `if` inside a payment handler is a Sev-1 waiting to happen. Configure your coverage tool to flag uncovered branches in critical paths (auth, payments, data mutation) at a higher priority than uncovered lines elsewhere. Most coverage tools support per-directory thresholds. Use them.
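As a rough illustration of per-directory thresholds, the sketch below checks per-file branch coverage against a minimum that depends on the path. The directory names, thresholds, and function names are all hypothetical, and real tools express this in their own configuration formats rather than in code like this.

```python
from typing import Dict, List, Tuple

# Hypothetical thresholds, most specific prefix first.
THRESHOLDS: List[Tuple[str, float]] = [
    ("src/payments/", 0.95),
    ("src/auth/", 0.95),
    ("src/", 0.70),
]


def required_branch_coverage(path: str) -> float:
    """Return the minimum branch coverage for a file, by first matching prefix."""
    for prefix, minimum in THRESHOLDS:
        if path.startswith(prefix):
            return minimum
    return 0.0  # files outside the listed directories are not gated


def violations(branch_coverage_by_file: Dict[str, float]) -> List[str]:
    """Return the files whose branch coverage is below their required minimum."""
    return sorted(
        path
        for path, coverage in branch_coverage_by_file.items()
        if coverage < required_branch_coverage(path)
    )


# Hypothetical report: the payment handler is the only file below its bar.
report = {
    "src/payments/charge.py": 0.88,
    "src/marketing/banner.py": 0.72,
    "src/auth/login.py": 0.97,
}
```

With this report, `violations(report)` flags only `src/payments/charge.py`, even though its 88% is higher than the marketing file's 72%.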

3. Pair coverage with mutation testing

Mutation testing is the honest version of coverage. It makes small, syntactically valid changes to your code (a `>` becomes a `>=`, an `&&` becomes an `||`), reruns your tests, and checks whether any test fails. If nothing fails, the test suite doesn’t actually verify that line. It just executes it.
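A hand-rolled sketch of the idea, assuming a hypothetical `is_over_limit` function. Real tools generate and run mutants automatically; here the mutant is written out by hand, and each "suite" is a function that returns whether all of its checks pass.

```python
from typing import Callable

Check = Callable[[int, int], bool]


def is_over_limit(amount: int, limit: int) -> bool:
    """Original code: strictly greater than the limit."""
    return amount > limit


def is_over_limit_mutant(amount: int, limit: int) -> bool:
    """The mutant: `>` replaced by `>=`."""
    return amount >= limit


def weak_suite(fn: Check) -> bool:
    """Covers every line of fn but never checks the boundary."""
    return fn(150, 100) is True and fn(50, 100) is False


def strong_suite(fn: Check) -> bool:
    """Adds the boundary case, where `>` and `>=` disagree."""
    return weak_suite(fn) and fn(100, 100) is False


# The weak suite passes on both versions, so the mutant survives.
mutant_survives_weak = weak_suite(is_over_limit) and weak_suite(is_over_limit_mutant)
# The strong suite passes on the original and fails on the mutant, so it is killed.
mutant_killed_by_strong = strong_suite(is_over_limit) and not strong_suite(is_over_limit_mutant)
```

Both suites give `is_over_limit` identical line and branch coverage. Only the mutant tells them apart.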

Real mutation-testing tools by ecosystem:

JavaScript / TypeScript: Stryker Mutator
Java: PIT (PITest)
Python: mutmut, cosmic-ray
Go: go-mutesting

Mutation testing is slow. Running it in CI on every commit is impractical. The right cadence is weekly, on the parts of the codebase whose failure would hurt most.

When 90%+ is a requirement, not a target

Consumer software teams get to argue about the right number. Safety-critical teams don’t. Two standards worth knowing about, even if you never touch that code:

DO-178C (avionics). Level A software, the kind whose failure prevents continued safe flight, requires MC/DC coverage. Not line coverage. Not branch coverage. Each condition in each decision must be demonstrated to independently affect the outcome.
ISO 26262 (automotive). For ASIL D, the highest automotive safety integrity level, which applies to systems like brake-by-wire, the standard highly recommends MC/DC as the structural coverage metric at the unit level.

Both standards exist because in these domains “our tests passed” is not the same as “the software is safe.” The rigor of MC/DC is the difference. If you are ever tempted to argue that 100% coverage is impossible, the counter-argument is that entire industries do it as table stakes.
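To show what “independently affect the outcome” means in practice, here is a minimal sketch that checks a set of test vectors against one hypothetical decision, `a and (b or c)`. It checks only the pairwise independence requirement in its simplest (unique-cause) form and is nowhere near a qualified tool for either standard; the function names are assumptions.

```python
from itertools import combinations
from typing import Callable, List, Sequence, Tuple

Vector = Tuple[bool, ...]


def decision(a: bool, b: bool, c: bool) -> bool:
    """Hypothetical decision under test."""
    return a and (b or c)


def conditions_shown_independent(fn: Callable[..., bool],
                                 vectors: Sequence[Vector]) -> List[bool]:
    """For each condition, report whether some pair of vectors differs
    only in that condition and produces different outcomes."""
    if not vectors:
        return []
    n = len(vectors[0])
    shown = [False] * n
    for v, w in combinations(vectors, 2):
        differing = [i for i in range(n) if v[i] != w[i]]
        if len(differing) == 1 and fn(*v) != fn(*w):
            shown[differing[0]] = True
    return shown


# Two vectors are enough to take both outcomes of the decision...
branch_only = [(True, True, True), (False, False, False)]

# ...but demonstrating each condition independently needs more.
mcdc_set = [
    (True, True, False),   # outcome True
    (False, True, False),  # differs only in a, outcome False
    (True, False, False),  # differs only in b, outcome False
    (True, False, True),   # differs from the previous vector only in c, outcome True
]
```

The `branch_only` set takes both outcomes of the decision, yet `conditions_shown_independent` reports no condition as demonstrated. The four-vector `mcdc_set` demonstrates all three.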

For the rest of us, the takeaway is more modest: the right coverage target depends on the cost of a defect that ships. Auth code, payment code, and destructive data operations belong closer to the safety-critical end of the spectrum. Marketing pages do not.

The tools

A minimal coverage stack for a modern web codebase looks like this:

Coverage generator. c8 or istanbul (JavaScript/TypeScript), coverage.py (Python), JaCoCo (Java), Cobertura (older Java/.NET). Runs alongside your test runner.
CI enforcement. Codecov or Coveralls for per-PR reporting and gates. SonarQube if you also want static analysis and want to self-host.
Mutation testing. Stryker or the ecosystem equivalent, run on a schedule rather than per-commit.
Cross-browser and cross-environment execution. Coverage generated on your CI’s Chromium instance doesn’t cover Safari-only regressions. Platforms like LambdaTest and BrowserStack let you rerun the same test suite across the browser and OS matrix that actually matches your users. Coverage becomes trustworthy only when the covered code has run in the environments you ship to.

None of this stack is unique or novel. The point is that coverage is one input among four. A team using only one of these (most commonly, the coverage generator with no per-diff gate, no mutation testing, and no cross-browser reruns) has a coverage number and not much else.

Bottom line

Use coverage as a guardrail on new code, not a target for legacy code. Set a high per-diff gate; leave the aggregate number alone. Weight branch coverage on critical paths above line coverage on the rest of the codebase. Complement with mutation testing on a weekly cadence to check that the tests are actually verifying, not just executing. If the code you’re testing has to run in an environment your CI doesn’t cover, pair it with a cross-browser platform so the coverage number reflects the reality your users face.

The 80% target answers the wrong question. Which lines are untested, and do the tested lines actually verify anything? That’s the question worth asking. Coverage is one of the two tools that answer it. Mutation testing is the other.

Try the free Test Coverage Calculator to see where a given number lands against the Google tiers, and check out the Defect Density Calculator for a companion view of quality that’s grounded in shipped defects rather than executed lines.

References

Google Testing Blog. Code Coverage Best Practices (2020)
Martin Fowler. Test Coverage
Wikipedia. Modified Condition/Decision Coverage (MC/DC)
Stryker Mutator. stryker-mutator.io