How a GitHub Actions Cache Hit Skipped Our Tests and Shipped a Regression to 12,000 Users
The CI pipeline showed green. The deployment completed successfully. The Slack deploy notification fired at 2:47 PM on a Wednesday with the usual thumbs-up emoji. By 8:30 PM, our support queue had 340 tickets reporting a broken checkout flow. The feature worked in local development. It worked in staging. And technically, CI never failed — it just never tested the code that broke production.
Production Failure: Six Hours of Broken Checkout
The incident started quietly. A few support tickets arrived around 3:30 PM: users applying discount codes at checkout were seeing a blank error screen instead of the order confirmation page. The engineering team was already deep into another sprint, and initial triage assumed it was an edge-case input issue.
By 6 PM, the ticket volume had spiked. We pulled the error logs: every checkout attempt with a non-null discount_code field was throwing an uncaught TypeError: Cannot read properties of undefined (reading 'percentOff'). The stack trace pointed directly at new coupon-validation logic shipped in the 2:47 PM deploy.
The rollback was straightforward. The postmortem question was harder: how did six CI runs pass with zero test failures while shipping code with an obvious undefined-access bug that any unit test would have caught?
False Assumptions: We Trusted Green Without Reading It
The first assumption was the most damaging: green CI means tests ran. We had 14 test suites covering the checkout flow. We had written three new test files specifically for the discount-code feature. The pipeline output said "passed." Nobody checked what that meant in numerical terms.
The second assumption was subtler: a cache hit means the same work happened faster. Caching dependencies is unambiguously correct — restoring 800 MB of node_modules instead of re-downloading it saves 90 seconds per run. But we had extended the same cache to cover compiled test artifacts, and that assumption was wrong.
The third assumption: a 4-second test step means fast tests, not absent tests. Our test suite typically ran in 47 seconds. On the day of the incident, the test step completed in 4 seconds across all six affected deploys. Nobody flagged it because fast CI is generally celebrated, not interrogated.
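A timing check of this kind can be sketched in a few lines. This is a minimal illustration, not something the pipeline described here ran: `isSuspiciouslyFast`, `median`, and the 0.5 ratio are hypothetical names and values, and the baseline durations other than 47 seconds are invented for the example.

```typescript
// Hypothetical helper: flags a test step that finished far faster than its recent baseline.
// The default ratio of 0.5 is an assumed threshold, not a value taken from the incident.
function median(values: number[]): number {
  if (values.length === 0) {
    throw new Error("need at least one baseline duration");
  }
  const sorted = [...values].sort((a, b) => a - b);
  const mid = Math.floor(sorted.length / 2);
  return sorted.length % 2 === 0
    ? (sorted[mid - 1] + sorted[mid]) / 2
    : sorted[mid];
}

function isSuspiciouslyFast(
  durationSec: number,
  recentDurationsSec: number[],
  minRatio = 0.5,
): boolean {
  // A run counts as suspicious when it takes less than minRatio of the median baseline.
  return durationSec < median(recentDurationsSec) * minRatio;
}

// 47 s is the typical duration quoted above; 46 and 48 are invented neighbours.
const baseline = [47, 46, 48];
console.log(isSuspiciouslyFast(4, baseline)); // true
console.log(isSuspiciouslyFast(45, baseline)); // false
```

A median is used rather than a mean so that one earlier anomalous run does not drag the baseline down with it.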
Investigation: A Test Step That Ran in 4 Seconds
The postmortem started by diffing CI run logs: a passing run from two weeks earlier against the incident-day runs.
# TWO WEEKS AGO (correct run)
Run jest --coverage --ci
PASS src/checkout/__tests__/cart.test.ts
PASS src/checkout/__tests__/pricing.test.ts
PASS src/checkout/__tests__/discount.test.ts <-- new file, compiled fresh
PASS src/checkout/__tests__/coupon-validator.test.ts <-- new file
... (14 suites total)
Test Suites: 14 passed, 14 total
Tests: 203 passed, 203 total
Time: 47.3s
# INCIDENT DAY (cache hit run)
Run jest --coverage --ci
PASS src/checkout/__tests__/cart.test.ts
Test Suites: 1 passed, 1 total
Tests: 18 passed, 18 total
Time: 3.9s
Exit code: 0
One test suite. Eighteen tests. Exit code zero. The cache had restored a compiled test bundle from a run 11 days earlier — before the discount-code branch was merged. Jest found no new .test.ts files outside the cached artifact set, ran only the cached bundle, and reported success on 18 tests instead of 203.
Pulling the cache restore log from the Actions run confirmed it:
Run actions/cache@v3
with:
path: .jest-cache
key: node-test-v1-${{ hashFiles('**/package-lock.json') }}
Cache restored from key: node-test-v1-a3f9d2e8c1b7...
Created: 2026-03-04T06:22:11Z (11 days ago)
Size: 142 MB
The cache key was hashFiles('**/package-lock.json'). The discount-code feature added new test files but added no new npm dependencies — so package-lock.json did not change, the hash matched, the 11-day-old cache was restored, and the new test files were invisible to the cached Jest runner.
Root Cause: Cache Key Scope Too Narrow
The broken CI pipeline had a single cache entry covering both node_modules (correct to cache by lockfile) and .jest-cache (incorrect — Jest's compiled transform cache, which must invalidate when source files change):
BROKEN: Single Cache Key for node_modules + Jest Transform Cache
══════════════════════════════════════════════════════════════════════
package-lock.json hash ──────────────────────┐
v
┌─────────────────────┐
│ Cache Key: v1-a3f9 │
└─────────────────────┘
│ │
┌───────────┘ └──────────┐
v v
node_modules/ (800 MB) .jest-cache/ (142 MB)
[correct: lockfile-bound] [WRONG: stale transforms]
New test files added ──> package-lock.json UNCHANGED
──> cache key UNCHANGED
──> .jest-cache restored from 11 days ago
──> new .test.ts files NOT in cache
──> Jest runs only cached suite (18 tests)
──> exits 0 ✓ (lies)
══════════════════════════════════════════════════════════════════════
FIXED: Separate Cache Keys with Correct Scope
══════════════════════════════════════════════════════════════════════
Cache 1: node_modules
Key: node-modules-v2-${{ hashFiles('**/package-lock.json') }}
Invalidates when: dependencies change (correct behavior)
Cache 2: Jest transform cache
Key: jest-cache-v2-${{ hashFiles('**/package-lock.json') }}-${{ hashFiles('src/**/*.ts', 'src/**/*.tsx') }}
Invalidates when: deps OR source files change (correct behavior)
Gate: Test count assertion
if [ "$TEST_SUITES" -lt 14 ]; then
echo "ERROR: expected ≥14 test suites, got $TEST_SUITES"
exit 1
fi
══════════════════════════════════════════════════════════════════════
The node_modules cache should be bound to the lockfile — that's standard and correct. But the Jest transform cache stores compiled TypeScript and JSX artifacts for each source file. When new source files are added, the transform cache must be invalidated. Binding it to the lockfile hash meant it only invalidated when dependencies changed, not when application code or test files changed.
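The difference between the two key scopes can be shown with a small simulation. This is a sketch under stated assumptions: `fnv1a` is a stand-in for GitHub's `hashFiles()` (which uses SHA-256), and `hashMatching`, `lockfileOnlyKey`, `scopedKey`, and the file contents are hypothetical. Only the property "the key changes when a covered file changes" is being illustrated.

```typescript
// Stand-in for hashFiles(): a small FNV-1a hash over path + content.
// GitHub's real hashFiles() uses SHA-256; this is only an illustration.
type Files = Record<string, string>;

function fnv1a(text: string): string {
  let hash = 0x811c9dc5;
  for (let i = 0; i < text.length; i++) {
    hash ^= text.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0;
  }
  return hash.toString(16).padStart(8, "0");
}

// Hash only the files selected by `include`, in a stable (sorted) order.
function hashMatching(files: Files, include: (path: string) => boolean): string {
  const parts = Object.keys(files)
    .filter(include)
    .sort()
    .map((path) => `${path}\n${files[path]}`);
  return fnv1a(parts.join("\n"));
}

const isLockfile = (p: string) => p.endsWith("package-lock.json");
const isSource = (p: string) => p.startsWith("src/") && /\.tsx?$/.test(p);

// Broken scope: the key depends on the lockfile alone.
const lockfileOnlyKey = (files: Files) =>
  `node-test-v1-${hashMatching(files, isLockfile)}`;
// Fixed scope: the key depends on the lockfile and every TypeScript file under src/.
const scopedKey = (files: Files) =>
  `jest-cache-v2-${hashMatching(files, (p) => isLockfile(p) || isSource(p))}`;

const before: Files = {
  "package-lock.json": "{ lockfile contents }",
  "src/checkout/__tests__/cart.test.ts": "// existing tests",
};
const after: Files = {
  ...before,
  "src/checkout/__tests__/coupon-validator.test.ts": "// new tests",
};

console.log(lockfileOnlyKey(before) === lockfileOnlyKey(after)); // true: the stale cache is restored
console.log(scopedKey(before) === scopedKey(after)); // false: the cache is invalidated
```

Adding a test file leaves the lockfile-only key untouched, which is the failure described above, while the scoped key changes as soon as any file under `src/` does.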
Architecture Fix: Separate Caches, Correct Keys, and a Count Gate
The fix addressed three distinct problems: wrong cache scope, missing source-file invalidation, and no floor on the test count. Here is the corrected workflow:
jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Cache node_modules
        id: node-modules-cache
        uses: actions/cache@v4
        with:
          path: node_modules
          key: node-modules-v2-${{ hashFiles('**/package-lock.json') }}
          # No restore-keys here: npm ci deletes node_modules before installing,
          # so a partial restore would be downloaded and then thrown away.

      - name: Cache Jest transforms
        uses: actions/cache@v4
        with:
          path: .jest-cache
          # Invalidate when deps OR any source/test file changes.
          # Two separate hashes, so the lockfile-only prefix below can actually match.
          key: jest-cache-v2-${{ hashFiles('**/package-lock.json') }}-${{ hashFiles('src/**/*.ts', 'src/**/*.tsx') }}
          restore-keys: |
            jest-cache-v2-${{ hashFiles('**/package-lock.json') }}-
            jest-cache-v2-

      - name: Install dependencies
        # Skip on an exact cache hit; npm ci would wipe the restored node_modules.
        if: steps.node-modules-cache.outputs.cache-hit != 'true'
        run: npm ci --prefer-offline

      - name: Run tests
        run: |
          # Without pipefail the step reports tee's exit status, and a failing Jest run looks green.
          set -o pipefail
          npx jest --coverage --ci --cacheDirectory=.jest-cache --json --outputFile=jest-results.json 2>&1 | tee jest-output.txt

      - name: Assert minimum test suite count
        run: |
          SUITES=$(jq '.numPassedTestSuites' jest-results.json)
          TESTS=$(jq '.numPassedTests' jest-results.json)
          echo "Test suites passed: $SUITES"
          echo "Tests passed: $TESTS"
          if [ "$SUITES" -lt 14 ]; then
            echo "::error::Expected ≥14 test suites, got $SUITES. Cache may be stale or tests deleted."
            exit 1
          fi
          if [ "$TESTS" -lt 180 ]; then
            echo "::error::Expected ≥180 tests, got $TESTS. Count regressed."
            exit 1
          fi
Three changes matter here:
1. Separate cache entries for separate concerns. node_modules is correctly keyed to the lockfile. The Jest transform cache is keyed to both the lockfile and a glob of all TypeScript source and test files. Adding a new .test.ts file now changes the hash, busts the Jest cache, forces fresh compilation, and ensures the new file is found and executed.
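2. restore-keys as a fallback. When the exact key misses (for example, on the first run after any source file changes), restore-keys falls back to the most recent cache whose key starts with one of the listed prefixes, pre-warming the transform cache with recently compiled artifacts. Jest then recompiles only the changed files on top of this partial restore, which is faster than a cold start. Because a partial restore is by definition stale, the count gate that follows the test step is what guarantees the run is still complete.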
3. An explicit test-count floor. The count gate fails the pipeline if fewer than 14 test suites or 180 tests pass. This gate catches both cache-skipping regressions and accidental test deletion. We update the threshold in a single commit whenever we add a new test suite — a lightweight but remarkably effective tripwire.
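The count gate can also be written as a small function, which may be easier to unit test than inline shell. This is a sketch, not the pipeline's actual code: `checkFloor`, `JestSummary`, and `Floor` are hypothetical names. The two fields read are the same ones the `jq` step reads from `jest-results.json`, and 14 and 180 are the thresholds quoted above.

```typescript
// Subset of the report written by `jest --json`; only the fields used here are declared.
interface JestSummary {
  numPassedTestSuites: number;
  numPassedTests: number;
}

// Hypothetical floor configuration.
interface Floor {
  minSuites: number;
  minTests: number;
}

// Returns one message per violated floor; an empty array means the gate passes.
function checkFloor(summary: JestSummary, floor: Floor): string[] {
  const problems: string[] = [];
  if (summary.numPassedTestSuites < floor.minSuites) {
    problems.push(
      `expected >= ${floor.minSuites} test suites, got ${summary.numPassedTestSuites}`,
    );
  }
  if (summary.numPassedTests < floor.minTests) {
    problems.push(
      `expected >= ${floor.minTests} tests, got ${summary.numPassedTests}`,
    );
  }
  return problems;
}

const floor: Floor = { minSuites: 14, minTests: 180 };

// Counts from the healthy run and the incident-day run shown earlier.
const healthy: JestSummary = { numPassedTestSuites: 14, numPassedTests: 203 };
const incident: JestSummary = { numPassedTestSuites: 1, numPassedTests: 18 };

console.log(checkFloor(healthy, floor).length); // 0
console.log(checkFloor(incident, floor).join("; "));
// expected >= 14 test suites, got 1; expected >= 180 tests, got 18
```

Returning messages instead of exiting directly leaves it to the caller to decide how to fail the step, for instance by printing each message and setting a non-zero exit code.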
Why the 11-Day-Old Cache Survived So Long
In the 11 days between the cache creation and the incident, the team shipped seven separate features. None touched npm dependencies directly, so the lockfile hash never changed. Each deploy restored the same stale .jest-cache, compiled only the changed source files into memory (but not into the cached artifacts), and ran only the tests that were in the restored cache bundle. Each restore also counted as an access, and GitHub only evicts cache entries that have gone unused for more than seven days, so the entry never aged out on its own. The regression was effectively undetectable because the tests that would have caught it simply did not exist in the environment where CI ran.
The problem compounded because the Jest transform cache is an optimization artifact, not a persistent test registry. Jest does not error when a file is absent from the cache — it treats absence as "nothing to run here," not as "this file is new and must be compiled." That behavior is correct for performance, but catastrophic when paired with a stale cache covering the wrong scope.
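Since the runner will not complain about files it never saw, one complementary check is to compare the test files present in the checkout against the files the report says were executed. The following is a sketch under assumptions: `findUnexecuted` is a hypothetical helper, the list of files on disk is assumed to come from a separate glob over `src/`, and `/home/runner/work/app` is an invented root directory. The `testResults[].name` field is the per-file path in Jest's `--json` report.

```typescript
// Per-file entry from the `testResults` array of a `jest --json` report (subset).
interface JestFileResult {
  name: string; // absolute path of the executed test file
  status: string;
}

// Returns test files that exist on disk but never appear in the report.
// `testFilesOnDisk` holds paths relative to `rootDir`.
function findUnexecuted(
  testFilesOnDisk: string[],
  executed: JestFileResult[],
  rootDir: string,
): string[] {
  const prefix = rootDir.endsWith("/") ? rootDir : `${rootDir}/`;
  const ran = new Set(
    executed.map((result) =>
      result.name.startsWith(prefix)
        ? result.name.slice(prefix.length)
        : result.name,
    ),
  );
  return testFilesOnDisk.filter((file) => !ran.has(file)).sort();
}

// Hypothetical checkout root; the file names are the ones from the logs above.
const rootDir = "/home/runner/work/app";
const onDisk = [
  "src/checkout/__tests__/cart.test.ts",
  "src/checkout/__tests__/discount.test.ts",
  "src/checkout/__tests__/coupon-validator.test.ts",
];
const report: JestFileResult[] = [
  { name: `${rootDir}/src/checkout/__tests__/cart.test.ts`, status: "passed" },
];

console.log(findUnexecuted(onDisk, report, rootDir).join("\n"));
// src/checkout/__tests__/coupon-validator.test.ts
// src/checkout/__tests__/discount.test.ts
```

Unlike a fixed threshold, this comparison needs no manual update when a suite is added, at the cost of maintaining a glob that matches the project's test-file naming.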
Lessons Learned
- Cache key scope must match the artifact's actual dependencies. node_modules depends on the lockfile. The Jest transform cache depends on the lockfile and source files. Merging them under a single lockfile-only key silently breaks test discovery whenever new files are added. Always audit what each cached artifact actually depends on before writing the key.
- Add a test-count floor to every pipeline. Exit code 0 from a test runner means "all tests that ran passed," not "all tests that should have run did run." A minimum-count assertion turns silent omissions into hard failures. This is the single highest-leverage safeguard we've added, and it has caught two regressions since.
- Fast CI steps deserve scrutiny, not celebration. A test step that finishes in 4 seconds when it normally takes 47 is a signal that work was skipped, not that the suite got faster. Build timing baselines and alert on anomalous drops, not just on failures.
- Separate caches for separate concerns. node_modules, compiled test artifacts, build output, and coverage reports all have different invalidation requirements. Bundling them under one key is convenient but produces incorrect behavior whenever any one of their dependency sets diverges from the others.
- Validate CI output structure, not just exit code. We now parse jest-results.json explicitly and assert on suite count, test count, and coverage thresholds. An exit code is a binary signal. Structured output is a rich one. Use the richer signal.
The six-hour regression cost us 340 support tickets, a postmortem, and a lost afternoon of engineering time. What it gave us was a fundamentally more honest CI pipeline. Green now means something specific: at least 14 test suites ran, at least 180 tests passed, and the cache that served them was invalidated by any change to the files those tests cover. That's a guarantee worth having.