We Set temperature=0 and GPT-4 Still Gave Different Answers — Our Entire CI Pipeline Broke
The CI run passed at 9:14 AM. The identical commit, re-run 40 minutes later, failed.
No code had changed. No dependencies had updated. The diff was empty. But our
automated code review step — powered by GPT-4 with temperature=0 —
had switched from "APPROVED" to "CHANGES_REQUESTED" between runs. We'd spent three
weeks building a pipeline on the assumption that temperature=0 meant
deterministic output. It doesn't. It never did. We'd just been lucky.
What We Built: An LLM-Powered Code Review Gate
Our code review pipeline used GPT-4 to enforce a specific internal standard — we called it our "API contract checklist." Every PR that touched our public API surface ran a GitHub Actions job that sent each changed route's controller code to GPT-4 and asked it to verify 12 specific requirements: error response shape, pagination format, auth header handling, rate limit headers, and so on.
The output was a structured JSON verdict: each of the 12 checks either PASS or FAIL, with a reason for each failure. A PR couldn't merge if any check failed. It had been running for six weeks and had caught 34 real issues that human reviewers missed. We were proud of it.
THE PIPELINE
─────────────────────────────────────────────────────────────
PR opened → GitHub Actions triggered
│
▼
Changed API controllers extracted (git diff)
│
▼
For each controller:
GPT-4 (temperature=0, gpt-4-turbo-preview)
+ "Here are the 12 API contract requirements"
+ "Here is the controller code"
+ "Return JSON: { check_id, result: PASS|FAIL, reason }"
│
▼
Results aggregated → PR status check set
(All PASS → green, any FAIL → red)
Assumption: temperature=0 = deterministic = testable
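The per-controller call in the diagram can be sketched as below. The names (`buildReviewMessages`, `parseVerdict`) and exact prompt wording are illustrative, not our production code; the request assembly and verdict parsing are pure functions, so they can be tested without touching the API:

```typescript
// Hypothetical sketch of the per-controller review call.
type CheckResult = { check_id: string; result: 'PASS' | 'FAIL'; reason?: string };

function buildReviewMessages(requirements: string, controllerCode: string) {
  return [
    { role: 'system' as const, content: 'You are an API contract reviewer. Respond with JSON only.' },
    {
      role: 'user' as const,
      content: [
        'Here are the 12 API contract requirements:',
        requirements,
        'Here is the controller code:',
        controllerCode,
        'Return JSON: an array of { check_id, result: "PASS"|"FAIL", reason }.',
      ].join('\n\n'),
    },
  ];
}

function parseVerdict(raw: string): CheckResult[] {
  const parsed = JSON.parse(raw);
  if (!Array.isArray(parsed)) throw new Error('Expected a JSON array of check results');
  return parsed.map((c) => {
    if (c.result !== 'PASS' && c.result !== 'FAIL') {
      throw new Error(`Invalid result for check ${c.check_id}: ${c.result}`);
    }
    return c as CheckResult;
  });
}
```

Keeping the prompt assembly and the parser separate from the network call is what later made the pipeline easy to test against a golden set.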
The Flipping Starts
The first sign was a Slack message from a developer on a Friday: "My PR passed CI twice and failed once on the exact same commit. Did someone change the review prompt?" Nobody had. We looked at the three run logs side by side. Runs 1 and 3 showed Check #7 (pagination format) as PASS. Run 2 showed it as FAIL, with a reason that was technically correct but contradicted the logic the model had used to pass it in runs 1 and 3.
We assumed it was a transient issue — a model serving blip, a load balancing artifact — and moved on. Over the next two weeks, the flipping became more frequent. By the end of the second week, we had four developers who'd learned to just re-run the CI job if it failed on the API review step, because it would usually pass the second time.
That's when we knew we had a real problem: our engineers had started treating an automated quality gate as a coin flip.
What temperature=0 Actually Guarantees
When you set temperature=0, the model uses greedy decoding — at each
token position, it always selects the highest-probability next token. In theory,
given identical inputs and an identical model, this should produce identical outputs.
The key phrase is identical model.
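Greedy decoding itself is trivially deterministic; the catch is that it is only deterministic *given the scores*. In the sketch below, `scoresFor` is a hypothetical stand-in for the model's forward pass:

```typescript
// Toy illustration of greedy decoding: at every step, take the argmax
// of the next-token scores. No sampling, no randomness in the rule itself.
function argmax(scores: number[]): number {
  let best = 0;
  for (let i = 1; i < scores.length; i++) {
    if (scores[i] > scores[best]) best = i;
  }
  return best;
}

function greedyDecode(
  scoresFor: (context: number[]) => number[], // stand-in for a model forward pass
  prompt: number[],
  steps: number,
): number[] {
  const tokens = [...prompt];
  for (let i = 0; i < steps; i++) {
    tokens.push(argmax(scoresFor(tokens))); // always the top-scoring token
  }
  return tokens;
}
```

Any flip in the output must therefore come from the scores themselves, which is exactly what a silent model update (or floating-point jitter) changes.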
OpenAI updates the models behind their API endpoints continuously. When you call
gpt-4-turbo-preview, you're not calling a frozen model — you're calling
whatever version OpenAI is currently serving under that alias. Model updates change
the weights, which changes the probability distributions, which changes the greedy
decoding output. The same prompt, the same temperature=0, a different
model snapshot: different answer.
There's a second source of non-determinism that persists even with a pinned model version: floating-point non-determinism from GPU parallel computation. Transformer inference runs on GPUs with parallel matrix operations. The order of floating-point additions in parallel is non-deterministic at the hardware level, and floating-point arithmetic is not associative. Two identical requests processed on different GPU hardware configurations can produce slightly different intermediate values, which can cascade into different token selections at positions where two tokens have nearly equal probability.
WHY temperature=0 DOESN'T MEAN DETERMINISTIC
─────────────────────────────────────────────────────────────
Source 1: Model updates
Week 1: gpt-4-turbo-preview → model checkpoint A
│
│ OpenAI silent update
▼
Week 3: gpt-4-turbo-preview → model checkpoint B
Same prompt → different weight distributions → different output
─────────────────────────────────────────────────────────────
Source 2: GPU floating-point non-determinism
Token N probabilities:
Token "PASS": 0.50000001...
Token "FAIL": 0.49999999... ← difference: 0.00000002
On GPU cluster A: PASS wins (FP addition order favours PASS)
On GPU cluster B: FAIL wins (different FP addition order)
This is not a bug. This is how IEEE 754 floating-point works
in parallel computation.
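The non-associativity takes three lines to demonstrate in any IEEE 754 language:

```typescript
// IEEE 754 addition is not associative: regrouping changes the result.
// A parallel GPU reduction changes the grouping depending on scheduling,
// which is exactly Source 2 above.
const a = 1e16;
const b = -1e16;
const c = 1.0;

const leftFirst = (a + b) + c;  // (0) + 1 = 1
const rightFirst = a + (b + c); // -1e16 + 1 rounds back to -1e16, so 1e16 + -1e16 = 0

console.log(leftFirst, rightFirst); // 1 0
```

At model scale, these rounding differences are vanishingly small per operation, but they only need to flip one near-tied token for the rest of the generation to diverge.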
Finding the Model Version Change
We added model fingerprinting to every API call — logging the system_fingerprint
field that OpenAI returns in completions responses. This field changes when the underlying
model is updated. Reviewing our logs, we found that gpt-4-turbo-preview had
updated its system fingerprint on the exact date our flipping rate had spiked from
~0.3% to ~8% of runs.
// We should have been logging this from day one
const response = await openai.chat.completions.create({ ... });
logger.info('llm_call', {
  model: response.model,
  system_fingerprint: response.system_fingerprint, // log this always
  prompt_tokens: response.usage?.prompt_tokens,
  completion_tokens: response.usage?.completion_tokens,
  run_id: context.runId,
});
The fingerprint change explained the majority of flipping. But it didn't explain all of it — even after the model stabilised at the new version, we still saw occasional inconsistency on specific inputs. That was the GPU floating-point issue affecting tokens with near-identical probabilities.
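Logging the fingerprint is only half the job; you also want to notice when it changes. One way is a drift check against the last fingerprint seen per model. This is a hypothetical sketch (the `alert` callback stands in for whatever notification channel you use), not our exact monitoring code:

```typescript
// Alert when a model's system_fingerprint differs from the last one seen.
const lastSeen = new Map<string, string>(); // model name -> last fingerprint

function checkFingerprint(
  model: string,
  fingerprint: string | undefined,
  alert: (msg: string) => void,
): void {
  if (!fingerprint) return; // some responses omit the field
  const previous = lastSeen.get(model);
  if (previous !== undefined && previous !== fingerprint) {
    alert(`Model ${model} fingerprint changed: ${previous} -> ${fingerprint}`);
  }
  lastSeen.set(model, fingerprint);
}
```

Had this been in place from day one, the fingerprint change would have been a Slack alert on the day it happened, not a forensic discovery two weeks later.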
The Fix: Stop Treating LLMs as Oracles, Start Treating Them as Voters
The architectural fix was to stop relying on a single LLM call and start using a majority vote across multiple calls — a technique sometimes called self-consistency:
// Before: single call, single verdict
const verdict = await reviewCode(controller, requirements);
if (verdict.failures.length > 0) fail();
// After: 3 independent calls, majority vote
const verdicts = await Promise.all([
  reviewCode(controller, requirements),
  reviewCode(controller, requirements),
  reviewCode(controller, requirements),
]);
// For each check, count votes (each verdict maps check id → 'PASS' | 'FAIL')
const finalVerdict = requirements.map(check => {
  const votes = verdicts.map(v => v[check.id]);
  const failVotes = votes.filter(v => v === 'FAIL').length;
  // Require 2/3 agreement to FAIL a check.
  // A single dissenting vote is not enough to block a PR.
  // Confidence reflects how unanimous the vote was, in either direction,
  // so a 3/3 PASS is high confidence too.
  const agreement = Math.max(failVotes, votes.length - failVotes);
  return {
    check_id: check.id,
    result: failVotes >= 2 ? 'FAIL' : 'PASS',
    confidence: agreement === votes.length ? 'high' : 'medium',
  };
});
We also pinned to a specific dated model snapshot instead of the rolling alias:
// Before (rolling alias — updates silently)
model: 'gpt-4-turbo-preview'
// After (pinned snapshot — stable until deprecated)
model: 'gpt-4-0125-preview'
// And we created a monthly task to review and update the pin
// after testing the new snapshot against our evaluation set
The 3-call majority vote added ~$0.004 per PR check and increased latency from 4s to 7s (parallel calls). The flipping rate dropped to zero over the following 4 weeks of monitoring.
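Monitoring a flip rate is worth automating: replay one fixed canary input repeatedly and measure how often the result disagrees with the modal output. A hypothetical sketch of such a check:

```typescript
// "Determinism canary": run the same input n times and return the
// fraction of runs whose output disagrees with the most common output.
// 0 means fully consistent; anything above 0 means the pipeline flips.
async function flipRate<T>(run: () => Promise<T>, n: number): Promise<number> {
  const counts = new Map<string, number>();
  for (let i = 0; i < n; i++) {
    const key = JSON.stringify(await run()); // serialise output for comparison
    counts.set(key, (counts.get(key) ?? 0) + 1);
  }
  const modal = Math.max(...counts.values());
  return (n - modal) / n;
}
```

Run on a schedule against a fixed controller, this turns "the flipping rate dropped to zero" from an impression into a number on a dashboard.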
Building an Evaluation Set
The deeper fix was building a golden evaluation set — 200 controller examples with known correct verdicts — that we run against any new model snapshot before updating the pin. This lets us catch regressions before they affect the live pipeline:
// evaluate.ts — run before updating model pin
import { isDeepStrictEqual } from 'node:util';

const GOLDEN_SET = await loadGoldenSet(); // 200 labelled examples
// Note: Promise.all fires all 200 requests at once; in practice,
// batch these to stay under your rate limits.
const results = await Promise.all(
  GOLDEN_SET.map(async ({ controller, expectedVerdict }) => {
    const verdict = await reviewCode(controller, requirements);
    return {
      expected: expectedVerdict,
      actual: verdict,
      match: isDeepStrictEqual(verdict, expectedVerdict),
    };
  })
);
const accuracy = results.filter(r => r.match).length / results.length;
console.log(`Model accuracy on golden set: ${(accuracy * 100).toFixed(1)}%`);
if (accuracy < 0.95) throw new Error('Model does not meet accuracy threshold. Do not update pin.');
Lessons
- temperature=0 is not determinism; it's greedy decoding. Greedy decoding on a non-frozen model is not reproducible. Never build a pipeline that requires identical LLM output across runs.
- Never use rolling model aliases in production. gpt-4-turbo-preview, gpt-4o, claude-3-5-sonnet-latest: all of these can change under you without notice. Pin to dated snapshots. Test before updating.
- Log system_fingerprint on every call. It's the only way to know whether the model behind your API call changed between two runs.
- For high-stakes decisions, use self-consistency (majority vote). Three independent calls with a 2/3 threshold is more reliable than one call at any temperature setting.
- If engineers start re-running CI to get a different answer, your pipeline is broken. Non-determinism that developers learn to work around doesn't show up in your error metrics; it shows up in lost trust.