The Claude extended thinking mode that changes how I debug hard problems
I was three hours into debugging an async race condition in our Node.js order processing service. The bug was intermittent — happening maybe once every few hundred requests — and every reproduction attempt felt like fishing in fog. I'd already asked Claude for help twice, got reasonable answers, tried them, and the bug persisted. Then, almost by accident, I enabled a setting I'd been ignoring: extended thinking. What came back wasn't just an answer. It was 2,000 tokens of Claude reasoning through the problem like a senior engineer talking through their mental model — and halfway through reading it, I spotted the bug myself.
What Most Engineers Do With Claude
The typical Claude debugging workflow goes like this: paste the broken code, describe the symptom, hit send, and read the response. For most problems, this works great. Claude is fast, the answer is usually in the first paragraph, and you move on.
But for genuinely hard bugs — race conditions, complex state interactions, subtle async ordering issues, performance problems that only appear under load — this pattern hits a ceiling. Claude gives you an answer, but it's often treating symptoms rather than causes. You try the suggestion, the bug is still there. You try again. Same result.
The problem isn't Claude's knowledge. It's that for complex problems, the reasoning process is the thing. A senior engineer doesn't just output an answer — they think out loud, challenge their own assumptions, consider edge cases, explore multiple hypotheses. Standard Claude responses skip all of that. Extended thinking doesn't.
What I Found: Extended Thinking Mode
Extended thinking is a mode where Claude pauses before answering and works through the problem in an explicit reasoning chain that's returned to you alongside the answer. It's been in the Anthropic API for a while, but it's easy to miss because it requires an explicit opt-in on every request, and many developers never flip it on.
Here's how you enable it:
import Anthropic from "@anthropic-ai/sdk";

const client = new Anthropic();

const response = await client.messages.create({
  model: "claude-sonnet-4-5",
  max_tokens: 16000,
  thinking: {
    type: "enabled",
    budget_tokens: 10000, // how much Claude can "think" before answering
  },
  messages: [
    {
      role: "user",
      content: `Here's a Node.js async race condition I've been debugging for 3 hours.
The bug is intermittent (~1 in 300 requests), happens only under concurrent load,
and my existing mutex didn't fix it.
${yourCode}
The symptom: orders occasionally get double-charged. Here's the relevant log sequence...`,
    },
  ],
});

// Response contains two blocks: thinking + text
for (const block of response.content) {
  if (block.type === "thinking") {
    console.log("Claude's reasoning:\n", block.thinking);
  } else if (block.type === "text") {
    console.log("Claude's answer:\n", block.text);
  }
}
The thinking block is the part most people don't know about. It's Claude's working reasoning — the chain of analysis it produced before assembling the final answer, rather than the polished response itself. This is where the real value lives.
Why It's Different — Really Different
To understand why this matters, here's a condensed version of what the thinking block looked like for my race condition:
"The user describes a double-charge bug happening ~1 in 300 requests under
concurrent load. Their mutex didn't fix it. Let me think about why a mutex
might fail here...
If the mutex is instantiated per-request rather than shared across requests,
it's not providing any mutual exclusion at all — each request has its own
lock and they never contend. This is a common mistake.
But wait, looking at the code, the mutex IS module-level. So why would it fail?
Let me look at the async flow more carefully...
The issue might be in how they're awaiting. The charge function does:
await stripe.charge() then await db.markCharged()
If two requests both pass the 'already charged?' check before either one
completes the stripe.charge(), they'll both proceed. The check isn't inside
the mutex — only the DB write is. That's the gap.
Let me verify: yes, the 'isAlreadyCharged' check on line 23 is OUTSIDE the
mutex acquire. The mutex only covers lines 28-35 (the DB write). So two
concurrent requests can both read isAlreadyCharged=false, both call
stripe.charge(), and only then fight over the mutex for the DB write.
The fix isn't a bigger mutex. They need to move the read-check-write into a
single atomic DB transaction, or move the entire operation inside the lock..."
Reading this, I spotted my mistake by the second paragraph. The reasoning chain had reconstructed my bug from first principles and diagnosed the actual root cause — not the symptom I'd been chasing.
Compare that to the standard Claude response, which had correctly suggested "use a mutex" — technically reasonable, but a misdiagnosis, because it couldn't see that I'd already placed the mutex in the wrong spot.
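The fix the reasoning chain pointed at is easy to sketch. Here's a minimal, self-contained TypeScript model of it — `withLock`, `chargeOnce`, and the in-memory "database" are hypothetical stand-ins, not the real service — where the already-charged check, the charge, and the write all happen inside one critical section, so two concurrent requests can no longer interleave between check and write:

```typescript
// Minimal promise-chain mutex: each caller queues behind the previous holder.
let tail: Promise<void> = Promise.resolve();
function withLock<T>(fn: () => Promise<T>): Promise<T> {
  const result = tail.then(fn);
  // Keep the chain alive even if fn rejects.
  tail = result.then(() => undefined, () => undefined);
  return result;
}

// Hypothetical in-memory stand-ins for the DB and the payment provider.
const charged = new Set<string>();
let chargeCalls = 0;
async function stripeCharge(orderId: string) {
  chargeCalls++;
  await new Promise((r) => setTimeout(r, 10)); // simulate network latency
}

// The fix: check, charge, and mark-charged form ONE critical section.
async function chargeOnce(orderId: string) {
  return withLock(async () => {
    if (charged.has(orderId)) return "already-charged";
    await stripeCharge(orderId);
    charged.add(orderId);
    return "charged";
  });
}

// Two concurrent requests for the same order: only one charge goes through.
async function demoFixed() {
  await Promise.all([chargeOnce("order-42"), chargeOnce("order-42")]);
  return chargeCalls; // 1, not 2
}
```

In a real multi-process service you'd reach for a database transaction with a unique constraint or `SELECT ... FOR UPDATE` rather than an in-process lock, but the shape is the same: the read and the write have to be atomic together.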
STANDARD CLAUDE
  Read prompt
  → Generate answer
  Good for: 80% of questions. Fast, accurate, moves you forward.

EXTENDED THINKING
  Read prompt
  → Think: What do I know?
  → Think: What assumptions am I making?
  → Think: What might the user be missing?
  → Think: What edge cases matter here?
  → Think: Let me trace the async flow...
  → Think: Wait — the check is outside the lock
  → Generate answer (with reasoning as context)
  Good for: the other 20%. Slower, but catches what standard misses.
Real Use Cases Where This Changes the Game
1. Race conditions and async bugs
These are the hardest category of bugs because the failure mode depends on timing, not just logic. Extended thinking lets Claude trace through the execution order, consider interleaving scenarios, and think through what the code does under concurrent load — not just what it does in isolation. The thinking block often surfaces assumptions you didn't know you were making.
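Once you know its shape, this failure mode is even reproducible deterministically. A minimal sketch of the check-then-act race — all names here are illustrative, not from the real service — where both requests read the flag before either one writes it:

```typescript
// A check-then-act sequence with an await in the gap — the classic async race.
let alreadyCharged = false;
let chargeCalls = 0;

async function buggyCharge() {
  if (alreadyCharged) return;                 // check...
  await new Promise((r) => setTimeout(r, 5)); // ...await in the gap (e.g. the charge call)
  chargeCalls++;
  alreadyCharged = true;                      // ...act
}

// Two concurrent requests: both pass the check before either sets the flag.
async function demoRace() {
  await Promise.all([buggyCharge(), buggyCharge()]);
  return chargeCalls; // 2 — the double charge
}
```

Both calls run their synchronous prefix up to the `await` before either resumes, so the check is useless under concurrency — exactly the interleaving a thinking block walks through.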
2. Architecture decisions with non-obvious tradeoffs
I've started using extended thinking for architecture reviews, not just debugging. When I paste a proposed design and ask "what am I missing?", the thinking block surfaces second and third-order consequences that a standard response glosses over. Things like: "if service A starts calling service B synchronously, and B scales independently, what happens during a B deployment?" It reasons through operational concerns, not just structural ones.
// Architecture review — extended thinking finds what you're not asking about
const response = await client.messages.create({
  model: "claude-sonnet-4-5",
  max_tokens: 16000,
  thinking: {
    type: "enabled",
    budget_tokens: 8000,
  },
  messages: [
    {
      role: "user",
      content: `I'm designing a notification system for ~50k daily active users.
Here's my proposed architecture: ${architectureDiagram}
What tradeoffs am I accepting, and what should I validate before building?`,
    },
  ],
});
3. Performance problems that appear only under load
When you show Claude a query plan or a flame graph and ask "why is this slow?", a standard response often gives you the textbook answer. Extended thinking reasons through the data: "at their described load of 500 RPS, and assuming row count of X, this index scan will... wait, they said the query is fast in dev. Dev probably has 1000 rows, production has 50 million. The index isn't being used because the planner is choosing a full scan — that's a statistics staleness problem, not an index design problem." That kind of conditional reasoning is what makes extended thinking genuinely useful for non-obvious performance bugs.
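The dev-versus-production divergence in that example is just arithmetic, and you can sketch the planner's choice directly. Here's a toy cost model — the constants are invented for illustration, and real planners like Postgres use far richer statistics — showing how stale row estimates make the planner keep a plan that's catastrophic at production scale:

```typescript
// Toy cost model: the planner picks the cheaper of index scan vs sequential
// scan using its *estimated* row count — which may be stale.
function choosePlan(estimatedRows: number, matchingFraction: number): string {
  const seqScanCost = estimatedRows;                                  // read every row once
  const indexScanCost = 2_000 + estimatedRows * matchingFraction * 4; // setup + per-match lookups
  return indexScanCost < seqScanCost ? "index-scan" : "seq-scan";
}

// Dev table (1k rows): seq scan is genuinely cheapest, and it's instant anyway.
const devPlan = choosePlan(1_000, 0.001); // "seq-scan"

// Production table has 50M rows, but STALE statistics still claim ~1k:
const stalePlan = choosePlan(1_000, 0.001); // still "seq-scan" — now over 50M rows

// After refreshing statistics (e.g. ANALYZE), the estimate is honest:
const freshPlan = choosePlan(50_000_000, 0.001); // "index-scan"
```

The bug isn't the index design; it's that the planner is costing against the wrong table size — the "statistics staleness" diagnosis from the thinking block.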
4. Code review for subtle correctness issues
I paste PRs into Claude for a second look before merging. With extended thinking enabled, the review catches things like: "this function handles the error case, but if the caller retries on error, and this function has a side effect before the failure point, the retry will cause a double side-effect." That's the kind of reasoning that takes a careful human reviewer 30 minutes to surface. Extended thinking surfaces it in seconds.
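That retry-plus-side-effect hazard is worth seeing concretely. A minimal sketch with hypothetical names — the email counter stands in for any non-idempotent side effect — where the effect fires before the failure point, so a reasonable-looking retry doubles it:

```typescript
let emailsSent = 0;

// The pattern the review flagged: side effect BEFORE the failure point.
async function provisionUser(state: { failedOnce: boolean }) {
  emailsSent++; // side effect (e.g. send welcome email)
  if (!state.failedOnce) {
    state.failedOnce = true;
    throw new Error("transient DB error"); // failure AFTER the side effect
  }
  return "provisioned";
}

// A caller that retries once on error — sensible in isolation.
async function withRetry<T>(fn: () => Promise<T>): Promise<T> {
  try {
    return await fn();
  } catch {
    return await fn(); // retry
  }
}

async function demoDoubleSideEffect() {
  const state = { failedOnce: false };
  await withRetry(() => provisionUser(state));
  return emailsSent; // 2 — one user, two welcome emails
}
```

The usual fixes are to move the side effect after the commit point or to guard it with an idempotency key, but the point here is the reasoning: neither function is wrong alone, only their composition is.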
The Tradeoffs to Know
Extended thinking isn't free. It's slower — you're waiting for Claude to complete a reasoning chain before the answer starts streaming. And budget_tokens counts against your token usage, so a 10,000-token thinking budget on every request adds up quickly.
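The cost impact is easy to estimate on the back of an envelope. Thinking tokens are billed as output tokens, so a fully-used 10k budget adds 10k output tokens per request. A small helper — the per-token prices below are placeholders, not current Anthropic pricing, so plug in the real rates for your model:

```typescript
// Rough cost estimate: thinking tokens bill as output tokens.
// Prices are PLACEHOLDERS — check the current price list for your model.
function estimateCostUSD(opts: {
  requests: number;
  inputTokens: number;        // per request
  answerTokens: number;       // per request, the visible answer
  thinkingTokens: number;     // per request; 0 when thinking is off
  inputPricePerMTok: number;  // dollars per million input tokens
  outputPricePerMTok: number; // dollars per million output tokens
}): number {
  const input = opts.requests * opts.inputTokens * (opts.inputPricePerMTok / 1_000_000);
  const output =
    opts.requests * (opts.answerTokens + opts.thinkingTokens) * (opts.outputPricePerMTok / 1_000_000);
  return input + output;
}

// 100 debugging sessions/day, fully-used 10k thinking budget vs none:
const withThinking = estimateCostUSD({
  requests: 100, inputTokens: 2_000, answerTokens: 1_000,
  thinkingTokens: 10_000, inputPricePerMTok: 3, outputPricePerMTok: 15,
});
const withoutThinking = estimateCostUSD({
  requests: 100, inputTokens: 2_000, answerTokens: 1_000,
  thinkingTokens: 0, inputPricePerMTok: 3, outputPricePerMTok: 15,
});
```

At these placeholder rates the thinking budget dominates the bill, which is exactly why a default-off wrapper with an explicit flag (below) pays for itself.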
My current practice: I use a simple wrapper that defaults to standard mode, with a --think flag for hard problems:
async function askClaude(
  prompt: string,
  options: { think?: boolean; budgetTokens?: number } = {}
) {
  const { think = false, budgetTokens = 8000 } = options;

  const params: Anthropic.MessageCreateParams = {
    model: "claude-sonnet-4-5",
    max_tokens: think ? 16000 : 4096,
    messages: [{ role: "user", content: prompt }],
  };

  if (think) {
    params.thinking = {
      type: "enabled",
      budget_tokens: budgetTokens,
    };
  }

  const response = await client.messages.create(params);

  // Only thinking requests produce a thinking block; find() is undefined otherwise.
  const thinkingBlock = response.content.find((b) => b.type === "thinking");
  const textBlock = response.content.find((b) => b.type === "text");

  return {
    reasoning: thinkingBlock?.type === "thinking" ? thinkingBlock.thinking : null,
    answer: textBlock?.type === "text" ? textBlock.text : "",
  };
}

// Fast path — standard answers for standard questions
const { answer } = await askClaude("Explain LATERAL joins in Postgres");

// Deep path — reasoning chain for hard problems
const { reasoning, answer: fix } = await askClaude(
  `Intermittent race condition, 3 hours in, mutex didn't fix it...`,
  { think: true, budgetTokens: 10000 }
);
Try It on Your Next Hard Bug
The pattern I've settled on: whenever I'm about to open a second browser tab, pull in a colleague for a rubber-duck session, or start adding console.log statements to narrow down a problem I've already spent 30+ minutes on — that's when I reach for extended thinking instead.
It won't replace the rubber-duck. But it will often get you to the right question faster, and for genuinely subtle bugs, reading Claude's reasoning chain is like having a senior engineer explain their mental model in real time.
The budget_tokens parameter is the knob you'll tune most. I've found 8,000–12,000 tokens covers most debugging sessions without becoming expensive. For architecture decisions where I want Claude to really stress-test a design, I'll go up to 16,000.
Start with the bug you've been stuck on longest. Enable extended thinking. Read the reasoning block before the answer. That's where the insight usually is.
# Quick test — paste your hard bug and see the reasoning
npx ts-node -e "
const Anthropic = require('@anthropic-ai/sdk');
const client = new Anthropic.default();
client.messages.create({
  model: 'claude-sonnet-4-5',
  max_tokens: 12000,
  thinking: { type: 'enabled', budget_tokens: 8000 },
  messages: [{ role: 'user', content: process.argv[1] }]
}).then(r => {
  r.content.forEach(b => {
    if (b.type === 'thinking') console.log('=== REASONING ===\n', b.thinking);
    if (b.type === 'text') console.log('=== ANSWER ===\n', b.text);
  });
});
" -- 'Your bug description here'