The Shared State Trap: How a FastAPI 'Optimisation' Leaked User Data
We replaced Flask's request-scoped g with a plain module-level dict during
our FastAPI migration. It worked perfectly in tests and staging. In production, under
concurrent load, it silently served one tenant's data to a completely different user —
for three days before we caught it.
The Rewrite Nobody Questioned
Four years into running a Flask 1.x reporting API, we decided to rewrite it in FastAPI. The pitch was sound: native async support for slow I/O endpoints, automatic request validation via Pydantic, and OpenAPI docs that would actually stay in sync with reality. Management approved. Engineering was excited. Two sprints later, tests were green, our slowest endpoints were 40% faster in load tests, and we deployed on a Wednesday.
By all appearances, the migration was a success. Our dashboards glowed a healthy green. Then, seventy-two hours later, the support tickets started arriving.
Wrong Data, Zero Errors
"I'm seeing reports that don't belong to my company."
The first ticket we dismissed as a frontend cache glitch. The second made us nervous. The third — with a screenshot — confirmed our worst fear: a multi-tenant data leak. Users were receiving valid, well-formed API responses with correct HTTP 200 status codes, but the data inside belonged to a different organisation.
The terrifying part? The logs were completely clean. No exceptions. No 500 errors. No suspicious query patterns. No anomalous latency spikes. Just a steady stream of healthy 200 responses that happened to contain the wrong organisation's data.
I spent the next afternoon adding deep instrumentation — logging the org_id
extracted from the JWT at auth time, the org_id passed to each database
query, and the org_id present on the returned rows. I deployed and waited.
When the next incident hit, the log line read:
auth.org_id=2041 → query.org_id=2041 → result.org_id=1038
The auth was correct. The query filter was correct. The data that came back wasn't. Unless… the data didn't come from the database at all.
The "Optimisation" That Broke Everything
In the old Flask codebase, we used flask.g extensively — Flask's
request-scoped proxy that stores arbitrary per-request data for the duration of a
single request. It was how we passed context (org ID, user ID, request metadata)
down through deep call chains without threading it through every function signature.
It was convenient. It was idiomatic Flask. It worked reliably for four years.
During the FastAPI migration, one of the team replaced flask.g with what
seemed like an equivalent: a module-level dictionary. Cleaner, they thought. No import
from Flask. More "Pythonic."
# Looked harmless. Was catastrophic.
import asyncio

from fastapi import APIRouter, Depends

router = APIRouter()

_request_context: dict = {}

def set_context(org_id: int, user_id: int) -> None:
    _request_context["org_id"] = org_id
    _request_context["user_id"] = user_id

def get_org_id() -> int:
    return _request_context.get("org_id")

# Used in the route handler (TokenData, verify_token and fetch_report
# are defined elsewhere in the codebase):
@router.get("/reports/{report_id}")
async def get_report(
    report_id: int,
    token: TokenData = Depends(verify_token),
):
    set_context(token.org_id, token.user_id)  # Set context for this "request"
    await asyncio.sleep(0)  # Yield to event loop (batching)
    report = await fetch_report(report_id)  # Calls get_org_id() internally
    return report
In Flask, this pattern is safe. Flask's g is a Werkzeug LocalProxy backed by
threading.local() under the hood. With a thread-per-request model, each
thread has its own isolated copy of any thread-local variable, so setting and
reading per-request values this way is inherently scoped to one request, one thread.
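The thread-local mechanism that makes the Flask pattern safe can be demonstrated in a few lines, independent of Flask (the handle function and org IDs below are illustrative, not our production code):

```python
import threading

local = threading.local()  # each OS thread gets its own attribute namespace
seen = {}

def handle(org_id: int) -> None:
    local.org_id = org_id        # this thread's private binding
    seen[org_id] = local.org_id  # always reads back its own value

threads = [threading.Thread(target=handle, args=(org,)) for org in (2041, 1038)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(seen)  # each thread reads back its own org_id: seen == {2041: 2041, 1038: 1038}
```

No matter how the two threads interleave, neither can see the other's binding, because the attribute lives in per-thread storage rather than in one shared object.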
FastAPI is different. It runs on an async event loop. A single OS thread handles
thousands of concurrent requests. That module-level _request_context dict
is one object in memory, shared by every concurrent coroutine. When two requests
are running simultaneously and both write to the same keys — the last write wins,
and whoever reads next gets the wrong value.
How the Corruption Happens
To understand why this fails, you need to see how Python's async event loop interleaves
coroutines. When a coroutine hits an await, it yields control back to the
event loop, which picks up another coroutine. This cooperative scheduling is why async
code is fast — it's also why shared mutable state is a trap.
BROKEN: Module-level dict, two concurrent requests
Time │ Request A (org=2041) Request B (org=1038)
─────┼──────────────────────────────────────────────────────
t1 │ set_context(org_id=2041)
│ _request_context = {"org_id": 2041}
t2 │ await asyncio.sleep(0) ──────► yields to event loop
t3 │ set_context(org_id=1038)
│ _request_context = {"org_id": 1038}
t4 │ await db.fetch(...) ──► yields
t5 │ ◄────────────────────────────── event loop resumes A
t6 │ get_org_id()
t7 │ returns 1038 ✗ ← B overwrote A's key!
t8 │ query: WHERE org_id = 1038
t9 │ → org 1038's data returned to org 2041's user
      Final state: _request_context = {"org_id": 1038}
      One shared dict. All requests.
Any await is a potential interleave point. Our handler set the context,
then immediately awaited — a cache lookup, a database call, sometimes just
asyncio.sleep(0) for batching. In that window, another request could write
to the same dict. When the first request resumed, it read the wrong org ID, queried
with the wrong filter, and returned the wrong tenant's data.
Under low load, the timing rarely aligned. Under production load with dozens of concurrent requests, it happened constantly. Because the responses were structurally valid — correct JSON shape, HTTP 200, real data — no automated monitor caught it. There was nothing to catch. From the system's perspective, everything was working.
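The whole failure mode fits in a dozen lines. Here is a minimal, self-contained repro (the handle function and org IDs are illustrative stand-ins for our route handler, not the production code):

```python
import asyncio

_request_context: dict = {}  # one object, shared by every coroutine

async def handle(org_id: int) -> int:
    _request_context["org_id"] = org_id  # "set context"
    await asyncio.sleep(0)               # yield: the other request runs here
    return _request_context["org_id"]    # read it back after resuming

async def main() -> list:
    # Two "requests" in flight at once on the same event loop.
    return await asyncio.gather(handle(2041), handle(1038))

results = asyncio.run(main())
print(results)  # [1038, 1038]: the second write clobbered the first
```

The task that asked for org 2041 reads back 1038, exactly the t1 to t9 interleaving in the diagram above.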
The Fix: contextvars
Python 3.7 introduced contextvars — a module designed exactly for this
problem. A ContextVar is automatically scoped to the current async task
(or OS thread). Each coroutine gets its own isolated binding. It is the async-native
equivalent of thread-local storage, and it works correctly across await
boundaries.
from contextvars import ContextVar
from typing import Optional

# Each async task gets its own isolated copy of these values.
# ContextVar is safe across await boundaries — no shared state.
_org_id_var: ContextVar[Optional[int]] = ContextVar("org_id", default=None)
_user_id_var: ContextVar[Optional[int]] = ContextVar("user_id", default=None)

def set_context(org_id: int, user_id: int) -> None:
    _org_id_var.set(org_id)
    _user_id_var.set(user_id)

def get_org_id() -> int:
    org_id = _org_id_var.get()
    if org_id is None:
        raise RuntimeError("org_id not set — is set_context() missing from this path?")
    return org_id

def get_user_id() -> int:
    user_id = _user_id_var.get()
    if user_id is None:
        raise RuntimeError("user_id not set — is set_context() missing from this path?")
    return user_id
When Request A calls _org_id_var.set(2041), Python's async runtime
stores that binding in A's execution context — a lightweight namespace that
asyncio maintains per task. When Request B calls _org_id_var.set(1038),
it writes to B's context. The two never touch.
FIXED: ContextVar, two concurrent requests
Time │ Request A (org=2041) Request B (org=1038)
─────┼──────────────────────────────────────────────────────
t1 │ _org_id_var.set(2041)
│ Context A: { _org_id_var → 2041 }
t2 │ await asyncio.sleep(0) ──────► yields to event loop
t3 │ _org_id_var.set(1038)
│ Context B: { _org_id_var → 1038 }
t4 │ await db.fetch(...) ──► yields
t5 │ ◄────────────────────────────── event loop resumes A
t6 │ _org_id_var.get()
t7 │ returns 2041 ✓ ← reads from A's own context
t8 │ query: WHERE org_id = 2041
t9 │ → org 2041's data returned to org 2041's user ✓
Context A: { _org_id_var: 2041 } ← isolated
Context B: { _org_id_var: 1038 } ← isolated
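Rerunning the same two-task experiment with a ContextVar shows the isolation directly (again a sketch with illustrative org IDs):

```python
import asyncio
from contextvars import ContextVar

_org_id_var: ContextVar[int] = ContextVar("org_id")

async def handle(org_id: int) -> int:
    _org_id_var.set(org_id)   # binds only in this task's context
    await asyncio.sleep(0)    # the other task runs here, in its own context
    return _org_id_var.get()  # still this task's own value

async def main() -> list:
    return await asyncio.gather(handle(2041), handle(1038))

results = asyncio.run(main())
print(results)  # [2041, 1038]: each task reads back its own binding
```

The isolation comes for free because asyncio copies the current context when it creates each task, so set() inside one task never leaks into another.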
One import swap. One class change. That's all it took to fix the bug. The damage it caused took considerably longer to address.
An Honest Post-Mortem
We ran a full audit of every affected request — three days of logs, cross-referenced against support tickets and org ID mismatches in our access logs. We identified seventeen tenants who had received at least one response containing another tenant's data. We disclosed to every one of them individually, revoked the affected report exports, and filed a GDPR incident report.
It was one of the most uncomfortable conversations I've had with clients. The data involved wasn't especially sensitive — aggregated analytics, not financial records or PII — but that barely softened it. Data isolation is a contract. You can't partially honour a contract and call it a success.
What We Changed After
Beyond the immediate fix, we made three structural changes to prevent a recurrence:
- Explicit over implicit context: We deprecated the context helpers
entirely on new endpoints. org_id and user_id are now injected via FastAPI's
Depends() system as typed parameters. Every function that needs the org ID
receives it explicitly — the data flow is visible in every function signature,
not hidden in a global.
- Cross-tenant isolation tests: We added integration tests that fire two
concurrent requests for different orgs and assert each response contains only
data belonging to the requesting org. These tests run in CI on every PR and
took about three hours to write. They would have caught this bug in staging
immediately.
- Module-level state lint rule: We added a custom Pylint rule that
flags any mutable module-level dict or list inside the
services/ directory. Module-level state is fine for config and constants —
not for per-request data. The linter makes the distinction enforced, not
advisory.
The Broader Lesson
The mistake wasn't carelessness. The developer who introduced it was experienced. The pattern — storing request context in a "global" — is completely normal in Flask, Django, and every other thread-per-request framework. It's how you avoid prop-drilling context through twenty function signatures. For four years it had worked without issue.
The problem was translating a thread-safe pattern to an async context without understanding what made it thread-safe in the first place.
Flask's g isn't just a dict. It's backed by LocalProxy, which wraps threading.local(). The safety is invisible unless you've read the source. When we copied the pattern without copying the mechanism, we got all of the convenience and none of the isolation.
When migrating from a synchronous to an asynchronous framework, every piece of "ambient" state deserves a hard look. Thread-local storage, request-local proxies, singleton caches — they all behave differently when your execution model changes. What was safe in a thread-per-request world can become a data leak in an async one.
If you're running FastAPI and passing context through your call chain via anything
other than explicit parameters or ContextVar, I'd audit it today.
Not tomorrow. Today. Silent data leaks are patient. They wait for the right
concurrency timing, then they show up in a support ticket with a screenshot.