How SQLAlchemy's Identity Map Served Stale Data to 23,000 API Requests
We wanted explicit control of our SQLAlchemy sessions — no magic, no hidden Flask-SQLAlchemy
globals, just clean dependency injection. We set up scoped_session correctly,
wrote our teardown handler, deployed to production, and went home. Six hours later, our HR
platform was silently returning outdated employee records after every profile update. No exceptions.
No slow queries. No errors in the logs. Just 23,000 confident, wrong API responses.
Production Failure: Updates That Never Stuck
The tickets arrived in clusters. An employee would update their email address, receive an HTTP 200 with the new value in the response body, then open their profile thirty seconds later to find the old address still showing. Support initially blamed the frontend. A cache issue, probably. Clear your browser, try again. But the reports kept coming — from different browsers, different users, different offices.
We pulled the database records manually. The updates were there. The PostgreSQL rows contained the correct, updated values. Whatever was returning stale data wasn't the database — it was something in our API layer that had decided the database was optional.
False Assumptions: "Sessions Are Stateless Per-Request"
Our mental model was simple and, it turned out, wrong. We had heard that
scoped_session isolates sessions by thread identity. We were running
Gunicorn with four threaded workers (--threads 4). We assumed that meant
four isolated sessions, one per active thread, each living exactly as long as its
request. Clean. Correct. Thread-safe.
What we missed is that scoped_session creates a registry keyed
by thread identity — but it does not create a new session for each request on that thread.
It creates one session per thread and hands back the same session every time you call
Session() from that thread. The session persists until you explicitly call
Session.remove(). We never did.
SQLAlchemy's scoped_session guarantees one session per thread.
It says nothing about one session per request. That contract is yours to enforce.
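A few lines against an in-memory SQLite database make that contract concrete (an illustrative sketch, not our production code):

```python
from sqlalchemy import create_engine
from sqlalchemy.orm import sessionmaker, scoped_session

engine = create_engine("sqlite://")  # in-memory stand-in for our Postgres engine
Session = scoped_session(sessionmaker(bind=engine))

s1 = Session()
assert Session() is s1   # same thread: every call returns the same session

Session.remove()         # the call we never made
s2 = Session()
assert s2 is not s1      # only after remove() does this thread get a new session
```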
Compounding this: SQLAlchemy sessions maintain an identity map — a first-level cache that maps each loaded object's primary key to the in-memory Python object. When you query for an entity that's already in the identity map, SQLAlchemy skips the database entirely and returns the cached object. This is a performance feature designed for within-request deduplication. We were accidentally using it as a cross-request stale-data store.
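Here is the identity map in miniature, as a sketch with an illustrative Employee model and an in-memory SQLite database:

```python
from sqlalchemy import create_engine, Column, Integer, String
from sqlalchemy.orm import declarative_base, Session

Base = declarative_base()

class Employee(Base):
    __tablename__ = "employees"
    id = Column(Integer, primary_key=True)
    email = Column(String)

engine = create_engine("sqlite://")
Base.metadata.create_all(engine)

with Session(engine) as session:
    session.add(Employee(id=42, email="old@corp.com"))
    session.commit()

    a = session.get(Employee, 42)  # may emit a SELECT (commit expired the object)
    b = session.get(Employee, 42)  # identity map hit: no SQL emitted at all
    assert a is b                  # the very same Python object
    assert b.email == "old@corp.com"
```

Within one request this deduplication is exactly what you want; the failure mode in this article comes from that same map surviving past the request.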
Reproducing the Bottleneck
The bug only appeared under load. On Flask's single-threaded built-in development server, the debug reloader restarted the process so often that a session never lived long enough to serve stale data. In our staging environment — a single Gunicorn worker with no extra threads — the session was reused across requests, but the timing rarely caused symptoms we noticed. In production with four concurrent threads per worker, one thread would handle
the PUT /employees/42 update, then immediately handle the
GET /employees/42 read from a different client — and serve that client
the version of Employee #42 that was loaded twenty requests ago.
To reproduce locally, we wrote a script that sent two requests on the same thread in
sequence — first a write changing the email, then a read expecting the new email.
Without Session.remove(), the read returned the old email 100% of the
time. With it, the read always returned the updated value. Deterministic reproduction
in forty lines of test code.
# database/session.py — engine and thread-local session registry
from sqlalchemy import create_engine
from sqlalchemy.orm import sessionmaker, scoped_session

engine = create_engine(
    "postgresql+psycopg2://user:pass@localhost/hrdb",
    pool_size=10,
    max_overflow=5,
)

# Creates a thread-local session registry. Correct so far.
Session = scoped_session(sessionmaker(bind=engine))

def get_session():
    return Session()  # Returns the same session for this thread every time
# app.py — the missing teardown
from flask import Flask
from database.session import Session
app = Flask(__name__)
# ✗ Missing: @app.teardown_appcontext
# ✗ Missing: def shutdown_session(exception): Session.remove()
#
# Without this, the scoped_session's registry holds the session open.
# Every request on the same thread reuses the same session — and its
# identity map — indefinitely. The identity map grows. Old objects
# stay cached. Reads never hit the database for entities already loaded.
Root Cause: The Identity Map Doesn't Know About Requests
SQLAlchemy's identity map is elegant inside a single unit of work. Within one request,
if you load Employee(id=42) twice, SQLAlchemy correctly returns the same
Python object both times — no redundant database round-trip, and mutations are
automatically tracked on the canonical object. This is exactly how an ORM should work.
The problem is that the identity map has no concept of an HTTP request boundary.
It lives on the session. When the session outlives the request — which ours did, for
every request after the first one on each thread — the identity map carries stale state
forward. An Employee loaded at 14:00:01 stays in the map. When the same
thread handles a GET /employees/42 at 14:00:34, SQLAlchemy checks the map,
finds the object, and returns it without emitting a SELECT.
BROKEN: scoped_session without Session.remove()
Thread 1, Gunicorn Worker A
─────────────────────────────────────────────────────────────────
Request 1 GET /employees/42
│ Session.get(Employee, 42)
│ → No identity map entry → emits SELECT → loads e42
│ Identity map: { Employee#42: {email: "old@corp.com"} }
│ Returns: old@corp.com ✓
│
─ request ends — NO Session.remove() — session survives ─────────
│
Request 2 PUT /employees/42 body: {email: "new@corp.com"}
│ Direct UPDATE via different DB connection (fine)
│ Commit succeeds. DB row now has new@corp.com.
│
─ request ends — NO Session.remove() — session survives ─────────
│
Request 3 GET /employees/42 (different user's client)
│ Session.get(Employee, 42)
│ → Identity map HIT → no SELECT emitted
│ → Returns cached: {email: "old@corp.com"} ✗
│
▼ 23,000 requests served stale data this way
Identity map grows unbounded. Database is bypassed.
No exception. No log line. No alert.
The PUT update was issued over a separate database connection (our write
path opened its own connection for explicit transaction control). That connection
committed successfully — the database had the correct data. But the read session's
identity map was never invalidated. It had no way to know that another connection had
changed the row. It simply returned what it remembered.
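To see this concretely: the sketch below (illustrative names, SQLite standing in for PostgreSQL) shows a session ignoring a write made on another connection, and session.expire() forcing a reload. Explicit expiry was not our fix, but it is the invalidation the identity map never performs on its own:

```python
import os
import tempfile

from sqlalchemy import create_engine, Column, Integer, String, text
from sqlalchemy.orm import declarative_base, Session

Base = declarative_base()

class Employee(Base):
    __tablename__ = "employees"
    id = Column(Integer, primary_key=True)
    email = Column(String)

db_path = os.path.join(tempfile.mkdtemp(), "demo.db")
engine = create_engine(f"sqlite:///{db_path}")
Base.metadata.create_all(engine)
with engine.begin() as conn:
    conn.execute(text(
        "INSERT INTO employees (id, email) VALUES (42, 'old@corp.com')"))

session = Session(engine)
emp = session.get(Employee, 42)        # now pinned in the identity map

# Another connection changes the row behind the session's back.
with engine.begin() as conn:
    conn.execute(text(
        "UPDATE employees SET email = 'new@corp.com' WHERE id = 42"))

assert emp.email == "old@corp.com"     # the map remembers; no SELECT emitted

session.expire(emp)                    # explicitly invalidate the cached state
assert emp.email == "new@corp.com"     # attribute access now reloads from the DB
```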
Architecture Fix: Enforce the Request Boundary
The fix is one function and one decorator — the teardown hook that Flask-SQLAlchemy registers automatically, which we had failed to add in our "explicit" setup:
from flask import Flask
from sqlalchemy import create_engine
from sqlalchemy.orm import sessionmaker, scoped_session
engine = create_engine(
    "postgresql+psycopg2://user:pass@localhost/hrdb",
    pool_size=10,
    max_overflow=5,
)

# Expire all attributes on commit — fetch fresh values on next access.
# This is the default, but making it explicit prevents future regressions.
Session = scoped_session(
    sessionmaker(bind=engine, expire_on_commit=True)
)
app = Flask(__name__)
@app.teardown_appcontext
def shutdown_session(exception=None):
    """
    Called automatically at the end of every request context.

    Session.remove() does two things:
      1. Calls session.close() — returns the connection to the pool.
      2. Removes the session from the scoped registry — next request
         on this thread gets a brand-new session with an empty identity map.
    """
    Session.remove()

def get_session():
    """
    Within a request, always returns the same session (correct).
    After teardown, the next call returns a new session (correct).
    """
    return Session()
With this in place, the session lifecycle aligns exactly with the request lifecycle. Every request starts with a clean identity map. Reads always hit the database. The performance cost is negligible — PostgreSQL query latency for a primary-key lookup averages 1.2ms on our cluster, well within our SLA budget.
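Flask's test client makes it easy to verify that the teardown really does give each request its own session; a sketch with an illustrative /ping route:

```python
from flask import Flask
from sqlalchemy import create_engine
from sqlalchemy.orm import sessionmaker, scoped_session

engine = create_engine("sqlite://")
Session = scoped_session(sessionmaker(bind=engine))

app = Flask("demo")
served_by = []   # the session object that handled each request

@app.route("/ping")
def ping():
    served_by.append(Session())   # keep a reference to the current session
    return "ok"

@app.teardown_appcontext
def shutdown_session(exception=None):
    Session.remove()

client = app.test_client()
client.get("/ping")
client.get("/ping")
assert served_by[0] is not served_by[1]   # each request got its own session
```

Delete the teardown handler and the final assertion fails: both requests are served by the same session object, which is exactly the production bug in two lines.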
FIXED: scoped_session + Session.remove() in teardown
Thread 1, Gunicorn Worker A
─────────────────────────────────────────────────────────────────
Request 1 GET /employees/42
│ Session.get(Employee, 42)
│ → Empty identity map → emits SELECT → loads e42
│ Identity map: { Employee#42: {email: "old@corp.com"} }
│ Returns: old@corp.com ✓
│
─ request ends → @teardown_appcontext fires → Session.remove() ──
│             Identity map cleared. Connection returned to pool.
│
Request 2 PUT /employees/42 body: {email: "new@corp.com"}
│ Fresh session. UPDATE + COMMIT. DB: new@corp.com ✓
│
─ request ends → @teardown_appcontext fires → Session.remove() ──
│             Identity map cleared. Connection returned to pool.
│
Request 3 GET /employees/42 (different user's client)
│ Session.get(Employee, 42)
│ → Empty identity map → emits SELECT
│ → DB returns: {email: "new@corp.com"} ✓
│
▼ Every request: clean session, fresh data, correct result
Avg latency: 1.2ms per primary-key SELECT. Acceptable cost.
We also added two additional layers of defence. First, we set expire_on_commit=True
explicitly in the session factory — this is the default, but naming it makes the intent
visible to future reviewers. When an object is committed, all its attributes are expired,
meaning the next access will reload from the database even if the session persists.
Second, we added a Prometheus gauge tracking active sessions per worker, which now alerts
if a session outlives a single request cycle.
What Changed After the Incident
- Session lifecycle test in CI: We added an integration test that performs a write via one simulated request context, then explicitly removes the session, then performs a read in a new context — and asserts the read returns the updated value. This test runs on every PR. It would have caught this bug before it reached staging.
- Mandatory teardown linting: We added a custom Flake8 rule that errors if any Flask app module imports scoped_session but does not call Session.remove() in a function decorated with @app.teardown_appcontext. The rule is crude but effective — it fails the build if the safety net is missing.
- Session-per-request in new services: For new Flask services, we switched to a context-manager pattern — no global scoped registry, just a with get_db_session() as session: block in each route handler. Explicit scope, explicit cleanup, no implicit global state. The teardown approach works, but the context-manager pattern makes the lifecycle impossible to miss in a code review.
- Query emission logging in staging: We turned on SQLAlchemy's echo=True in the staging environment and wrote a test that fires a write followed by a read and asserts that a SELECT is emitted for the read. If the identity map suppresses the SELECT, the test fails. No SELECT, no pass.
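The context-manager pattern looks roughly like this (get_db_session here is an illustrative sketch, not our production helper):

```python
from contextlib import contextmanager

from sqlalchemy import create_engine, text
from sqlalchemy.orm import sessionmaker

engine = create_engine("sqlite://")  # stand-in engine for the sketch
SessionFactory = sessionmaker(bind=engine)

@contextmanager
def get_db_session():
    """One session per with-block: commit on success, rollback on error,
    always closed -- the identity map dies with the block."""
    session = SessionFactory()
    try:
        yield session
        session.commit()
    except Exception:
        session.rollback()
        raise
    finally:
        session.close()

# In a route handler this reads as:
with get_db_session() as session:
    assert session.execute(text("SELECT 1")).scalar() == 1
```

The scope is visible at the call site: when the block ends, the session is gone, and no teardown hook has to remember that for you.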
The Broader Pattern: ORM Caches You Didn't Know You Had
Every mature ORM ships with at least one caching layer, and almost none of them
announce themselves loudly. SQLAlchemy has the identity map (first-level cache) and, via documented recipes, an optional second-level cache. Django's ORM evaluates querysets lazily, but once a queryset is evaluated its results are cached on it, so iterating it twice reuses the cached rows, and select_related and prefetch_related results are likewise stored on the queryset. Hibernate has first-level (session) and second-level
(shared) caches. Each one is a performance optimisation that becomes a correctness hazard
when the scope boundaries aren't respected.
The identity map is designed for within-transaction consistency, not cross-request consistency. When your session outlives your transaction, you've changed its contract without telling it.
The failure mode is always the same: fast, correct-looking, wrong. There are no exceptions because no error occurred from the ORM's perspective. The session did exactly what it was designed to do — return a cached object efficiently. The bug was in the frame: we were treating a transaction-scoped cache as if it were a request-scoped one.
If you're running Flask with SQLAlchemy outside Flask-SQLAlchemy, grep your codebase
for scoped_session right now. Find every Flask app that imports it. Check
that every one has a @app.teardown_appcontext handler that calls
Session.remove(). If it doesn't, you have a latent stale-read bug that
only fires under concurrent load — which means it fired in production before you saw
it in staging.
After we deployed the fix, read latency for employee endpoints increased by an average of 1.4ms due to the additional SELECT being emitted. It was the best regression we ever shipped.