How a Redis Cache Key Missing One Field Leaked Client Data Across Tenants for 72 Hours
On a Tuesday afternoon, a client emailed support saying their dashboard was showing the wrong company name. We assumed a display bug — a stale frontend cache, a mismatched JOIN. It took us four hours to accept the real answer: for 72 hours, one paying enterprise tenant had been reading another tenant's confidential project data, served silently by our own Redis cache.
Production Failure
The platform was a project management SaaS — multiple enterprise clients, each with their own isolated
workspace. Tenant isolation was enforced at the database layer: every query scoped by tenant_id,
every record owned by exactly one org. We'd audited this twice. The database layer was clean.
What we hadn't audited was the caching layer.
Our Flask API cached expensive responses in Redis using keys like project:{project_id}.
Project IDs were auto-incrementing integers from PostgreSQL. Tenant A's project #1041 and
Tenant B's project #1041 are two different records — but to Redis, they mapped to the same key.
Tenant A loaded their project first. Redis cached it as project:1041.
Tenant B loaded theirs 40 minutes later. Redis returned Tenant A's data. Tenant B's UI
rendered it without complaint — the shape matched, the fields matched, only the content was wrong.
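The collision itself is trivial to demonstrate. A minimal sketch (the tenant and project values are illustrative, not from our data):

```python
# A tenant-blind key builder, as our original caching helper effectively was.
def make_key(project_id: int) -> str:
    return f"project:{project_id}"

# Two different records: the same auto-increment ID under different tenants.
key_a = make_key(1041)  # Tenant A's project #1041
key_b = make_key(1041)  # Tenant B's project #1041

print(key_a == key_b)  # → True: one Redis key, two owners
```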
False Assumptions
The first instinct was frontend. The React app maintained local state — maybe a stale context, a component that didn't re-fetch on navigation. We spent 90 minutes in browser devtools before confirming that the API itself was returning the wrong data.
Second instinct: the database query. We grepped every query file for missing tenant_id filters.
Found nothing. Ran the suspect query manually with both tenant IDs — each returned exactly the right rows.
The database was correct.
"The database is clean. The API returns the wrong data. If the query is right and the result is wrong, something between the query and the response is substituting the answer."
That sentence is what finally pointed us at the cache.
Reproducing the Poisoning
Reproducing it took 8 minutes once we had the hypothesis. Two test tenant accounts, each with a project created fresh so they'd share an auto-increment ID. Load Tenant A's project endpoint — Redis caches it. Switch auth header to Tenant B, hit the same endpoint. Redis returns Tenant A's payload. Confirmed.
POISONED REQUEST FLOW
─────────────────────────────────────────────────────────────
Tenant A — GET /api/projects/1041
┌──────────┐     ┌──────────────────┐     ┌────────────────┐
│ Client A │────▶│ Flask API        │────▶│ Redis          │
└──────────┘     │                  │     │ MISS           │
                 │ cache_key =      │     │                │
                 │ "project:1041"   │◀────│ SET project:   │
                 │                  │     │ 1041 = {A data}│
                 └──────────────────┘     └────────────────┘
                          │
                          ▼
                 DB query WHERE id=1041   ← correct, returns A's row
                 Cache SET project:1041   ← keyed without tenant
40 minutes later — Tenant B — GET /api/projects/1041
┌──────────┐     ┌──────────────────┐     ┌────────────────┐
│ Client B │────▶│ Flask API        │────▶│ Redis          │
└──────────┘     │                  │     │ HIT ✓          │
                 │ cache_key =      │◀────│ "project:1041" │
                 │ "project:1041"   │     │ = {A data} ⚠️  │
                 └──────────────────┘     └────────────────┘
                          │
                          ▼
                 Returns Tenant A's data to Tenant B  ← never touches DB
CORRECT FLOW (after fix)
─────────────────────────────────────────────────────────────
Tenant B — GET /api/projects/1041
┌──────────┐     ┌──────────────────────────────────────────┐
│ Client B │────▶│ cache_key = "project:{tenant_id}:1041"   │
└──────────┘     │            = "project:tenant_b_uuid:1041"│
                 │                                          │
                 │ Redis MISS → DB query → cache SET        │
                 │ Returns Tenant B's data ✓                │
                 └──────────────────────────────────────────┘
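Both flows can be replayed end to end without a real Redis. Here is a sketch using a plain dict as the cache and a hypothetical two-tenant dataset (names and values are illustrative):

```python
import json

# Stand-ins for Redis and the projects table (illustrative data only).
cache: dict[str, str] = {}
db = {
    ("tenant_a", 1041): {"name": "Tenant A confidential roadmap"},
    ("tenant_b", 1041): {"name": "Tenant B onboarding plan"},
}

def get_project(tenant_id: str, project_id: int, scoped: bool):
    # scoped=False reproduces the bug; scoped=True is the fix
    key = (f"project:{tenant_id}:{project_id}" if scoped
           else f"project:{project_id}")
    if key in cache:
        return json.loads(cache[key])      # cache HIT: DB never consulted
    row = db[(tenant_id, project_id)]      # DB lookup itself is tenant-scoped
    cache[key] = json.dumps(row)
    return row

# Poisoned flow: Tenant B receives Tenant A's payload from the shared key.
a = get_project("tenant_a", 1041, scoped=False)
b = get_project("tenant_b", 1041, scoped=False)
print(b == a)  # → True: cross-tenant leak

cache.clear()

# Fixed flow: tenant-scoped keys never collide.
a = get_project("tenant_a", 1041, scoped=True)
b = get_project("tenant_b", 1041, scoped=True)
print(b == a)  # → False: each tenant sees its own data
```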
Root Cause: Cache Keys Built Without Tenant Scope
The caching helper had been written before multi-tenancy was added to the platform. When the tenant layer was bolted on, the database queries were updated correctly — but the cache key builder was never touched.
# BEFORE — tenant-blind cache key
def get_project(project_id: int):
    cache_key = f"project:{project_id}"
    cached = redis.get(cache_key)
    if cached:
        return json.loads(cached)
    row = db.execute(
        "SELECT * FROM projects WHERE id = %s AND tenant_id = %s",
        (project_id, g.tenant_id),  # DB query is scoped correctly
    ).fetchone()
    redis.setex(cache_key, 300, json.dumps(row))  # cache key is NOT
    return row
# AFTER — tenant-scoped cache key
def get_project(project_id: int):
    # Include tenant_id in the key — different tenants never share a cache entry
    cache_key = f"project:{g.tenant_id}:{project_id}"
    cached = redis.get(cache_key)
    if cached:
        return json.loads(cached)
    row = db.execute(
        "SELECT * FROM projects WHERE id = %s AND tenant_id = %s",
        (project_id, g.tenant_id),
    ).fetchone()
    if row is None:
        return None  # also: don't cache a None — that's a separate bug
    redis.setex(cache_key, 300, json.dumps(row))
    return row
# ALSO ADDED — cache key audit helper (run in CI)
CACHE_KEY_PATTERNS = {
    "project": "project:{tenant_id}:{project_id}",
    "member": "member:{tenant_id}:{member_id}",
    "report": "report:{tenant_id}:{report_id}:{date_range}",
}
# Any cache SET that doesn't match a known pattern raises in staging
Architecture Fix: Tenant-Scoped Keys + Cache Audit Layer
The immediate fix was straightforward — prefix every cache key with tenant_id.
We chose the tenant UUID (not the integer PK) specifically to prevent enumeration: an attacker
who can influence a cache key should not be able to guess another tenant's key by incrementing an integer.
We considered row-level security in PostgreSQL as the "real" fix — making it structurally
impossible for a query to return cross-tenant data even if the WHERE tenant_id clause
is omitted. We'll get there. But RLS requires a schema migration and careful testing across 40+
query sites. The cache key fix was safe, isolated, and deployable in under an hour.
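For reference, the RLS version we're working toward looks roughly like the following: standard PostgreSQL policy DDL, sketched here as migration strings. The `app.tenant_id` session setting is an assumption for illustration, not something our codebase defines today:

```python
# Sketch of the eventual RLS migration, held as PostgreSQL DDL strings.
# 'app.tenant_id' is an assumed custom setting the API would SET per request.
ENABLE_RLS = "ALTER TABLE projects ENABLE ROW LEVEL SECURITY;"

# FORCE makes the policy apply even to the table owner role.
FORCE_RLS = "ALTER TABLE projects FORCE ROW LEVEL SECURITY;"

TENANT_POLICY = """
CREATE POLICY tenant_isolation ON projects
    USING (tenant_id = current_setting('app.tenant_id')::uuid);
"""

# Per request, before any query:
#   SET app.tenant_id = '<tenant uuid>';
# After that, a query that forgets WHERE tenant_id = ... returns zero
# cross-tenant rows instead of leaking them.
```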
The second layer we added: a cache key registry. Every valid cache key shape is declared in a central manifest. In staging, any cache write using an unregistered pattern raises an exception. This turns "someone wrote a cache key without tenant scope" from a silent production bug into a CI failure.
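A minimal sketch of that guard, assuming a hypothetical `guarded_setex` wrapper around the Redis client and a staging flag (all names here are illustrative):

```python
import re

# Registered key shapes, mirroring the CACHE_KEY_PATTERNS manifest.
CACHE_KEY_PATTERNS = {
    "project": "project:{tenant_id}:{project_id}",
    "member": "member:{tenant_id}:{member_id}",
    "report": "report:{tenant_id}:{report_id}:{date_range}",
}

def _pattern_to_regex(pattern: str) -> "re.Pattern[str]":
    # "project:{tenant_id}:{project_id}"  ->  ^project:[^:]+:[^:]+$
    return re.compile("^" + re.sub(r"\{[^}]+\}", "[^:]+", pattern) + "$")

_REGISTERED = [_pattern_to_regex(p) for p in CACHE_KEY_PATTERNS.values()]

STAGING = True  # assumed environment flag

def guarded_setex(redis_setex, key: str, ttl: int, value: str):
    """Reject cache writes whose key shape is not in the registry."""
    if STAGING and not any(rx.match(key) for rx in _REGISTERED):
        raise ValueError(
            f"Cache key {key!r} matches no registered pattern: "
            "missing tenant scope? Register it in CACHE_KEY_PATTERNS."
        )
    return redis_setex(key, ttl, value)
```

In production, the same check could log a warning instead of raising, so an unregistered key degrades to an alert rather than an outage.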
CACHE KEY AUDIT IN CI PIPELINE
─────────────────────────────────────────────────────────────
Developer writes new cached endpoint
                 │
                 ▼
cache.set("new_resource:{id}", data)
                 │
                 ▼
CI: run cache_key_audit.py
                 │
                 ├── Key matches registered pattern? ──▶ ✅ PASS
                 │
                 └── Key NOT in registry? ──────────────▶ ❌ FAIL
                         "Cache key 'new_resource:{id}'
                          missing tenant scope.
                          Register in CACHE_KEY_PATTERNS
                          or add tenant_id prefix."
Lessons Learned
- Tenant isolation is not just a database concern. Every layer that persists or caches data — Redis, CDN, in-memory stores, even log aggregators — must be audited for tenant scope when multi-tenancy is added.
- Auto-increment IDs are collision-prone across tenants. If you use integer PKs, two tenants will always eventually have the same resource ID. Cache keys must carry the tenant identifier.
- Silent correctness failures are worse than crashes. This ran for 72 hours with zero errors, zero latency spikes, zero alerts. The only signal was a client email. Invest in data-correctness checks, not just availability monitoring.
- The fix for a multi-tenancy gap is rarely just the database. When multi-tenancy is layered onto a single-tenant codebase, assume every component that caches, queues, or stores data has the same blindspot.