How a Single Power User's Post Triggered 45,000 DB Queries and Crashed Our Mobile API
At 3:22 PM on a Tuesday, mobile crash reports spiked from background noise to 847 per minute. Our iOS and Android apps were returning blank feeds, failed logins, and unresponsive comment threads. The on-call page fired 11 seconds later. By the time I opened the dashboard, 62% of mobile API requests were failing with 503s. It took us 3 hours to restore full service — and the root cause was a single user clicking the "Post" button.
Production Failure
We ran a social-layer feature inside our SaaS platform's mobile app — users could post updates, follow each other, and receive push notifications when someone they followed published new content. The feature had been live for 8 months without incident.
At 3:21 PM, @techinfluencer — a verified account with 45,312 followers — published a product update post. Standard action. Our backend processed it identically to the 2,000 other posts created that day.
Within 90 seconds, our Flask API pods were rejecting connections. Datadog showed the database connection pool fully saturated: new requests couldn't acquire a connection, timed out after the 30-second wait limit, and then failed with 503s.
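The failure mode is easy to reproduce in miniature with a stdlib queue standing in for SQLAlchemy's connection pool. The pool size of 20 and the wait-then-503 behavior match the incident; the 0.1-second timeout here is only to keep the demo fast (production used 30 seconds):

```python
import queue

# Stand-in for SQLAlchemy's QueuePool: a fixed set of connection slots.
POOL_SIZE = 20
pool = queue.Queue(maxsize=POOL_SIZE)
for i in range(POOL_SIZE):
    pool.put(f"conn-{i}")

def acquire(timeout):
    """Borrow a connection, or fail the request after the wait limit."""
    try:
        return pool.get(timeout=timeout)
    except queue.Empty:
        raise TimeoutError("pool exhausted -> 503")

# Long-running fanout requests borrow every connection and never return them.
held = [acquire(timeout=0.1) for _ in range(POOL_SIZE)]

# Every other request now waits, times out, and fails.
try:
    acquire(timeout=0.1)
except TimeoutError as e:
    print(e)  # prints: pool exhausted -> 503
```

Nothing about the pool is broken here; it is doing exactly what it was configured to do. The problem is the caller that holds its slot for minutes instead of milliseconds.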
False Assumptions
Our first assumption was a DDoS or traffic spike. The app had just been featured in a tech newsletter that morning and daily active users were up 18%. The ops team checked CloudFront metrics — CDN traffic was elevated but not unusual. No flood of incoming requests.
Second assumption: a runaway slow query. We pulled slow query logs from PostgreSQL and found nothing over 500ms. The individual queries were fast. The problem wasn't query performance — it was query volume.
Third assumption: a memory leak in a Flask worker. We'd had an issue 3 months earlier where a background thread leaked file handles. We restarted the pods. Partial recovery for 4 minutes, then the same failure pattern resumed. The trigger was still running.
No one initially connected the outage to a specific user's post. We were looking at aggregate metrics, not per-request traces.
Profiling the Failure Path
The breakthrough came from our distributed tracing setup (Jaeger). A senior engineer filtered traces by duration > 10 seconds and found hundreds of traces originating from a single endpoint: POST /api/v1/posts/create. Each trace showed the same pattern — one HTTP request spawning thousands of child database spans.
The trace for @techinfluencer's post creation:
POST /api/v1/posts/create                            [duration: 312s]
│
├── INSERT INTO posts (...)                          [2ms]
├── SELECT user_id FROM follows                      [8ms]
│     WHERE followed_user_id = ?  → 45,312 rows
│
└── for follower_id in follower_ids:                 [loop × 45,312]
      ├── SELECT device_token FROM devices           [avg 6ms each]
      │     WHERE user_id = ?                        [= 271s total]
      │
      └── POST https://fcm.googleapis.com/send       [avg 45ms each]
            (synchronous HTTP — blocking)            [= 2,039s serial]
The post creation endpoint was holding a database connection open for the entire duration of the fanout loop — including the 45ms FCM HTTP round trips. With a pool size of 20 connections, a single post from a power user monopolized all 20 connections within seconds, leaving nothing for the rest of the application.
Root Cause
The create_post Flask route performed a synchronous fanout — iterating over every follower and making one database query plus one FCM HTTP call per follower, all within the same request context. The SQLAlchemy session (and its borrowed connection) stayed open throughout.
@posts_bp.route('/create', methods=['POST'])
@jwt_required()
def create_post():
    user_id = get_jwt_identity()
    data = request.get_json()

    # Insert the post
    post = Post(user_id=user_id, content=data['content'])
    db.session.add(post)
    db.session.commit()

    # Fanout: notify all followers — THIS IS THE PROBLEM
    followers = db.session.execute(
        text("SELECT user_id FROM follows WHERE followed_user_id = :uid"),
        {"uid": user_id}
    ).fetchall()

    # One DB query + one FCM call per follower, synchronously.
    # DB connection held open the entire time.
    for (follower_id,) in followers:
        device = db.session.execute(
            text("SELECT token FROM devices WHERE user_id = :uid LIMIT 1"),
            {"uid": follower_id}
        ).fetchone()
        if device:
            requests.post(
                "https://fcm.googleapis.com/v1/projects/myapp/messages:send",
                headers={"Authorization": f"Bearer {get_fcm_token()}"},
                json={"message": {"token": device.token, "notification": {
                    "title": "New post",
                    "body": data['content'][:100]
                }}}
            )  # Synchronous — blocks the thread and holds the DB connection

    return jsonify({"status": "ok", "post_id": post.id})
Three compounding mistakes:
- N+1 device token queries: one SELECT per follower instead of a single bulk fetch
- Synchronous FCM calls inside the request: each 45ms network call extended the connection hold time
- No follower count guard: the same code path ran for a user with 12 followers and a user with 45,312
The math was brutal: 45,312 followers × (6ms DB + 45ms FCM) = ~2,300 seconds of serial work per post, holding one DB connection the entire time. With 20 connections in the pool, just 20 concurrent power-user posts would exhaust the pool and starve the entire API.
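A few lines verify the figures, using the average per-call latencies from the trace:

```python
followers = 45_312
db_ms, fcm_ms = 6, 45                  # average latencies from the trace

db_total = followers * db_ms / 1000    # serial DB time, seconds
fcm_total = followers * fcm_ms / 1000  # serial FCM time, seconds
serial_seconds = db_total + fcm_total

print(int(db_total), int(fcm_total), round(serial_seconds))
# prints: 271 2039 2311
```

The DB queries alone account for over four minutes; the synchronous FCM round trips account for over half an hour. Both numbers scale linearly with follower count.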
Architecture Fix
We needed to decouple the fanout from the HTTP request entirely. The post creation response should return in milliseconds; the notification work should happen asynchronously, in bulk, with its own resource allocation.
We chose Celery with Redis as the task queue. Why Celery over a raw thread pool or asyncio? Celery tasks survive pod restarts, support retries with backoff, and can be monitored via Flower. Asyncio would have unblocked the request but still tied the fanout to the same process and connection pool. A dedicated worker fleet with its own pool was the right separation.
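For reference, a minimal Celery bootstrap for this shape of deployment might look like the sketch below. The broker URL, queue name, and module layout are illustrative, not our production config verbatim:

```python
# celery_app.py: illustrative wiring, with settings chosen to match
# the guarantees named above (restart survival, retries, monitoring).
from celery import Celery

celery_app = Celery(
    "fanout",
    broker="redis://redis:6379/0",    # assumption: Redis service hostname
    backend="redis://redis:6379/1",
)

celery_app.conf.update(
    task_acks_late=True,              # re-deliver if a worker pod dies mid-task
    worker_prefetch_multiplier=1,     # don't let one worker hoard the queue
    task_default_queue="notifications",
)

# Workers run separately from the Flask pods, with their own DB pool:
#   celery -A celery_app worker -Q notifications --concurrency=8
# Flower attaches to the same broker for monitoring:
#   celery -A celery_app flower
```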
BEFORE (synchronous fanout):

Mobile Client
      │
      ▼
Flask API Pod (1 request)
      │
      ├── INSERT post                      [DB conn acquired]
      │
      └── for each follower:               [DB conn HELD the entire loop]
            ├── SELECT device token
            └── POST FCM (45ms network)    ← blocks conn × 45,312 times
      [DB conn released — after 312s]
AFTER (async fanout via Celery):

Mobile Client
      │
      ▼
Flask API Pod (1 request)
      │
      ├── INSERT post                          [DB conn acquired]
      ├── db.session.commit()                  [DB conn released — after 10ms]
      ├── fanout_task.delay(post_id, user_id)  [enqueue task to Redis — 1ms]
      └── return {"status":"ok"}               [response in ~15ms]
               │
               ▼  (async, separate process)
Celery Worker Pool
      │
      ├── SELECT user_id FROM follows ...      [one query, all followers]
      ├── SELECT user_id, token FROM devices   [one query, all tokens]
      │     WHERE user_id = ANY(follower_ids)
      │
      └── for chunk in chunks(tokens, 500):    [FCM batch API]
            POST fcm.googleapis.com/v1/.../messages:send
            body: { "messages": [ ... up to 500 ... ] }
            [45ms per batch, not per user]
The rewritten fanout task:
from celery import shared_task
from sqlalchemy import text
from itertools import islice
from db import engine
import httpx

FCM_BATCH_SIZE = 500

def chunked(iterable, size):
    it = iter(iterable)
    while chunk := list(islice(it, size)):
        yield chunk

@shared_task(bind=True, max_retries=3, default_retry_delay=30)
def fanout_push_notifications(self, post_id: int, author_id: int, content: str):
    """
    Async fanout: bulk-fetch all follower device tokens,
    then send via the FCM batch API in chunks of 500.
    Uses its own DB connection — never touches the web pool.
    """
    try:
        with engine.connect() as conn:
            # Single query for all follower tokens
            rows = conn.execute(text("""
                SELECT d.token
                FROM follows f
                JOIN devices d ON d.user_id = f.user_id
                WHERE f.followed_user_id = :author_id
                  AND d.token IS NOT NULL
            """), {"author_id": author_id}).fetchall()

        # Connection is back in the pool here, before any network I/O
        tokens = [row.token for row in rows]
        if not tokens:
            return {"sent": 0}

        fcm_token = get_fcm_access_token()  # OAuth helper, defined elsewhere
        sent = 0
        for batch in chunked(tokens, FCM_BATCH_SIZE):
            messages = [
                {"token": t, "notification": {
                    "title": "New post",
                    "body": content[:100]
                }} for t in batch
            ]
            resp = httpx.post(
                "https://fcm.googleapis.com/batch",
                headers={"Authorization": f"Bearer {fcm_token}"},
                json={"messages": messages},
                timeout=30
            )
            resp.raise_for_status()
            sent += len(batch)

        # Ceiling division: e.g. 45,312 tokens -> 91 batches
        batches = (len(tokens) + FCM_BATCH_SIZE - 1) // FCM_BATCH_SIZE
        return {"sent": sent, "batches": batches}
    except Exception as exc:
        raise self.retry(exc=exc)
The updated route reduced to:
@posts_bp.route('/create', methods=['POST'])
@jwt_required()
def create_post():
    user_id = get_jwt_identity()
    data = request.get_json()

    post = Post(user_id=user_id, content=data['content'])
    db.session.add(post)
    db.session.commit()  # Connection released immediately after commit

    # Fire and forget — Celery handles the rest
    fanout_push_notifications.delay(post.id, user_id, data['content'])
    return jsonify({"status": "ok", "post_id": post.id}), 201
We also added a safeguard at the database level — a materialized view caching follower counts, updated every 5 minutes. For users with more than 10,000 followers we throttle fanout frequency to at most once every 60 seconds per author using a Redis lock. This prevents a coordinated burst (multiple power users posting simultaneously) from overwhelming the Celery queue.
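The throttle itself is just an expiring lock. In production this is a Redis `SET key 1 NX EX 60`; the sketch below uses an in-memory stand-in so the logic is runnable on its own, and names like `should_fanout_now` are illustrative rather than our actual helpers:

```python
import time

FOLLOWER_THRESHOLD = 10_000
THROTTLE_SECONDS = 60

# In-memory stand-in for Redis; real code: r.set(key, 1, nx=True, ex=60)
_locks = {}

def try_acquire(key: str, ttl: float, now=None) -> bool:
    """SET key NX EX ttl: succeeds only if no unexpired lock exists."""
    now = time.monotonic() if now is None else now
    expires = _locks.get(key)
    if expires is not None and expires > now:
        return False
    _locks[key] = now + ttl
    return True

def should_fanout_now(author_id: int, follower_count: int, now=None) -> bool:
    """Small accounts fan out immediately; power users at most once a minute."""
    if follower_count <= FOLLOWER_THRESHOLD:
        return True
    return try_acquire(f"fanout:{author_id}", THROTTLE_SECONDS, now=now)
```

A first post from a 45,000-follower author acquires the lock and fans out; a second post inside the 60-second window does not, so its fanout is deferred rather than dumped onto the Celery queue immediately.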
Results After Deploy
The @techinfluencer test account posted again 2 days after the fix. The post creation API responded in 13ms. Celery processed all 45,312 notifications across 91 FCM batch requests in 4.1 seconds in the background. Zero impact on API availability.
Lessons Learned
- Fanout is not a request-time operation. Any work that scales with follower count, subscriber count, or list size belongs in an async task queue — not in the HTTP request lifecycle. The API response time should be O(1), not O(followers).
- Database connections are the bottleneck, not CPU. A pool of 20 connections sounds generous until one request monopolizes all 20 for 312 seconds. Design connection hold time to be as short as possible — commit early, release early.
- Test with power-user data, not average data. Our local test seed had users with at most 50 followers. The production failure pattern only emerged at 45,000. Load tests should include outlier accounts that represent the extreme tail of your distribution.
- Distributed traces catch what aggregate metrics miss. CPU, memory, and request rate all looked manageable. Only the per-request trace showed the 45,000 child spans hiding inside a single slow request. Tracing is non-optional for mobile backends.
- Use vendor batch APIs. FCM supports up to 500 messages per batch request. Sending 45,000 individual HTTP calls when a batch API exists is a design smell — and a production risk.
— Darshan Turakhia