How a Single Power User's Post Triggered 45,000 DB Queries and Crashed Our Mobile API
March 14, 2026 · Mobile · 9 min read


At 3:22 PM on a Tuesday, mobile crash reports spiked from background noise to 847 per minute. Our iOS and Android apps were returning blank feeds, failed logins, and unresponsive comment threads. The on-call page fired 11 seconds later. By the time I opened the dashboard, 62% of mobile API requests were failing with 503s. It took us 3 hours to restore full service — and the root cause was a single user clicking the "Post" button.


Production Failure

We ran a social-layer feature inside our SaaS platform's mobile app — users could post updates, follow each other, and receive push notifications when someone they followed published new content. The feature had been live for 8 months without incident.

At 3:21 PM, @techinfluencer — a verified account with 45,312 followers — published a product update post. Standard action. Our backend processed it identically to the 2,000 other posts created that day.

Within 90 seconds, our Flask API pods were rejecting connections. Datadog showed:

847/min Mobile crash reports
62% API requests failing (503)
45s API p99 latency (was 210ms)
3 hrs Time to full recovery

The database connection pool was fully saturated. New requests couldn't acquire a connection and were timing out after the 30-second wait limit, then failing with 503.
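
For context, our engine configuration at the time looked roughly like this (a sketch — the DSN is a placeholder, but the pool size of 20 and the 30-second acquire timeout match the incident numbers):

```python
# db.py — engine settings in effect during the incident (illustrative DSN)
from sqlalchemy import create_engine

engine = create_engine(
    "postgresql+psycopg2://app:secret@db:5432/myapp",  # placeholder
    pool_size=20,     # 20 connections shared by every request handler
    max_overflow=0,   # no burst headroom beyond the pool
    pool_timeout=30,  # waiters give up after 30s — surfaced to clients as 503s
)
```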


False Assumptions

Our first assumption was a DDoS or traffic spike. The app had just been featured in a tech newsletter that morning and daily active users were up 18%. The ops team checked CloudFront metrics — CDN traffic was elevated but not unusual. No flood of incoming requests.

Second assumption: a runaway slow query. We pulled slow query logs from PostgreSQL and found nothing over 500ms. The individual queries were fast. The problem wasn't query performance — it was query volume.

Third assumption: a memory leak in a Flask worker. We'd had an issue 3 months earlier where a background thread leaked file handles. We restarted the pods. Partial recovery for 4 minutes, then the same failure pattern resumed. The trigger was still running.

No one initially connected the outage to a specific user's post. We were looking at aggregate metrics, not per-request traces.


Profiling the Failure Path

The breakthrough came from our distributed tracing setup (Jaeger). A senior engineer filtered traces by duration > 10 seconds and found hundreds of traces originating from a single endpoint: POST /api/v1/posts/create. Each trace showed the same pattern — one HTTP request spawning thousands of child database spans.

The trace for @techinfluencer's post creation:

POST /api/v1/posts/create  [duration: 312s]
│
├── INSERT INTO posts (...)                     [2ms]
├── SELECT user_id FROM follows WHERE           [8ms]
│   followed_user_id = ?   → 45,312 rows
│
└── for follower_id in follower_ids:  [loop × 45,312]
    ├── SELECT device_token FROM devices        [avg 6ms each]
    │   WHERE user_id = ?                       [= 271s total]
    │
    └── POST https://fcm.googleapis.com/send    [avg 45ms each]
        (synchronous HTTP — blocking)           [= 2,039s serial]

The post creation endpoint was holding a database connection open for the entire duration of the fanout loop — including the 45ms FCM HTTP round trips. With a pool size of 20 connections, a single post from a power user monopolized all 20 connections within seconds, leaving nothing for the rest of the application.

The endpoint was holding a DB connection hostage across 45,000 network calls to Google's FCM servers. Every millisecond of network latency multiplied across the follower count.

Root Cause

The create_post Flask route performed a synchronous fanout — iterating over every follower and making one database query plus one FCM HTTP call per follower, all within the same request context. The SQLAlchemy session (and its borrowed connection) stayed open throughout.

api/routes/posts.py — the original broken implementation
@posts_bp.route('/create', methods=['POST'])
@jwt_required()
def create_post():
    user_id = get_jwt_identity()
    data = request.get_json()

    # Insert the post
    post = Post(user_id=user_id, content=data['content'])
    db.session.add(post)
    db.session.commit()

    # Fanout: notify all followers — THIS IS THE PROBLEM
    followers = db.session.execute(
        text("SELECT user_id FROM follows WHERE followed_user_id = :uid"),
        {"uid": user_id}
    ).fetchall()

    # One DB query + one FCM call per follower, synchronously
    # DB connection held open the entire time
    for (follower_id,) in followers:
        device = db.session.execute(
            text("SELECT token FROM devices WHERE user_id = :uid LIMIT 1"),
            {"uid": follower_id}
        ).fetchone()

        if device:
            requests.post(
                "https://fcm.googleapis.com/v1/projects/myapp/messages:send",
                headers={"Authorization": f"Bearer {get_fcm_token()}"},
                json={"message": {"token": device.token, "notification": {
                    "title": "New post",
                    "body": data['content'][:100]
                }}}
            )  # Synchronous — blocks the thread and holds DB connection

    return jsonify({"status": "ok", "post_id": post.id})

Three compounding mistakes:

  • N+1 device token queries: one SELECT per follower instead of a single bulk fetch
  • Synchronous FCM calls inside the request: each 45ms network call extended the connection hold time
  • No follower count guard: the same code path ran for a user with 12 followers and a user with 45,312

The math was brutal: 45,312 followers × (6ms DB + 45ms FCM) = ~2,300 seconds of serial work per post, holding one DB connection the entire time. At 20 connections in the pool, just 20 concurrent power-user posts would starve the entire API of database connections.
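Plugging the traced per-call latencies into a quick sanity check reproduces that figure:

```python
# Sanity-check the fanout math using the per-call latencies from the trace
followers = 45_312
db_ms, fcm_ms = 6, 45                      # avg per-follower DB lookup + FCM call

serial_s = followers * (db_ms + fcm_ms) / 1000
print(round(serial_s))                     # → 2311 seconds of serial work

# One post pins one connection for the whole duration, so a pool of 20
# is fully consumed by just 20 concurrent power-user posts.
print(round(serial_s / 60, 1))             # → 38.5 minutes per post
```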


Architecture Fix

We needed to decouple the fanout from the HTTP request entirely. The post creation response should return in milliseconds; the notification work should happen asynchronously, in bulk, with its own resource allocation.

We chose Celery with Redis as the task queue. Why Celery over a raw thread pool or asyncio? Celery tasks survive pod restarts, support retries with backoff, and can be monitored via Flower. Asyncio would have unblocked the request but still tied the fanout to the same process and connection pool. A dedicated worker fleet with its own pool was the right separation.
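The wiring itself is small. A sketch of the app setup — the broker and backend URLs are illustrative for our cluster:

```python
# celery_app.py — task queue wiring (URLs are illustrative)
from celery import Celery

celery = Celery(
    "myapp",
    broker="redis://redis:6379/0",   # API pods enqueue fanout tasks here
    backend="redis://redis:6379/1",  # optional result store
)
# Re-deliver a task if a worker pod dies mid-fanout, and stop any one
# worker from prefetching a backlog of heavy fanout jobs.
celery.conf.task_acks_late = True
celery.conf.worker_prefetch_multiplier = 1
```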

BEFORE (synchronous fanout):

Mobile Client
    │
    ▼
Flask API Pod (1 request)
    │
    ├── INSERT post            [DB conn acquired]
    │
    └── for each follower:     [DB conn HELD the entire loop]
        ├── SELECT device token
        └── POST FCM (45ms network)    ← blocks conn × 45,312 times
                                        [DB conn released — after 312s]


AFTER (async fanout via Celery):

Mobile Client
    │
    ▼
Flask API Pod (1 request)
    │
    ├── INSERT post            [DB conn acquired]
    ├── db.session.commit()    [DB conn released — after 10ms]
    ├── fanout_task.delay(     [Enqueue task to Redis — 1ms]
    │     post_id, user_id)
    └── return {"status":"ok"} [Response in ~15ms]

          │
          ▼ (async, separate process)

Celery Worker Pool
    │
    ├── SELECT user_id FROM follows ...     [one query, all followers]
    ├── SELECT user_id, token FROM devices  [one query, all tokens]
    │   WHERE user_id = ANY(follower_ids)
    │
    └── for chunk in chunks(tokens, 500):  [FCM batch API]
        POST fcm.googleapis.com/v1/.../messages:send
        body: { "messages": [... up to 500 ...] }
        [45ms per batch, not per user]

The rewritten fanout task:

tasks/notifications.py — async bulk fanout
from celery import shared_task
from sqlalchemy import text
from itertools import islice
from db import engine
import httpx

FCM_BATCH_SIZE = 500

def chunked(iterable, size):
    it = iter(iterable)
    while chunk := list(islice(it, size)):
        yield chunk

@shared_task(bind=True, max_retries=3, default_retry_delay=30)
def fanout_push_notifications(self, post_id: int, author_id: int, content: str):
    """
    Async fanout: bulk-fetch all follower device tokens,
    then send via FCM batch API in chunks of 500.
    Uses its own DB connection — never touches the web pool.
    """
    try:
        with engine.connect() as conn:
            # Single query for all follower tokens
            rows = conn.execute(text("""
                SELECT d.token
                FROM follows f
                JOIN devices d ON d.user_id = f.user_id
                WHERE f.followed_user_id = :author_id
                  AND d.token IS NOT NULL
            """), {"author_id": author_id}).fetchall()

        tokens = [row.token for row in rows]

        if not tokens:
            return {"sent": 0}

        fcm_token = get_fcm_access_token()
        sent = 0

        for batch in chunked(tokens, FCM_BATCH_SIZE):
            messages = [
                {"token": t, "notification": {
                    "title": "New post",
                    "body": content[:100]
                }} for t in batch
            ]
            resp = httpx.post(
                "https://fcm.googleapis.com/batch",
                headers={"Authorization": f"Bearer {fcm_token}"},
                json={"messages": messages},
                timeout=30
            )
            resp.raise_for_status()
            sent += len(batch)

        # Ceil division so an exact multiple of 500 doesn't over-count
        return {"sent": sent, "batches": -(-len(tokens) // FCM_BATCH_SIZE)}

    except Exception as exc:
        raise self.retry(exc=exc)
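
The workers run as their own deployment, fully isolated from the web pods — something like the following (the module path is illustrative, and the second command assumes Flower is installed):

```shell
# Dedicated fanout workers — scaled independently of the API pods
celery -A tasks.notifications worker --loglevel=info --concurrency=4

# Flower dashboard for watching queue depth, task latency, and retries
celery -A tasks.notifications flower --port=5555
```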

The updated route reduced to:

api/routes/posts.py — after fix
@posts_bp.route('/create', methods=['POST'])
@jwt_required()
def create_post():
    user_id = get_jwt_identity()
    data = request.get_json()

    post = Post(user_id=user_id, content=data['content'])
    db.session.add(post)
    db.session.commit()  # Connection released immediately after commit

    # Fire and forget — Celery handles the rest
    fanout_push_notifications.delay(post.id, user_id, data['content'])

    return jsonify({"status": "ok", "post_id": post.id}), 201

We also added a safeguard at the database level — a materialized view caching follower counts, updated every 5 minutes. For users with more than 10,000 followers we throttle fanout frequency to at most once every 60 seconds per author using a Redis lock. This prevents a coordinated burst (multiple power users posting simultaneously) from overwhelming the Celery queue.
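The per-author throttle reduces to one atomic Redis operation, SET with NX and EX. A sketch — the key name is illustrative, and a tiny in-memory stand-in with the same `set(nx=, ex=)` signature keeps the example self-contained:

```python
import time

def try_acquire_fanout_slot(redis_client, author_id: int, window_s: int = 60) -> bool:
    """Return True if this author may fan out now; False if throttled.

    SET key NX EX=window_s is atomic in Redis, so concurrent posts from
    the same author race safely: exactly one wins the 60-second window.
    """
    return bool(redis_client.set(f"fanout:lock:{author_id}", "1", nx=True, ex=window_s))

# Minimal in-memory stand-in mirroring redis-py's set() return values
# (True on success, None when NX blocks the write) — for demonstration only.
class FakeRedis:
    def __init__(self):
        self._expiry = {}
    def set(self, key, value, nx=False, ex=None):
        now = time.monotonic()
        if nx and self._expiry.get(key, 0) > now:
            return None                      # key still live → lock not acquired
        self._expiry[key] = now + (ex or float("inf"))
        return True

r = FakeRedis()
print(try_acquire_fanout_slot(r, author_id=42))  # True — first post wins the window
print(try_acquire_fanout_slot(r, author_id=42))  # False — throttled for 60s
```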


Results After Deploy

~15ms Post creation latency (was 45s p99)
0% 503 error rate (was 62%)
91 FCM batch requests (was 45,312 individual calls)
3 Celery worker pods handle all fanout load

The @techinfluencer test account posted again 2 days after the fix. The post creation API responded in 13ms. Celery processed all 45,312 notifications across 91 FCM batch requests in 4.1 seconds in the background. Zero impact on API availability.


Lessons Learned

  • Fanout is not a request-time operation. Any work that scales with follower count, subscriber count, or list size belongs in an async task queue — not in the HTTP request lifecycle. The API response time should be O(1), not O(followers).
  • Database connections are the bottleneck, not CPU. A pool of 20 connections sounds generous until one request monopolizes all 20 for 312 seconds. Design connection hold time to be as short as possible — commit early, release early.
  • Test with power-user data, not average data. Our local test seed had users with at most 50 followers. The production failure pattern only emerged at 45,000. Load tests should include outlier accounts that represent the extreme tail of your distribution.
  • Distributed traces catch what aggregate metrics miss. CPU, memory, and request rate all looked manageable. Only the per-request trace showed the 45,000 child spans hiding inside a single slow request. Tracing is non-optional for mobile backends.
  • Use vendor batch APIs. FCM supports up to 500 messages per batch request. Sending 45,000 individual HTTP calls when a batch API exists is a design smell — and a production risk.

The code that worked fine for 8 months and 50,000 daily posts was never tested against a single user with 45,000 followers. Scale assumptions are invisible until they're catastrophic.

— Darshan Turakhia
