How a Race Condition in Our Cron Job Sent 2.3 Million Duplicate Emails in One Night
March 14, 2026 · Architecture · 10 min read


At 6:47 AM, our support inbox had 4,800 unread tickets. Every single one was a variation of the same message: "Why did I receive the same email 11 times?" By 7:15 AM, SendGrid had suspended our account. By 8:00 AM, our domain was on three major blacklists. The cause? A cron job. Running on two servers. With no lock. For six months.

We hadn't noticed because most nights it didn't matter. The race window was about 200ms — just long enough to be catastrophic on a night when both servers happened to boot the job at exactly the same second.

The Architecture (Such As It Was)

Our platform sent a nightly email digest to around 210,000 active users — a summary of activity in their workspace over the previous 24 hours. The job was a Python script in our Flask codebase that ran at 2:00 AM UTC via cron, queried Postgres for users with pending activity, and fired off emails through SendGrid's batch API.

Six months earlier, we'd horizontally scaled our application tier from one server to two for redundancy. We added the cron job to both servers' crontabs — because "if one goes down, the other keeps running." This logic was correct in spirit. Catastrophic in practice.
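For context, the crontab entry on each server looked something like this (a reconstruction — the paths and user are illustrative, not our actual config):

```shell
# /etc/crontab — installed identically on BOTH Server A and Server B.
# Same schedule, zero coordination between hosts.
0 2 * * * appuser /usr/bin/python3 /opt/app/jobs/nightly_digest.py >> /var/log/nightly_digest.log 2>&1
```

Note that flock(1) would have serialized runs on a single host, but nothing in cron coordinates across hosts — that requires the distributed lock described later in this post.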

  THE SETUP (simplified)
  ─────────────────────────────────────────────────────
  
  Server A (cron)          Server B (cron)
       │                        │
       │   2:00:00 AM UTC       │
       │   both fire at once    │
       │                        │
       ▼                        ▼
  SELECT users WHERE       SELECT users WHERE
  has_activity = true      has_activity = true
       │                        │
       ├── Same 210,847 rows ──┤
       │                        │
       ▼                        ▼
  Send 210,847 emails      Send 210,847 emails
  
  Result: 421,694 emails sent (2x)
  
  But on the bad night:
  Retry logic + partial failures = 11x per user
  Total: ~2.3 million emails

Why It Was Fine Until It Wasn't

On most nights, Server A won the race by a few hundred milliseconds — it marked users' digests as sent = true in Postgres before Server B's SELECT ran. Server B would query an empty result set and exit cleanly. Total duplicate emails: zero. We had no idea we were one slow database query away from disaster.

The night it failed, our Postgres read replica was lagging about 3 seconds behind the primary due to an unrelated heavy analytics query. Server A marked users as sent on the primary — but Server B was reading from the lagging replica, so it still saw all 210,847 users as unsent. Both servers sent the full digest. Then our retry logic kicked in for the ~8% of emails that soft-failed on the first pass. Then the retry of the retry. By the time the job finished at 3:40 AM, every user had received between 9 and 14 emails.

  WHAT ACTUALLY HAPPENED
  ─────────────────────────────────────────────────────
  
  2:00:00 AM  Server A starts, reads 210,847 users (primary)
  2:00:00 AM  Server B starts, reads 210,847 users (lagging replica)
  
  2:00:03 AM  Server A marks users sent=true (primary)
  2:00:03 AM  Server B still sees sent=false (replica 3s behind)
  
  2:00:04 AM  Both servers begin sending → 421,694 emails
  
  2:00:30 AM  ~8% soft failures on both servers
  2:00:31 AM  Retry #1 → another ~16,868 emails per server (~33,735 total)
  2:01:00 AM  Retry #2 fires for partial failures
  
              ... continues until 3:40 AM ...
  
  Total emails sent: ~2.3 million
  Users affected: 210,847
  Average per user: 10.9 emails

The Discovery

I woke up to a Slack message from our CTO at 6:52 AM: "Have you seen support?" I hadn't. I opened the dashboard. We had a 4,847-ticket queue that hadn't existed when I went to bed. SendGrid's dashboard showed our bounce rate had spiked to 23% — anything over 5% is a red flag. Our account was already suspended pending review.

The first thing I did was grep the logs. The cron job timestamps told the story immediately:

# Server A logs
[2026-03-13 02:00:00] Starting nightly digest job
[2026-03-13 02:00:01] Querying users with pending activity...
[2026-03-13 02:00:01] Found 210,847 users. Beginning send.
[2026-03-13 02:00:03] Marked 210,847 users as sent=true
[2026-03-13 03:38:12] Job complete. Sent: 198,203. Failed: 12,644. Retried: 12,644.

# Server B logs (same timestamps)
[2026-03-13 02:00:00] Starting nightly digest job
[2026-03-13 02:00:01] Querying users with pending activity...
[2026-03-13 02:00:01] Found 210,847 users. Beginning send.
[2026-03-13 02:00:05] Marked 210,847 users as sent=true
[2026-03-13 03:40:31] Job complete. Sent: 199,041. Failed: 11,806. Retried: 11,806.

Both servers, same second, same user count. The replica lag meant neither could protect the other. Six months of luck, gone in one bad night.

The Fix: Distributed Locking

The correct solution is a distributed lock — a mechanism that ensures only one instance of a job runs at a time across all servers. We implemented it using Redis with a TTL-based lock pattern:

import redis
import time
import uuid

redis_client = redis.Redis(host='your-redis-host', decode_responses=True)

LOCK_KEY = 'cron:nightly_digest:lock'
LOCK_TTL = 7200  # 2 hours — max expected runtime + buffer

def run_nightly_digest():
    lock_id = str(uuid.uuid4())
    
    # SET NX EX: Set only if Not eXists, with EXpiry
    acquired = redis_client.set(LOCK_KEY, lock_id, nx=True, ex=LOCK_TTL)
    
    if not acquired:
        print("Lock already held. Another instance is running. Exiting.")
        return
    
    print(f"Lock acquired: {lock_id}")
    
    try:
        # ✅ Only ONE server will ever reach this point
        users = get_users_with_pending_activity()
        send_digest_emails(users)
        mark_users_sent(users)
    finally:
        # Release ONLY our lock (don't release if someone else somehow holds it)
        release_script = """
            if redis.call('get', KEYS[1]) == ARGV[1] then
                return redis.call('del', KEYS[1])
            else
                return 0
            end
        """
        redis_client.eval(release_script, 1, LOCK_KEY, lock_id)
        print(f"Lock released: {lock_id}")

run_nightly_digest()

The key details here:

  • SET NX EX is atomic — no race condition between checking and setting the lock
  • Lock includes a unique ID — so we only release locks we actually own
  • Lua script for release — check-and-delete is atomic; prevents releasing another server's lock if our TTL expired
  • TTL is generous — set it to 2x your worst-case runtime so a crashed server doesn't hold the lock forever
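The atomicity point is worth seeing concretely. The following toy in-process simulation (plain Python, not Redis) contrasts a naive check-then-set — which has the same race window our cron jobs hit — with an acquire that checks and sets in one indivisible step, as SET NX does:

```python
import threading
import time

class NaiveStore:
    """Check-then-set in two separate steps: both workers can 'win'."""
    def __init__(self):
        self.store = {}

    def acquire(self, key, val):
        if key not in self.store:   # step 1: check
            time.sleep(0.05)        # widen the race window for the demo
            self.store[key] = val   # step 2: set
            return True
        return False

class AtomicStore:
    """Mirrors Redis SET NX: check and set happen under one lock."""
    def __init__(self):
        self.store = {}
        self._mutex = threading.Lock()

    def acquire(self, key, val):
        with self._mutex:
            if key not in self.store:
                self.store[key] = val
                return True
            return False

def race(store):
    """Two 'servers' try to grab the same lock at the same instant."""
    wins = []
    barrier = threading.Barrier(2)

    def worker(name):
        barrier.wait()  # release both workers simultaneously
        if store.acquire("cron:nightly_digest:lock", name):
            wins.append(name)

    threads = [threading.Thread(target=worker, args=(n,)) for n in ("A", "B")]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return len(wins)

print("naive winners:", race(NaiveStore()))    # usually 2 — both think they hold it
print("atomic winners:", race(AtomicStore()))  # always 1
```

The naive version is exactly what a "SELECT then UPDATE" guard does across two servers; the atomic version is what the lock buys you.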

The Second Fix: Read From Primary

The distributed lock would have prevented the disaster even with replica lag. But we also fixed the underlying fragility: the digest job now reads directly from the Postgres primary, not the replica. For a once-a-night job, the added primary load is negligible. For any write-sensitive read, always target the primary.

# SQLAlchemy — explicitly target the primary for write-sensitive reads
from sqlalchemy import text

# engine is assumed to point at (or route to) the primary.
# postgresql_readonly=False marks the transaction read-write, so
# read-only routing layers won't send it to a replica.
with engine.connect().execution_options(
    postgresql_readonly=False
) as conn:
    users = conn.execute(
        text("SELECT id, email FROM users WHERE digest_sent_today = false")
    ).fetchall()

The Third Fix: Idempotency at the Row Level

Even with a distributed lock and primary reads, we added one more layer of protection: a database-level guard using UPDATE ... RETURNING to atomically claim users before sending their email. This means even if two processes somehow get through, only one can claim each user row.

-- Atomically claim users and return only unclaimed ones
-- No other process can claim the same rows simultaneously
UPDATE users
SET digest_claimed_at = NOW(), digest_claimed_by = 'server-a-job-id'
WHERE digest_sent_today = false
  AND digest_claimed_at IS NULL
RETURNING id, email;
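Here is the claim pattern in miniature, using an in-memory SQLite database as a stand-in for Postgres (SQLite supports UPDATE ... RETURNING from version 3.35; the table and column names match the article, the data is made up):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE users (
        id INTEGER PRIMARY KEY,
        email TEXT NOT NULL,
        digest_sent_today INTEGER NOT NULL DEFAULT 0,
        digest_claimed_at TEXT,
        digest_claimed_by TEXT
    )
""")
conn.executemany(
    "INSERT INTO users (email) VALUES (?)",
    [("a@example.com",), ("b@example.com",), ("c@example.com",)],
)

def claim_users(job_id):
    """Atomically claim every unclaimed user and return the claimed rows."""
    rows = conn.execute(
        """
        UPDATE users
        SET digest_claimed_at = datetime('now'), digest_claimed_by = ?
        WHERE digest_sent_today = 0
          AND digest_claimed_at IS NULL
        RETURNING id, email
        """,
        (job_id,),
    ).fetchall()
    conn.commit()
    return rows

first = claim_users("server-a-job-id")   # claims all three users
second = claim_users("server-b-job-id")  # nothing left to claim
print(len(first), len(second))           # 3 0
```

In Postgres the same statement is safe under concurrency: a second transaction's UPDATE blocks on the claimed rows, re-evaluates the WHERE clause after the first commits, and matches nothing.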

Damage Control

The technical fix took 40 minutes. The damage control took 3 weeks.

  • SendGrid: We called their support line directly. They reviewed our account, confirmed it was a technical error (our historical stats were clean), and reinstated us within 6 hours.
  • Blacklists: We submitted removal requests to Spamhaus, SORBS, and Barracuda. Spamhaus removed us in 48 hours. The others took 10–14 days. During that period, emails to some corporate inboxes (heavy Barracuda users) went straight to spam.
  • Users: We sent one apology email (irony noted). We offered 1 month free. Churn that month was 2.1% vs our usual 0.4%.
  • Domain reputation: Took about 3 weeks of low-volume, high-engagement sending to recover our sender score fully.

What to Watch For

If you're running scheduled jobs across multiple servers, ask yourself:

  1. Can two instances run simultaneously? If yes, is that safe?
  2. Do you read from replicas in write-sensitive flows? Replica lag is real.
  3. Is your retry logic bounded? Unbounded retries compound failures.
  4. Would you know if a job ran twice? Add a job execution log with server ID and timestamp.
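Point 4 deserves a sketch. A minimal job-execution log is one table plus two helpers — an illustrative schema, shown here on in-memory SQLite for portability:

```python
import sqlite3
import socket
from datetime import datetime, timezone

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE job_runs (
        id INTEGER PRIMARY KEY,
        job_name TEXT NOT NULL,
        server_id TEXT NOT NULL,
        started_at TEXT NOT NULL,
        finished_at TEXT,
        status TEXT
    )
""")

def record_start(job_name):
    """Insert a row the moment the job starts; returns the run id."""
    cur = conn.execute(
        "INSERT INTO job_runs (job_name, server_id, started_at) VALUES (?, ?, ?)",
        (job_name, socket.gethostname(), datetime.now(timezone.utc).isoformat()),
    )
    conn.commit()
    return cur.lastrowid

def record_finish(run_id, status):
    """Stamp the finish time and outcome on an existing run row."""
    conn.execute(
        "UPDATE job_runs SET finished_at = ?, status = ? WHERE id = ?",
        (datetime.now(timezone.utc).isoformat(), status, run_id),
    )
    conn.commit()

run_id = record_start("nightly_digest")
# ... job body runs here ...
record_finish(run_id, "success")

# A duplicate run is instantly visible: two rows with the same job_name,
# overlapping start/finish times, and different server_ids.
rows = conn.execute(
    "SELECT server_id, status FROM job_runs WHERE job_name = 'nightly_digest'"
).fetchall()
```

Had we had this table, a nightly "two rows for nightly_digest" alert would have flagged the duplicate runs on day one, not month six.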

Distributed locks aren't glamorous. Neither is spending three weeks rebuilding your domain reputation. One Redis SET NX EX would have cost us nothing. The race condition cost us three weeks, a domain blacklisting, and a noticeable churn spike.

Add the lock before you need it.
