How a Race Condition in Our Cron Job Sent 2.3 Million Duplicate Emails in One Night
March 14, 2026 · Architecture · 10 min read


At 6:47 AM, our support inbox had 4,800 unread tickets. Every single one was a variation of the same message: "Why did I receive the same email 11 times?" By 7:15 AM, SendGrid had suspended our account. By 8:00 AM, our domain was on three major blacklists. The cause? A cron job. Running on two servers. With no lock. For six months.

We hadn't noticed because most nights it didn't matter. The race window was about 200ms — just long enough to be catastrophic on a night when both servers happened to boot the job at exactly the same second.

The Architecture (Such As It Was)

Our platform sent a nightly email digest to around 210,000 active users — a summary of activity in their workspace over the previous 24 hours. The job was a Python script in our Flask codebase that ran at 2:00 AM UTC via cron, queried Postgres for users with pending activity, and fired off emails through SendGrid's batch API.

Six months earlier, we'd horizontally scaled our application tier from one server to two for redundancy. We added the cron job to both servers' crontabs — because "if one goes down, the other keeps running." This logic was correct in spirit. Catastrophic in practice.
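For context, the crontab entry on each server looked something like this (a reconstruction — the paths and user are illustrative, not our actual config):

```shell
# /etc/crontab — installed identically on BOTH Server A and Server B.
# Same schedule, zero coordination between hosts.
0 2 * * * appuser /usr/bin/python3 /opt/app/jobs/nightly_digest.py >> /var/log/nightly_digest.log 2>&1
```

Note that flock(1) would have serialized runs on a single host, but nothing in cron coordinates across hosts — that requires the distributed lock described later in this post.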

  THE SETUP (simplified)
  ─────────────────────────────────────────────────────
  
  Server A (cron)          Server B (cron)
       │                        │
       │   2:00:00 AM UTC       │
       │   both fire at once    │
       │                        │
       ▼                        ▼
  SELECT users WHERE       SELECT users WHERE
  has_activity = true      has_activity = true
       │                        │
       ├── Same 210,847 rows ──┤
       │                        │
       ▼                        ▼
  Send 210,847 emails      Send 210,847 emails
  
  Result: 421,694 emails sent (2x)
  
  But on the bad night:
  Retry logic + partial failures = 11x per user
  Total: ~2.3 million emails

Why It Was Fine Until It Wasn't

On most nights, Server A won the race by a few hundred milliseconds — it marked users' digests as sent = true in Postgres before Server B's SELECT ran. Server B would query an empty result set and exit cleanly. Total duplicate emails: zero. We had no idea we were one slow database query away from disaster.

The night it failed, our Postgres read replica was lagging about 3 seconds behind the primary due to an unrelated heavy analytics query. Server A marked users as sent on the primary — but Server B was reading from the lagging replica, so it still saw all 210,847 users as unsent. Both servers sent the full digest. Then our retry logic kicked in for the ~8% of emails that soft-failed on the first pass. Then the retry of the retry. By the time the job finished at 3:40 AM, every user had received between 9 and 14 emails.

  WHAT ACTUALLY HAPPENED
  ─────────────────────────────────────────────────────
  
  2:00:00 AM  Server A starts, reads 210,847 users (primary)
  2:00:00 AM  Server B starts, reads 210,847 users (lagging replica)
  
  2:00:03 AM  Server A marks users sent=true (primary)
  2:00:03 AM  Server B still sees sent=false (replica 3s behind)
  
  2:00:04 AM  Both servers begin sending → 421,694 emails
  
  2:00:30 AM  ~8% soft failures on both servers
  2:00:31 AM  Retry #1 → another ~16,868 emails per server (~33,735 total)
  2:01:00 AM  Retry #2 fires for partial failures
  
              ... continues until 3:40 AM ...
  
  Total emails sent: ~2.3 million
  Users affected: 210,847
  Average per user: 10.9 emails

The Discovery

I woke up to a Slack message from our CTO at 6:52 AM: "Have you seen support?" I hadn't. I opened the dashboard. We had a 4,847-ticket queue that hadn't existed when I went to bed. SendGrid's dashboard showed our bounce rate had spiked to 23% — anything over 5% is a red flag. Our account was already suspended pending review.

The first thing I did was grep the logs. The cron job timestamps told the story immediately:

# Server A logs
[2026-03-13 02:00:00] Starting nightly digest job
[2026-03-13 02:00:01] Querying users with pending activity...
[2026-03-13 02:00:01] Found 210,847 users. Beginning send.
[2026-03-13 02:00:03] Marked 210,847 users as sent=true
[2026-03-13 03:38:12] Job complete. Sent: 198,203. Failed: 12,644. Retried: 12,644.

# Server B logs (same timestamps)
[2026-03-13 02:00:00] Starting nightly digest job
[2026-03-13 02:00:01] Querying users with pending activity...
[2026-03-13 02:00:01] Found 210,847 users. Beginning send.
[2026-03-13 02:00:05] Marked 210,847 users as sent=true
[2026-03-13 03:40:31] Job complete. Sent: 199,041. Failed: 11,806. Retried: 11,806.

Both servers, same second, same user count. The replica lag meant neither could protect the other. Six months of luck, gone in one bad night.

The Fix: Distributed Locking

The correct solution is a distributed lock — a mechanism that ensures only one instance of a job runs at a time across all servers. We implemented it using Redis with a TTL-based lock pattern:

import redis
import time
import uuid

redis_client = redis.Redis(host='your-redis-host', decode_responses=True)

LOCK_KEY = 'cron:nightly_digest:lock'
LOCK_TTL = 7200  # 2 hours — max expected runtime + buffer

def run_nightly_digest():
    lock_id = str(uuid.uuid4())
    
    # SET NX EX: Set only if Not eXists, with EXpiry
    acquired = redis_client.set(LOCK_KEY, lock_id, nx=True, ex=LOCK_TTL)
    
    if not acquired:
        print("Lock already held. Another instance is running. Exiting.")
        return
    
    print(f"Lock acquired: {lock_id}")
    
    try:
        # ✅ Only ONE server will ever reach this point
        users = get_users_with_pending_activity()
        send_digest_emails(users)
        mark_users_sent(users)
    finally:
        # Release ONLY our lock (don't release if someone else somehow holds it)
        release_script = """
            if redis.call('get', KEYS[1]) == ARGV[1] then
                return redis.call('del', KEYS[1])
            else
                return 0
            end
        """
        redis_client.eval(release_script, 1, LOCK_KEY, lock_id)
        print(f"Lock released: {lock_id}")

run_nightly_digest()

The key details here:

  • SET NX EX is atomic — no race condition between checking and setting the lock
  • Lock includes a unique ID — so we only release locks we actually own
  • Lua script for release — check-and-delete is atomic; prevents releasing another server's lock if our TTL expired
  • TTL is generous — set it to 2x your worst-case runtime so a crashed server doesn't hold the lock forever
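The atomicity point is worth seeing concretely. The following toy in-process simulation (plain Python, not Redis) contrasts a naive check-then-set — which has the same race window our cron jobs hit — with an acquire that checks and sets in one indivisible step, as SET NX does:

```python
import threading
import time

class NaiveStore:
    """Check-then-set in two separate steps: both workers can 'win'."""
    def __init__(self):
        self.store = {}

    def acquire(self, key, val):
        if key not in self.store:   # step 1: check
            time.sleep(0.05)        # widen the race window for the demo
            self.store[key] = val   # step 2: set
            return True
        return False

class AtomicStore:
    """Mirrors Redis SET NX: check and set happen under one lock."""
    def __init__(self):
        self.store = {}
        self._mutex = threading.Lock()

    def acquire(self, key, val):
        with self._mutex:
            if key not in self.store:
                self.store[key] = val
                return True
            return False

def race(store):
    """Two 'servers' try to grab the same lock at the same instant."""
    wins = []
    barrier = threading.Barrier(2)

    def worker(name):
        barrier.wait()  # release both workers simultaneously
        if store.acquire("cron:nightly_digest:lock", name):
            wins.append(name)

    threads = [threading.Thread(target=worker, args=(n,)) for n in ("A", "B")]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return len(wins)

print("naive winners:", race(NaiveStore()))    # usually 2 — both think they hold it
print("atomic winners:", race(AtomicStore()))  # always 1
```

The naive version is exactly what a "SELECT then UPDATE" guard does across two servers; the atomic version is what the lock buys you.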

The Second Fix: Read From Primary

The distributed lock would have prevented the disaster even with replica lag. But we also fixed the underlying fragility: the digest job now reads directly from the Postgres primary, not the replica. For a once-a-night job, the added primary load is negligible. For any write-sensitive read, always target the primary.

# SQLAlchemy — explicitly target the primary for write-sensitive reads
from sqlalchemy import text

# engine is assumed to point at (or route to) the primary.
# postgresql_readonly=False marks the transaction read-write, so
# read-only routing layers won't send it to a replica.
with engine.connect().execution_options(
    postgresql_readonly=False
) as conn:
    users = conn.execute(
        text("SELECT id, email FROM users WHERE digest_sent_today = false")
    ).fetchall()

The Third Fix: Idempotency at the Row Level

Even with a distributed lock and primary reads, we added one more layer of protection: a database-level guard using UPDATE ... RETURNING to atomically claim users before sending their email. This means even if two processes somehow get through, only one can claim each user row.

-- Atomically claim users and return only unclaimed ones
-- No other process can claim the same rows simultaneously
UPDATE users
SET digest_claimed_at = NOW(), digest_claimed_by = 'server-a-job-id'
WHERE digest_sent_today = false
  AND digest_claimed_at IS NULL
RETURNING id, email;
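Here is the claim pattern in miniature, using an in-memory SQLite database as a stand-in for Postgres (SQLite supports UPDATE ... RETURNING from version 3.35; the table and column names match the article, the data is made up):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE users (
        id INTEGER PRIMARY KEY,
        email TEXT NOT NULL,
        digest_sent_today INTEGER NOT NULL DEFAULT 0,
        digest_claimed_at TEXT,
        digest_claimed_by TEXT
    )
""")
conn.executemany(
    "INSERT INTO users (email) VALUES (?)",
    [("a@example.com",), ("b@example.com",), ("c@example.com",)],
)

def claim_users(job_id):
    """Atomically claim every unclaimed user and return the claimed rows."""
    rows = conn.execute(
        """
        UPDATE users
        SET digest_claimed_at = datetime('now'), digest_claimed_by = ?
        WHERE digest_sent_today = 0
          AND digest_claimed_at IS NULL
        RETURNING id, email
        """,
        (job_id,),
    ).fetchall()
    conn.commit()
    return rows

first = claim_users("server-a-job-id")   # claims all three users
second = claim_users("server-b-job-id")  # nothing left to claim
print(len(first), len(second))           # 3 0
```

In Postgres the same statement is safe under concurrency: a second transaction's UPDATE blocks on the claimed rows, re-evaluates the WHERE clause after the first commits, and matches nothing.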

Damage Control

The technical fix took 40 minutes. The damage control took 3 weeks.

  • SendGrid: We called their support line directly. They reviewed our account, confirmed it was a technical error (our historical stats were clean), and reinstated us within 6 hours.
  • Blacklists: We submitted removal requests to Spamhaus, SORBS, and Barracuda. Spamhaus removed us in 48 hours. The others took 10–14 days. During that period, emails to some corporate inboxes (heavy Barracuda users) went straight to spam.
  • Users: We sent one apology email (irony noted). We offered 1 month free. Churn that month was 2.1% vs our usual 0.4%.
  • Domain reputation: Took about 3 weeks of low-volume, high-engagement sending to recover our sender score fully.

What to Watch For

If you're running scheduled jobs across multiple servers, ask yourself:

  1. Can two instances run simultaneously? If yes, is that safe?
  2. Do you read from replicas in write-sensitive flows? Replica lag is real.
  3. Is your retry logic bounded? Unbounded retries compound failures.
  4. Would you know if a job ran twice? Add a job execution log with server ID and timestamp.
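Point 4 deserves a sketch. A minimal job-execution log is one table plus two helpers — an illustrative schema, shown here on in-memory SQLite for portability:

```python
import sqlite3
import socket
from datetime import datetime, timezone

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE job_runs (
        id INTEGER PRIMARY KEY,
        job_name TEXT NOT NULL,
        server_id TEXT NOT NULL,
        started_at TEXT NOT NULL,
        finished_at TEXT,
        status TEXT
    )
""")

def record_start(job_name):
    """Insert a row the moment the job starts; returns the run id."""
    cur = conn.execute(
        "INSERT INTO job_runs (job_name, server_id, started_at) VALUES (?, ?, ?)",
        (job_name, socket.gethostname(), datetime.now(timezone.utc).isoformat()),
    )
    conn.commit()
    return cur.lastrowid

def record_finish(run_id, status):
    """Stamp the finish time and outcome on an existing run row."""
    conn.execute(
        "UPDATE job_runs SET finished_at = ?, status = ? WHERE id = ?",
        (datetime.now(timezone.utc).isoformat(), status, run_id),
    )
    conn.commit()

run_id = record_start("nightly_digest")
# ... job body runs here ...
record_finish(run_id, "success")

# A duplicate run is instantly visible: two rows with the same job_name,
# overlapping start/finish times, and different server_ids.
rows = conn.execute(
    "SELECT server_id, status FROM job_runs WHERE job_name = 'nightly_digest'"
).fetchall()
```

Had we had this table, a nightly "two rows for nightly_digest" alert would have flagged the duplicate runs on day one, not month six.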

Distributed locks aren't glamorous. Neither is spending three weeks rebuilding your domain reputation. One Redis SET NX EX would have cost us nothing. The race condition cost us three weeks, a domain blacklisting, and a noticeable churn spike.

Add the lock before you need it.
