How a Race Condition in Our Cron Job Sent 2.3 Million Duplicate Emails in One Night
At 6:47 AM, our support inbox had 4,800 unread tickets. Every single one was a variation of the same message: "Why did I receive the same email 11 times?" By 7:15 AM, SendGrid had suspended our account. By 8:00 AM, our domain was on three major blacklists. The cause? A cron job. Running on two servers. With no lock. For six months.
We hadn't noticed because most nights it didn't matter. The race window was about 200ms, and one server always won it. That stayed harmless right up until the night replica lag stretched the window to several seconds.
The Architecture (Such As It Was)
Our platform sent a nightly email digest to around 210,000 active users — a summary of activity in their workspace over the previous 24 hours. The job was a Python script in our Flask codebase that ran at 2:00 AM UTC via cron, queried Postgres for users with pending activity, and fired off emails through SendGrid's batch API.
Six months earlier, we'd horizontally scaled our application tier from one server to two for redundancy. We added the cron job to both servers' crontabs — because "if one goes down, the other keeps running." This logic was correct in spirit. Catastrophic in practice.
THE SETUP (simplified)
─────────────────────────────────────────────────────
  Server A (cron)                 Server B (cron)
        │                               │
        │        2:00:00 AM UTC         │
        ├──────────────────────────────►│   Both fire simultaneously
        │                               │
        ▼                               ▼
  SELECT users WHERE              SELECT users WHERE
  has_activity = true             has_activity = true
        │                               │
        ├───── Same 210,847 rows ───────┤
        │                               │
        ▼                               ▼
  Send 210,847 emails             Send 210,847 emails

Result: 421,694 emails sent (2x)

But on the bad night:
  Retry logic + partial failures = 11x per user
  Total: ~2.3 million emails
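The double-send in the diagram is easy to reproduce in miniature. The sketch below is hypothetical (an in-memory dict stands in for Postgres, a list for SendGrid), but the read-then-send sequence matches the job's:

```python
import threading

# Five in-memory users stand in for the Postgres users table
users = {f"user{i}": {"has_activity": True, "sent": False} for i in range(5)}
emails_sent = []
barrier = threading.Barrier(2)

def digest_job(server):
    # Step 1: query pending users. Both servers see the same snapshot.
    pending = [u for u, row in users.items()
               if row["has_activity"] and not row["sent"]]
    barrier.wait()  # both crons fire at 2:00:00 AM; neither has marked anyone yet
    # Step 2: send, then mark sent. Too late: the other server read the same rows.
    for u in pending:
        emails_sent.append((server, u))
        users[u]["sent"] = True

threads = [threading.Thread(target=digest_job, args=(s,)) for s in ("A", "B")]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(len(emails_sent))  # 10 sends for 5 users: every digest went out twice
```

The barrier forces the worst-case interleaving every time; in production, cron's to-the-second alignment did the same job.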
Why It Was Fine Until It Wasn't
On most nights, Server A won the race by a few hundred milliseconds — it marked users' digests as sent = true in Postgres before Server B's SELECT ran. Server B would query an empty result set and exit cleanly. Total duplicate emails: zero. We had no idea we were one slow database query away from disaster.
The night it failed, our primary Postgres replica was lagging behind by about 3 seconds due to an unrelated heavy analytics query. Server A marked users as sent — but Server B was reading from the lagging replica. It saw all 210,847 users as unsent. Both servers sent the full digest. Then our retry logic kicked in for the ~8% of emails that soft-failed on the first pass. Then the retry of the retry. By the time the job finished at 3:40 AM, every user had received between 9 and 14 emails.
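Replica lag like this is cheap to watch for. On standard Postgres streaming replication, the following query, run on the replica, reports how far behind replay is; alerting on it would have flagged the 3-second lag before the job fired:

```sql
-- Run on the replica: wall-clock age of the last replayed transaction
SELECT now() - pg_last_xact_replay_timestamp() AS replication_lag;
```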
WHAT ACTUALLY HAPPENED
─────────────────────────────────────────────────────
2:00:00 AM   Server A starts, reads 210,847 users (primary)
2:00:00 AM   Server B starts, reads 210,847 users (lagging replica)
2:00:03 AM   Server A marks users sent=true (primary)
2:00:03 AM   Server B still sees sent=false (replica 3s behind)
2:00:04 AM   Both servers begin sending → 421,694 emails
2:00:30 AM   ~8% soft failures on both servers
2:00:31 AM   Retry #1 → another 33,735 emails across both servers
2:01:00 AM   Retry #2 fires for partial failures
...          continues until 3:40 AM

Total emails sent:  ~2.3 million
Users affected:     210,847
Average per user:   10.9 emails
The Discovery
I woke up to a Slack message from our CTO at 6:52 AM: "Have you seen support?" I hadn't. I opened the dashboard. We had a 4,847-ticket queue that hadn't existed when I went to bed. SendGrid's dashboard showed our bounce rate had spiked to 23% — anything over 5% is a red flag. Our account was already suspended pending review.
The first thing I did was grep the logs. The cron job timestamps told the story immediately:
# Server A logs
[2026-03-13 02:00:00] Starting nightly digest job
[2026-03-13 02:00:01] Querying users with pending activity...
[2026-03-13 02:00:01] Found 210,847 users. Beginning send.
[2026-03-13 02:00:03] Marked 210,847 users as sent=true
[2026-03-13 03:38:12] Job complete. Sent: 198,203. Failed: 12,644. Retried: 12,644.
# Server B logs (same timestamps)
[2026-03-13 02:00:00] Starting nightly digest job
[2026-03-13 02:00:01] Querying users with pending activity...
[2026-03-13 02:00:01] Found 210,847 users. Beginning send.
[2026-03-13 02:00:05] Marked 210,847 users as sent=true
[2026-03-13 03:40:31] Job complete. Sent: 199,041. Failed: 11,806. Retried: 11,806.
Both servers, same second, same user count. The replica lag meant neither could protect the other. Six months of luck, gone in one bad night.
The Fix: Distributed Locking
The correct solution is a distributed lock — a mechanism that ensures only one instance of a job runs at a time across all servers. We implemented it using Redis with a TTL-based lock pattern:
import redis
import uuid

redis_client = redis.Redis(host='your-redis-host', decode_responses=True)

LOCK_KEY = 'cron:nightly_digest:lock'
LOCK_TTL = 7200  # 2 hours — max expected runtime + buffer

def run_nightly_digest():
    lock_id = str(uuid.uuid4())

    # SET NX EX: Set only if Not eXists, with EXpiry — a single atomic operation
    acquired = redis_client.set(LOCK_KEY, lock_id, nx=True, ex=LOCK_TTL)
    if not acquired:
        print("Lock already held. Another instance is running. Exiting.")
        return

    print(f"Lock acquired: {lock_id}")
    try:
        # ✅ Only ONE server will ever reach this point
        users = get_users_with_pending_activity()
        send_digest_emails(users)
        mark_users_sent(users)
    finally:
        # Release ONLY our lock (don't release one another server acquired
        # after our TTL expired)
        release_script = """
        if redis.call('get', KEYS[1]) == ARGV[1] then
            return redis.call('del', KEYS[1])
        else
            return 0
        end
        """
        redis_client.eval(release_script, 1, LOCK_KEY, lock_id)
        print(f"Lock released: {lock_id}")

if __name__ == '__main__':
    run_nightly_digest()
The key details here:
- SET NX EX is atomic — no race condition between checking and setting the lock
- Lock includes a unique ID — so we only release locks we actually own
- Lua script for release — check-and-delete is atomic; prevents releasing another server's lock if our TTL expired
- TTL is generous — set it to 2x your worst-case runtime so a crashed server doesn't hold the lock forever
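The acquire/release semantics are easy to sanity-check without a live Redis. The class below is not real Redis — it's a minimal in-memory stand-in mirroring just the calls the lock uses — but it shows the shape of the guarantee: the second acquirer bounces, and release only succeeds for the holder:

```python
import uuid

class FakeRedis:
    """In-memory stand-in for the two Redis operations the lock needs."""
    def __init__(self):
        self.store = {}

    def set(self, key, value, nx=False, ex=None):
        if nx and key in self.store:
            return None          # Redis returns nil when NX fails
        self.store[key] = value
        return True

    def get(self, key):
        return self.store.get(key)

    def delete(self, key):
        self.store.pop(key, None)

r = FakeRedis()
LOCK_KEY = "cron:nightly_digest:lock"
id_a, id_b = str(uuid.uuid4()), str(uuid.uuid4())

assert r.set(LOCK_KEY, id_a, nx=True, ex=7200) is True   # Server A wins
assert r.set(LOCK_KEY, id_b, nx=True, ex=7200) is None   # Server B exits cleanly

# Safe release: delete only if we still own the lock (the Lua script's logic)
if r.get(LOCK_KEY) == id_a:
    r.delete(LOCK_KEY)
assert r.get(LOCK_KEY) is None
```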
The Second Fix: Read From Primary
The distributed lock would have prevented the disaster even with replica lag. But we also fixed the underlying fragility: the digest job now reads directly from the Postgres primary, not the replica. For a once-a-night job, the added primary load is negligible. For any write-sensitive read, you should always target primary.
# SQLAlchemy — route write-sensitive reads through a primary-only engine,
# separate from the replica-backed engine used for analytics reads
from sqlalchemy import create_engine, text

primary_engine = create_engine(PRIMARY_DATABASE_URL)

with primary_engine.connect() as conn:
    users = conn.execute(
        text("SELECT id, email FROM users WHERE digest_sent_today = false")
    ).fetchall()
The Third Fix: Idempotency at the Row Level
Even with a distributed lock and primary reads, we added one more layer of protection: a database-level guard using UPDATE ... RETURNING to atomically claim users before sending their email. This means even if two processes somehow get through, only one can claim each user row.
-- Atomically claim users and return only unclaimed ones
-- No other process can claim the same rows simultaneously
UPDATE users
SET digest_claimed_at = NOW(), digest_claimed_by = 'server-a-job-id'
WHERE digest_sent_today = false
AND digest_claimed_at IS NULL
RETURNING id, email;
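The claim pattern can be exercised locally with SQLite, which supports RETURNING as of 3.35 (in production this runs against Postgres, where row-level locking makes the guarantee airtight under true concurrency). The table and column names below mirror ours, but the snippet is a self-contained illustration, not our schema:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE users (
        id INTEGER PRIMARY KEY,
        email TEXT,
        digest_sent_today INTEGER DEFAULT 0,
        digest_claimed_by TEXT
    )
""")
conn.executemany("INSERT INTO users (email) VALUES (?)",
                 [(f"u{i}@example.com",) for i in range(4)])

def claim_users(job_id):
    # Atomically claim every unclaimed row and get the claimed set back
    rows = conn.execute(
        """UPDATE users
           SET digest_claimed_by = ?
           WHERE digest_sent_today = 0 AND digest_claimed_by IS NULL
           RETURNING id, email""",
        (job_id,)
    ).fetchall()
    conn.commit()
    return rows

first = claim_users("server-a-job-id")   # claims all 4 users
second = claim_users("server-b-job-id")  # finds nothing left to claim
print(len(first), len(second))  # 4 0
```

Whichever job runs second gets zero rows back and sends zero emails, no matter how it got past the lock.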
Damage Control
The technical fix took 40 minutes. The damage control took 3 weeks.
- SendGrid: We called their support line directly. They reviewed our account, confirmed it was a technical error (our historical stats were clean), and reinstated us within 6 hours.
- Blacklists: We submitted removal requests to Spamhaus, SORBS, and Barracuda. Spamhaus removed us in 48 hours. The others took 10–14 days. During that period, emails to some corporate inboxes (heavy Barracuda users) went straight to spam.
- Users: We sent one apology email (irony noted). We offered 1 month free. Churn that month was 2.1% vs our usual 0.4%.
- Domain reputation: Took about 3 weeks of low-volume, high-engagement sending to recover our sender score fully.
What to Watch For
If you're running scheduled jobs across multiple servers, ask yourself:
- Can two instances run simultaneously? If yes, is that safe?
- Do you read from replicas in write-sensitive flows? Replica lag is real.
- Is your retry logic bounded? Unbounded retries compound failures.
- Would you know if a job ran twice? Add a job execution log with server ID and timestamp.
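On the bounded-retries point: the shape to aim for is a fixed attempt budget with exponential backoff. The sketch below uses a hypothetical send_fn rather than any real SendGrid wrapper; the point is the hard cap, which guarantees a failing email costs at most max_attempts sends instead of compounding all night:

```python
import time

def send_with_bounded_retry(send_fn, payload, max_attempts=3, base_delay=1.0):
    """Retry soft failures at most max_attempts times with exponential backoff.
    Returns True on success, False once the budget is spent; it never loops forever."""
    for attempt in range(max_attempts):
        if send_fn(payload):
            return True
        if attempt < max_attempts - 1:
            time.sleep(base_delay * 2 ** attempt)  # 1s, 2s, 4s, ...
    return False

# Hypothetical flaky sender: soft-fails twice, succeeds on the third try
calls = {"n": 0}
def flaky_send(payload):
    calls["n"] += 1
    return calls["n"] >= 3

ok = send_with_bounded_retry(flaky_send, {"to": "user@example.com"}, base_delay=0)
print(ok, calls["n"])  # True 3
```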
Distributed locks aren't glamorous. Neither is spending three weeks rebuilding your domain reputation. One Redis SET NX EX would have cost us nothing. The race condition cost us three weeks, a domain blacklisting, and a noticeable churn spike.
Add the lock before you need it.