We Deployed on a Friday. Here's What Happened Next.
It was 4:30 PM on a Friday. The ticket was small — a config change, two lines of YAML, a deployment that had worked in staging three times. My lead signed off. I clicked deploy. By 10 PM I was still at my desk, and the on-call phone had rung seven times.
That was four years ago. I haven't deployed to production on a Friday since. Not because I became superstitious, but because that night taught me something I couldn't learn from documentation: automation isn't just about speed — it's about making human mistakes structurally impossible.
The Setup
We were running a mid-size SaaS platform — about 40,000 active users, a PHP monolith slowly
being strangled by a growing set of Node microservices, all glued together behind an Nginx
reverse proxy. Our CI/CD pipeline was... functional. GitHub push triggered a Jenkins build,
tests ran, Docker image got tagged, a shell script SSHed into the production box and ran
docker-compose up -d. Artisanal. Lovingly hand-crafted. Deeply fragile.
The change I was deploying: a new environment variable that pointed our notification service
at a different queue endpoint. Staging worked. UAT worked. The variable was in the
.env.production file. What could go wrong?
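The wiring, roughly, looked like this — a reconstruction with invented service and variable names, not our literal compose file:

```yaml
# docker-compose.yml fragment (hypothetical names).
# We assumed `docker-compose up -d` would hand the new variable
# straight to the container at deploy time.
services:
  notification-service:
    image: myapp:latest
    env_file:
      - .env.production   # the new NOTIFY_QUEUE_URL line went into this file
```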
BEFORE DEPLOY (Expected)
─────────────────────────────────────────────────────
GitHub Push
│
▼
Jenkins Build
│
▼
Run Tests ──── PASS
│
▼
Build Docker Image
│
▼
SSH → docker-compose up -d
│
▼
✅ Done. Go home.
WHAT ACTUALLY HAPPENED
─────────────────────────────────────────────────────
GitHub Push
│
▼
Jenkins Build
│
▼
Run Tests ──── PASS (tests don't read .env.production)
│
▼
Build Docker Image (image bakes in OLD env snapshot)
│
▼
SSH → docker-compose up -d
│
▼
New container starts with MISSING env var
│
▼
Notification service silently swallows queue errors
│
▼
😱 Users stop receiving emails. Nobody knows yet.
The Silence That Screams
The worst production incidents aren't the loud ones. The loud ones — 500 errors, crashes, pages going white — those get caught immediately. Alerts fire, users complain, you know within minutes.
This was the other kind. Everything looked fine. The deploy succeeded. Green checkmark in Jenkins. Response times normal. Error rate: zero. CPU: nominal. The notification service was running. It was just... quietly not delivering anything, logging the failures to a file nobody was watching, and returning success codes anyway because the original developer had wrapped the queue call in a broad try-catch that ate the exception.
// What we had
async function enqueueNotification(payload) {
  try {
    await queueClient.send(payload);
    return { success: true };
  } catch (err) {
    // TODO: add proper error handling
    logger.warn('Queue send failed', err.message);
    return { success: true }; // ← lied about success to not break callers
  }
}

// What we needed
async function enqueueNotification(payload) {
  const result = await queueClient.send(payload); // let it throw
  metrics.increment('queue.send.success');
  return result;
}

// And a dead letter queue handler that actually alerts:
queueClient.on('error', (err) => {
  metrics.increment('queue.send.error');
  alerts.fire('QUEUE_SEND_FAILURE', { err, severity: 'critical' });
});
We discovered the issue at 7:15 PM — not from monitoring, but because a user emailed support saying their password reset link never arrived. Support checked three more accounts. Same story. Someone pinged me. I checked the logs. My stomach dropped.
"Emails have been failing since 4:47 PM. That's two hours and twenty-eight minutes of silent failure, approximately 1,400 undelivered notifications, and zero alerts fired."
The Rollback That Wasn't
Here's where it got worse. Our "rollback" procedure was: SSH into the box, pull the previous Docker image tag, run docker-compose up again. Thirty seconds, right?
Except the previous image tag was latest. We hadn't been tagging images with
commit SHAs. The "previous" image was whatever had been in the registry before the build —
which turned out to be a build from three weeks ago that had a different database migration
state.
OUR "ROLLBACK" PROCESS
──────────────────────────────────────────────────
Tag Strategy: latest ← overwrites on every build

Timeline:
  Week 1  [build] → :latest (v1)
  Week 2  [build] → :latest (v2, overwrites v1)
  Week 3  [build] → :latest (v3, overwrites v2)
  Friday  [build] → :latest (v4, broken)

  Rollback attempt → pulls :latest → gets v4 (same broken build)
  ❌ No previous image available. Rollback impossible.

WHAT WE SHOULD HAVE HAD
──────────────────────────────────────────────────
Tag Strategy: commit SHA + semver + latest alias

  [build] → :abc1234 + :v2.4.1 + :latest
  [build] → :def5678 + :v2.4.2 + :latest

  Rollback → docker pull myapp:abc1234
  ✅ Any previous version instantly available.
We ended up doing a manual hotfix — patched the env var directly on the server, restarted the container, verified notifications were flowing. Six hours from deploy to resolution.
The Rebuild
The following week, I rewrote the entire pipeline. Not because anyone asked me to — because I couldn't sleep knowing it could happen again. Here's what changed:
- Image tagging: Every build is tagged with the short commit SHA from git rev-parse --short HEAD. The :latest tag still exists, but as an alias, never the only tag.
- Environment validation: A startup script reads a .env.required manifest and fails loudly if any variable is missing or empty before the app boots.
- Deployment windows: A GitHub Actions check that fails the deploy job if the current time is Friday after 3 PM or Saturday/Sunday. Enforced, not advisory.
- Smoke tests post-deploy: After every deploy, a script hits 12 critical endpoints and checks response codes + response shape. If anything fails, auto-rollback triggers.
- Dead letter queues with alerts: Any queue failure now fires a PagerDuty alert within 60 seconds, not 2.5 hours.
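The environment-validation step is the one I'd urge anyone to steal first. A minimal sketch in bash — the .env.required file name is ours, but the implementation details here are assumptions:

```shell
#!/usr/bin/env bash
# validate-env.sh (sketch) — refuse to boot if any required variable is absent.
# .env.required is a plain manifest, one variable name per line, e.g.:
#   NOTIFY_QUEUE_URL
#   DATABASE_URL
set -euo pipefail

validate_env() {
  local manifest="$1" missing=0
  while IFS= read -r var; do
    [ -z "$var" ] && continue               # skip blank lines
    if [ -z "${!var:-}" ]; then             # unset OR empty both fail
      echo "FATAL: required env var '$var' is missing or empty" >&2
      missing=1
    fi
  done < "$manifest"
  return "$missing"
}

# Run before exec-ing the app, e.g. in the container entrypoint:
# validate_env .env.required && exec node server.js
```

The crucial property is that it fails before the app binds a port, so a half-configured deploy dies where the orchestrator can see it instead of quietly serving traffic.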
jobs:
  check-deploy-window:
    runs-on: ubuntu-latest
    steps:
      - name: Enforce deployment window
        run: |
          DAY=$(date +%u)   # 1=Mon ... 7=Sun
          HOUR=$(date +%H)  # 00-23 UTC (adjust for your TZ)
          # Block all of Sat/Sun, plus Friday from 10:00 UTC (~3 PM IST)
          if [ "$DAY" -ge 6 ] || { [ "$DAY" -eq 5 ] && [ "$HOUR" -ge 10 ]; }; then
            echo "❌ Deploys blocked: Friday after 3 PM IST or weekend."
            echo "   Open a break-glass PR to override (requires 2 approvals)."
            exit 1
          fi
          echo "✅ Deploy window is open."

  deploy:
    needs: [check-deploy-window, test]
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4   # needed so git rev-parse has a repo
      - name: Build and tag image
        run: |
          SHA=$(git rev-parse --short HEAD)
          docker build -t myapp:$SHA -t myapp:latest .
          docker push myapp:$SHA
          docker push myapp:latest
          echo "IMAGE_TAG=$SHA" >> $GITHUB_ENV
      - name: Deploy
        run: ./scripts/deploy.sh ${{ env.IMAGE_TAG }}
      - name: Smoke test
        run: ./scripts/smoke-test.sh
        # On failure, this job fails and deploy.sh rollback hook fires
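For completeness, here's roughly what scripts/smoke-test.sh can look like — the endpoint paths and expected codes below are placeholders, not our real twelve:

```shell
#!/usr/bin/env bash
# smoke-test.sh (sketch) — hit critical endpoints, fail the job on any miss.
set -euo pipefail

BASE_URL="${BASE_URL:-https://app.example.com}"   # assumed env var

check() {  # usage: check <path> <expected-status>
  local path="$1" expected="$2" status
  status=$(curl -s -o /dev/null -w '%{http_code}' "$BASE_URL$path")
  if [ "$status" != "$expected" ]; then
    echo "SMOKE FAIL: $path returned $status, expected $expected" >&2
    return 1
  fi
  echo "ok: $path ($status)"
}

# The real script listed 12 endpoints and also piped bodies through jq
# to verify response shape, e.g.:
#   check /healthz            200
#   check /api/v1/status      200
```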
What I Learned That Couldn't Come From a Book
The Friday rule gets mocked. Engineers call it superstition. It isn't. It's a forcing function — it makes you ask "is this urgent enough to deploy right now, or can it wait until Monday?" Almost always, the answer is Monday. And if it genuinely can't wait, you have the break-glass process. You've made the risk explicit and required two people to agree on it.
The silent failure pattern is the dangerous one. Noisy failures are healthy — they're feedback. Silent failures erode trust in your system and are far harder to diagnose, because by the time you find them you've lost the causal proximity to the change that caused them.
And rollbacks are only real if you can execute them in under five minutes without consulting a runbook. If your rollback procedure requires careful thought, it will fail you under pressure.