We Deployed on a Friday. Here's What Happened Next.
March 5, 2026 · CI/CD · 9 min read

It was 4:30 PM on a Friday. The ticket was small — a config change, two lines of YAML, a deployment that had worked in staging three times. My lead signed off. I clicked deploy. By 10 PM I was still at my desk, and the on-call phone had rung seven times.

That was four years ago. I haven't deployed to production on a Friday since. Not because I became superstitious, but because that night taught me something I couldn't learn from documentation: automation isn't just about speed — it's about making human mistakes structurally impossible.

The Setup

We were running a mid-size SaaS platform — about 40,000 active users, a PHP monolith slowly being strangled by a growing set of Node microservices, all glued together behind an Nginx reverse proxy. Our CI/CD pipeline was... functional. GitHub push triggered a Jenkins build, tests ran, Docker image got tagged, a shell script SSHed into the production box and ran docker-compose up -d. Artisanal. Lovingly hand-crafted. Deeply fragile.

The change I was deploying: a new environment variable that pointed our notification service at a different queue endpoint. Staging worked. UAT worked. The variable was in the .env.production file. What could go wrong?

  BEFORE DEPLOY (Expected)
  ─────────────────────────────────────────────────────
  GitHub Push
       │
       ▼
  Jenkins Build
       │
       ▼
  Run Tests ──── PASS
       │
       ▼
  Build Docker Image
       │
       ▼
  SSH → docker-compose up -d
       │
       ▼
  ✅  Done. Go home.


  WHAT ACTUALLY HAPPENED
  ─────────────────────────────────────────────────────
  GitHub Push
       │
       ▼
  Jenkins Build
       │
       ▼
  Run Tests ──── PASS (tests don't read .env.production)
       │
       ▼
  Build Docker Image (image bakes in OLD env snapshot)
       │
       ▼
  SSH → docker-compose up -d
       │
       ▼
  New container starts with MISSING env var
       │
       ▼
  Notification service silently swallows queue errors
       │
       ▼
  😱  Users stop receiving emails. Nobody knows yet.

The Silence That Screams

The worst production incidents aren't the loud ones. The loud ones — 500 errors, crashes, pages going white — those get caught immediately. Alerts fire, users complain, you know within minutes.

This was the other kind. Everything looked fine. The deploy succeeded. Green checkmark in Jenkins. Response times normal. Error rate: zero. CPU: nominal. The notification service was running. It was just... quietly not delivering anything, logging the failures to a file nobody was watching, and returning success codes anyway because the original developer had wrapped the queue call in a broad try-catch that ate the exception.

notification-service/queue.js — the silent killer
// What we had
async function enqueueNotification(payload) {
  try {
    await queueClient.send(payload);
    return { success: true };
  } catch (err) {
    // TODO: add proper error handling
    logger.warn('Queue send failed', err.message);
    return { success: true }; // ← lied about success to not break callers
  }
}

// What we needed
async function enqueueNotification(payload) {
  const result = await queueClient.send(payload); // let it throw
  metrics.increment('queue.send.success');
  return result;
}

// And a dead letter queue handler that actually alerts:
queueClient.on('error', (err) => {
  metrics.increment('queue.send.error');
  alerts.fire('QUEUE_SEND_FAILURE', { err, severity: 'critical' });
});

We discovered the issue at 7:15 PM — not from monitoring, but because a user emailed support saying their password reset link never arrived. Support checked three more accounts. Same story. Someone pinged me. I checked the logs. My stomach dropped.

"Emails have been failing since 4:47 PM. That's two hours and twenty-eight minutes of silent failure, approximately 1,400 undelivered notifications, and zero alerts fired."

The Rollback That Wasn't

Here's where it got worse. Our "rollback" procedure was: SSH into the box, pull the previous Docker image tag, run docker-compose up again. Thirty seconds, right?

Except the previous image tag was latest. We hadn't been tagging images with commit SHAs. The "previous" image was whatever had been in the registry before the build — which turned out to be a build from three weeks ago that had a different database migration state.

  OUR "ROLLBACK" PROCESS
  ──────────────────────────────────────────────────
  Tag Strategy:   latest ← overwrites on every build

  Timeline:
  Week 1    [build] → :latest (v1)
  Week 2    [build] → :latest (v2, overwrites v1)
  Week 3    [build] → :latest (v3, overwrites v2)
  Friday    [build] → :latest (v4, broken)

  Rollback attempt → pulls :latest → gets v4 (same broken build)
  
  ❌ No previous image available. Rollback impossible.


  WHAT WE SHOULD HAVE HAD
  ──────────────────────────────────────────────────
  Tag Strategy:   commit SHA + semver + latest alias

  [build] → :abc1234 + :v2.4.1 + :latest
  [build] → :def5678 + :v2.4.2 + :latest

  Rollback → docker pull myapp:abc1234
  
  ✅ Any previous version instantly available.

We ended up doing a manual hotfix — patched the env var directly on the server, restarted the container, verified notifications were flowing. Six hours from deploy to resolution.

The Rebuild

The following week, I rewrote the entire pipeline. Not because anyone asked me to — because I couldn't sleep knowing it could happen again. Here's what changed:

  • Image tagging: Every build tagged with git rev-parse --short HEAD. The :latest tag still exists but is an alias, never the only tag.
  • Environment validation: A startup script that reads a .env.required manifest and fails loudly if any variable is missing or empty before the app boots (sketched right after this list).
  • Deployment windows: A GitHub Actions check that fails the deploy job if the current time is Friday after 3 PM or Saturday/Sunday. Enforced, not advisory.
  • Smoke tests post-deploy: After every deploy, a script hits 12 critical endpoints and checks response codes + response shape. If anything fails, auto-rollback triggers. (There's a sketch of this script after the workflow below.)
  • Dead letter queues with alerts: Any queue failure now fires a PagerDuty alert within 60 seconds, not 2.5 hours.
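
The environment check deserves a closer look, because it's the one that would have caught that Friday's bug before a single user noticed. In spirit it's about twenty lines of shell. Here's a sketch of the shape — the one-name-per-line manifest format, the entrypoint-wrapper pattern, and the paths are illustrative, not a verbatim copy of our script:

scripts/validate-env.sh — fail-fast startup check (sketch)
#!/usr/bin/env bash
# Sketch of a fail-fast entrypoint wrapper: refuse to boot if any variable
# listed in .env.required is missing or empty, then hand off to the real command.
set -euo pipefail

MANIFEST=".env.required"   # one variable name per line; lines starting with # are ignored

missing=()
while IFS= read -r name; do
  [[ -z "$name" || "$name" == \#* ]] && continue      # skip blanks and comments
  # ${!name} is indirect expansion: the value of the variable whose name is in $name
  if [ -z "${!name:-}" ]; then
    missing+=("$name")
  fi
done < "$MANIFEST"

if [ "${#missing[@]}" -gt 0 ]; then
  echo "❌ Refusing to start: missing or empty env vars: ${missing[*]}" >&2
  exit 1
fi

echo "✅ All variables in $MANIFEST are set."
exec "$@"   # e.g. ENTRYPOINT ["./scripts/validate-env.sh", "node", "server.js"]

The deployment-window gate and the SHA tagging live in the workflow itself:
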
.github/workflows/deploy.yml — deployment gate
jobs:
  check-deploy-window:
    runs-on: ubuntu-latest
    steps:
      - name: Enforce deployment window
        run: |
          DAY=$(date +%u)   # 1=Mon ... 7=Sun (runner clock is UTC)
          HOUR=$(date +%H)  # 00-23 UTC; 09:30 UTC = 3 PM IST

          # Block Friday from 09:00 UTC (just before 3 PM IST) and the whole
          # weekend — not just weekend hours after the cutoff.
          if { [ "$DAY" -eq 5 ] && [ "$HOUR" -ge 9 ]; } || [ "$DAY" -ge 6 ]; then
            echo "❌ Deploys blocked: Friday after 3 PM IST or weekend."
            echo "   Open a break-glass PR to override (requires 2 approvals)."
            exit 1
          fi
          echo "✅ Deploy window is open."

  deploy:
    needs: [check-deploy-window, test]
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4   # git rev-parse below needs the repo checked out

      - name: Build and tag image
        run: |
          SHA=$(git rev-parse --short HEAD)
          docker build -t myapp:$SHA -t myapp:latest .
          docker push myapp:$SHA
          docker push myapp:latest
          echo "IMAGE_TAG=$SHA" >> $GITHUB_ENV

      - name: Deploy
        run: ./scripts/deploy.sh ${{ env.IMAGE_TAG }}

      - name: Smoke test
        run: ./scripts/smoke-test.sh
        # On failure, this job fails and deploy.sh rollback hook fires
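
The smoke-test script is the other half of that safety net. Ours walks 12 critical endpoints and checks both status codes and response shape; the sketch below shows the idea with a placeholder base URL and three placeholder paths — the real list is specific to our API. The important part is the exit code: anything non-zero fails the job, and that's what trips the rollback.

scripts/smoke-test.sh — post-deploy sanity check (sketch)
#!/usr/bin/env bash
# Hit critical endpoints after a deploy; fail loudly if status or body shape looks wrong.
set -euo pipefail

BASE_URL="${BASE_URL:-https://app.example.com}"   # placeholder; injected by the deploy job

# Format: path|expected status|string that must appear in the body
CHECKS=(
  "/healthz|200|ok"
  "/api/notifications/health|200|queue"
  "/api/auth/session|200|anonymous"
)

failures=0
for check in "${CHECKS[@]}"; do
  IFS='|' read -r path want_status want_body <<< "$check"

  # One request per endpoint: body, then the status code on the last line
  response=$(curl -sS -m 10 -w $'\n%{http_code}' "$BASE_URL$path" || echo $'\n000')
  status=${response##*$'\n'}
  body=${response%$'\n'*}

  if [ "$status" != "$want_status" ] || ! grep -q "$want_body" <<< "$body"; then
    echo "❌ $path → got $status (wanted $want_status) or body check '$want_body' failed"
    failures=$((failures + 1))
  else
    echo "✅ $path"
  fi
done

exit "$failures"   # non-zero exit fails the workflow step, which triggers the rollback path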

What I Learned That Couldn't Come From a Book

The incident, by the numbers:

  • 1,400+ undelivered notifications
  • 6 hours from deploy to resolution
  • 0 alerts fired during the incident
  • 2 weeks to rebuild the pipeline properly

The Friday rule gets mocked. Engineers call it superstition. It isn't. It's a forcing function — it makes you ask "is this urgent enough to deploy right now, or can it wait until Monday?" Almost always, the answer is Monday. And if it genuinely can't wait, you have the break-glass process. You've made the risk explicit and required two people to agree on it.

The silent failure pattern is the dangerous one. Noisy failures are healthy — they're feedback. Silent failures erode trust in your system and are three times harder to diagnose because by the time you find them, you've lost the causal proximity to the change that caused them.

And rollbacks are only real if you can execute them in under five minutes without consulting a runbook. If your rollback procedure requires careful thought, it will fail you under pressure.

— Built the hard way, so the next time doesn't have to be.