We Deployed on a Friday. Here's What Happened Next.
It was 4:30 PM on a Friday. The ticket was small — a config change, two lines of YAML, a deployment that had worked in staging three times. My lead signed off. I clicked deploy. By 10 PM I was still at my desk, and the on-call phone had rung seven times.
That was four years ago. I haven't deployed to production on a Friday since. Not because I became superstitious, but because that night taught me something I couldn't learn from documentation: automation isn't just about speed — it's about making human mistakes structurally impossible.
The Setup
We were running a mid-size SaaS platform — about 40,000 active users, a PHP monolith slowly
being strangled by a growing set of Node microservices, all glued together behind an Nginx
reverse proxy. Our CI/CD pipeline was... functional. GitHub push triggered a Jenkins build,
tests ran, Docker image got tagged, a shell script SSHed into the production box and ran
docker-compose up -d. Artisanal. Lovingly hand-crafted. Deeply fragile.
The change I was deploying: a new environment variable that pointed our notification service
at a different queue endpoint. Staging worked. UAT worked. The variable was in the
.env.production file. What could go wrong?
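The wiring, roughly, looked like this — a reconstruction with invented service and variable names, not our literal compose file:

```yaml
# docker-compose.yml fragment (hypothetical names).
# We assumed `docker-compose up -d` would hand the new variable
# straight to the container at deploy time.
services:
  notification-service:
    image: myapp:latest
    env_file:
      - .env.production   # the new NOTIFY_QUEUE_URL line went into this file
```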
BEFORE DEPLOY (Expected)
─────────────────────────────────────────────────────
GitHub Push
│
▼
Jenkins Build
│
▼
Run Tests ──── PASS
│
▼
Build Docker Image
│
▼
SSH → docker-compose up -d
│
▼
✅ Done. Go home.
WHAT ACTUALLY HAPPENED
─────────────────────────────────────────────────────
GitHub Push
│
▼
Jenkins Build
│
▼
Run Tests ──── PASS (tests don't read .env.production)
│
▼
Build Docker Image (image bakes in OLD env snapshot)
│
▼
SSH → docker-compose up -d
│
▼
New container starts with MISSING env var
│
▼
Notification service silently swallows queue errors
│
▼
😱 Users stop receiving emails. Nobody knows yet.
The Silence That Screams
The worst production incidents aren't the loud ones. The loud ones — 500 errors, crashes, pages going white — those get caught immediately. Alerts fire, users complain, you know within minutes.
This was the other kind. Everything looked fine. The deploy succeeded. Green checkmark in Jenkins. Response times normal. Error rate: zero. CPU: nominal. The notification service was running. It was just... quietly not delivering anything, logging the failures to a file nobody was watching, and returning success codes anyway because the original developer had wrapped the queue call in a broad try-catch that ate the exception.
// What we had
async function enqueueNotification(payload) {
  try {
    await queueClient.send(payload);
    return { success: true };
  } catch (err) {
    // TODO: add proper error handling
    logger.warn('Queue send failed', err.message);
    return { success: true }; // ← lied about success to not break callers
  }
}

// What we needed
async function enqueueNotification(payload) {
  const result = await queueClient.send(payload); // let it throw
  metrics.increment('queue.send.success');
  return result;
}

// And a dead letter queue handler that actually alerts:
queueClient.on('error', (err) => {
  metrics.increment('queue.send.error');
  alerts.fire('QUEUE_SEND_FAILURE', { err, severity: 'critical' });
});
We discovered the issue at 7:15 PM — not from monitoring, but because a user emailed support saying their password reset link never arrived. Support checked three more accounts. Same story. Someone pinged me. I checked the logs. My stomach dropped.
"Emails have been failing since 4:47 PM. That's two hours and twenty-eight minutes of silent failure, approximately 1,400 undelivered notifications, and zero alerts fired."
The Rollback That Wasn't
Here's where it got worse. Our "rollback" procedure was: SSH into the box, pull the previous Docker image tag, run docker-compose up again. Thirty seconds, right?
Except the previous image tag was latest. We hadn't been tagging images with
commit SHAs. The "previous" image was whatever had been in the registry before the build —
which turned out to be a build from three weeks ago that had a different database migration
state.
OUR "ROLLBACK" PROCESS
──────────────────────────────────────────────────
Tag Strategy: latest ← overwrites on every build

Timeline:
  Week 1  [build] → :latest (v1)
  Week 2  [build] → :latest (v2, overwrites v1)
  Week 3  [build] → :latest (v3, overwrites v2)
  Friday  [build] → :latest (v4, broken)

  Rollback attempt → pulls :latest → gets v4 (same broken build)
  ❌ No previous image available. Rollback impossible.

WHAT WE SHOULD HAVE HAD
──────────────────────────────────────────────────
Tag Strategy: commit SHA + semver + latest alias

  [build] → :abc1234 + :v2.4.1 + :latest
  [build] → :def5678 + :v2.4.2 + :latest

  Rollback → docker pull myapp:abc1234
  ✅ Any previous version instantly available.
We ended up doing a manual hotfix — patched the env var directly on the server, restarted the container, verified notifications were flowing. Six hours from deploy to resolution.
The Rebuild
The following week, I rewrote the entire pipeline. Not because anyone asked me to — because I couldn't sleep knowing it could happen again. Here's what changed:
- Image tagging: Every build is tagged with the short commit SHA from git rev-parse --short HEAD. The :latest tag still exists, but as an alias, never the only tag.
- Environment validation: A startup script reads a .env.required manifest and fails loudly if any variable is missing or empty before the app boots.
- Deployment windows: A GitHub Actions check that fails the deploy job if the current time is Friday after 3 PM or Saturday/Sunday. Enforced, not advisory.
- Smoke tests post-deploy: After every deploy, a script hits 12 critical endpoints and checks response codes + response shape. If anything fails, auto-rollback triggers.
- Dead letter queues with alerts: Any queue failure now fires a PagerDuty alert within 60 seconds, not 2.5 hours.
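The environment-validation step is the one I'd urge anyone to steal first. A minimal sketch in bash — the .env.required file name is ours, but the implementation details here are assumptions:

```shell
#!/usr/bin/env bash
# validate-env.sh (sketch) — refuse to boot if any required variable is absent.
# .env.required is a plain manifest, one variable name per line, e.g.:
#   NOTIFY_QUEUE_URL
#   DATABASE_URL
set -euo pipefail

validate_env() {
  local manifest="$1" missing=0
  while IFS= read -r var; do
    [ -z "$var" ] && continue               # skip blank lines
    if [ -z "${!var:-}" ]; then             # unset OR empty both fail
      echo "FATAL: required env var '$var' is missing or empty" >&2
      missing=1
    fi
  done < "$manifest"
  return "$missing"
}

# Run before exec-ing the app, e.g. in the container entrypoint:
# validate_env .env.required && exec node server.js
```

The crucial property is that it fails before the app binds a port, so a half-configured deploy dies where the orchestrator can see it instead of quietly serving traffic.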
jobs:
  check-deploy-window:
    runs-on: ubuntu-latest
    steps:
      - name: Enforce deployment window
        run: |
          DAY=$(date +%u)   # 1=Mon ... 7=Sun
          HOUR=$(date +%H)  # 00-23 UTC (adjust for your TZ)
          # Block all of Sat/Sun, plus Friday from 10:00 UTC (~3 PM IST)
          if [ "$DAY" -ge 6 ] || { [ "$DAY" -eq 5 ] && [ "$HOUR" -ge 10 ]; }; then
            echo "❌ Deploys blocked: Friday after 3 PM IST or weekend."
            echo "   Open a break-glass PR to override (requires 2 approvals)."
            exit 1
          fi
          echo "✅ Deploy window is open."

  deploy:
    needs: [check-deploy-window, test]
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4   # needed so git rev-parse has a repo
      - name: Build and tag image
        run: |
          SHA=$(git rev-parse --short HEAD)
          docker build -t myapp:$SHA -t myapp:latest .
          docker push myapp:$SHA
          docker push myapp:latest
          echo "IMAGE_TAG=$SHA" >> $GITHUB_ENV
      - name: Deploy
        run: ./scripts/deploy.sh ${{ env.IMAGE_TAG }}
      - name: Smoke test
        run: ./scripts/smoke-test.sh
        # On failure, this job fails and deploy.sh rollback hook fires
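For completeness, here's roughly what scripts/smoke-test.sh can look like — the endpoint paths and expected codes below are placeholders, not our real twelve:

```shell
#!/usr/bin/env bash
# smoke-test.sh (sketch) — hit critical endpoints, fail the job on any miss.
set -euo pipefail

BASE_URL="${BASE_URL:-https://app.example.com}"   # assumed env var

check() {  # usage: check <path> <expected-status>
  local path="$1" expected="$2" status
  status=$(curl -s -o /dev/null -w '%{http_code}' "$BASE_URL$path")
  if [ "$status" != "$expected" ]; then
    echo "SMOKE FAIL: $path returned $status, expected $expected" >&2
    return 1
  fi
  echo "ok: $path ($status)"
}

# The real script listed 12 endpoints and also piped bodies through jq
# to verify response shape, e.g.:
#   check /healthz            200
#   check /api/v1/status      200
```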
What I Learned That Couldn't Come From a Book
The Friday rule gets mocked. Engineers call it superstition. It isn't. It's a forcing function — it makes you ask "is this urgent enough to deploy right now, or can it wait until Monday?" Almost always, the answer is Monday. And if it genuinely can't wait, you have the break-glass process. You've made the risk explicit and required two people to agree on it.
The silent failure pattern is the dangerous one. Noisy failures are healthy — they're feedback. Silent failures erode trust in your system and are far harder to diagnose, because by the time you find them you've lost the causal proximity to the change that caused them.
And rollbacks are only real if you can execute them in under five minutes without consulting a runbook. If your rollback procedure requires careful thought, it will fail you under pressure.