How a DigitalOcean Firewall Rule Silently Dropped 23% of Production Traffic for 11 Days
March 13, 2026 · Docker · 9 min read


For 11 days, roughly 1 in 4 users hitting our platform got a timeout instead of a response. CPU sat at 18%. Memory at 34%. Nginx access logs showed nothing unusual. Error rate in our APM: 0.2%, well within normal. The failing connections never showed up server-side at all: their packets were being silently discarded by a firewall rule I had written myself, three weeks earlier, while setting up a new Droplet.

Production Failure

The symptom reports started on a Wednesday. Sporadic — some users would get a timeout, refresh, and it would load fine. Support received 14 tickets over two days, all variations of "the site is slow sometimes." Not slow enough to trigger our synthetic monitoring (which tested every 5 minutes from a fixed IP). Not consistent enough to reproduce on demand.

The affected users had one thing in common we didn't notice for a week: they were all on mobile carriers. Specifically, mobile carriers that use CGNAT (Carrier-Grade NAT) — a technique where thousands of users share a single public IP, using high-numbered ephemeral source ports (typically 32768–60999) to distinguish connections.
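The port arithmetic here is easy to pin down. A minimal sketch (the range is the one quoted above, which matches the Linux default `ip_local_port_range`; the helper name is ours):

```python
# Ephemeral source-port range typically handed out behind CGNAT
# (the Linux default ip_local_port_range: 32768-60999 inclusive).
CGNAT_EPHEMERAL = range(32768, 61000)

def is_cgnat_ephemeral(src_port: int) -> bool:
    """True if a client source port falls in the CGNAT ephemeral range."""
    return src_port in CGNAT_EPHEMERAL

assert is_cgnat_ephemeral(44821)
assert not is_cgnat_ephemeral(443)
```

Note that every port in this range also sits inside the much broader 1024-65535 span, which is exactly why a rule touching high ports hits CGNAT users disproportionately.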

  • 11 days: issue live in production
  • 23% of requests silently dropped
  • 0 server-side errors logged
  • ~4,300 affected sessions (estimated)

False Assumptions: Everything We Blamed First

Week one was a tour through every wrong answer:

  • Nginx worker connections — checked worker_processes and worker_connections, both correctly sized for our load. Active connections never exceeded 40% of capacity.
  • Docker networking — suspected the bridge network between containers was dropping packets under load. Added inter-container latency metrics. Clean.
  • Node.js keep-alive misconfiguration — the API server's keep-alive timeout was shorter than the load balancer's, a known source of premature connection resets. Corrected it. Timeouts continued.
  • DigitalOcean Droplet network limits — checked bandwidth, PPS limits. Nothing close to the ceiling.
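One of those ruled-out suspects is worth pinning down, because it bites often in other incidents: the upstream server's keep-alive timeout must outlast the load balancer's idle timeout, or the server may close a connection the LB is about to reuse. A minimal invariant check (the numbers below are illustrative, not our real settings):

```python
def keepalive_ordering_ok(server_keepalive_s: float, lb_idle_timeout_s: float) -> bool:
    """The upstream server must keep idle connections open LONGER than the
    load balancer's idle timeout, so the LB never reuses a connection the
    server has already decided to close."""
    return server_keepalive_s > lb_idle_timeout_s

assert not keepalive_ordering_ok(5, 60)   # the bad ordering: server closes first
assert keepalive_ordering_ok(75, 60)      # the safe ordering
```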

Every metric we knew how to measure was normal. We were measuring the server, but the failure was happening on the network path outside it, where no server-side metric could see it.

Finding It: tcpdump at the Edge

The break came from running tcpdump directly on the Droplet's public interface while a user on a mobile connection reported a timeout in real time (a support call we orchestrated). The user's TCP SYN arrived at the Droplet's NIC, and the kernel answered with a SYN-ACK, but the client never received it. Nothing on the Droplet itself was dropping packets; the reply was being lost in transit, after it left the NIC.
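The decision tree we ran through at the NIC can be written down. This is a toy classifier over which handshake packets appear in a capture, not a real tcpdump parser:

```python
def diagnose_handshake(packets_seen: set) -> str:
    """Classify a stalled TCP handshake from the packets visible at the NIC.
    packets_seen may contain: 'SYN' (client->server), 'SYN-ACK'
    (server->client), 'ACK' (client->server, completes the handshake)."""
    if "SYN" not in packets_seen:
        return "inbound drop: client traffic never arrived"
    if "SYN-ACK" not in packets_seen:
        return "host problem: server never replied"
    if "ACK" not in packets_seen:
        return "return-path drop: reply left the NIC but never reached the client"
    return "handshake completed"
```

Each branch sends the investigation somewhere different: the first two point at the firewall or the host, the third points off-box, at whatever sits between the NIC and the client.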

That pointed squarely at the firewall. I opened the DigitalOcean Cloud Firewall rules for the Droplet and stared at something I had written three weeks earlier:

DigitalOcean Cloud Firewall — inbound rules (the broken config)
# What I intended: allow inbound HTTP and HTTPS from anywhere
Inbound Rules:
  TCP  port 80   sources: All IPv4, All IPv6   ✓
  TCP  port 443  sources: All IPv4, All IPv6   ✓
  TCP  port 22   sources: [my office IP]       ✓

# What I had actually added while "cleaning up" the rules:
  TCP  port 1024-65535  sources: [specific IP range]   ← this one

# What that rule does:
# DigitalOcean Cloud Firewalls are STATELESS for inbound rules.
# A TCP reply from the server back to a CGNAT client uses the client's
# ephemeral source port as the DESTINATION port on the return path.
# Ports 32768–60999 (CGNAT ephemeral range) fall within 1024–65535.
# The rule was RESTRICTING return traffic to a specific IP range,
# silently dropping the server's reply packets (SYN-ACKs included)
# before they reached mobile clients.

Root Cause: Stateless Firewall + CGNAT Ephemeral Ports

DigitalOcean Cloud Firewalls evaluate each packet independently — they don't track TCP connection state. This means an inbound rule covering port range 1024–65535 applies to incoming packets destined for those ports on the Droplet, but it also implicitly affects the return path of connections originating from clients whose source port falls in that range.

CGNAT clients use ephemeral ports in the 32768–60999 range as their source port. When our server sent a TCP response, the destination port was the client's source port — a port in the range the firewall rule was restricting. The firewall dropped the response. The client saw a timeout. The server logged nothing, because as far as it was concerned, it had sent the packet successfully.
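The failure mode can be reproduced in miniature. Below is a toy per-packet allow-list with no connection tracking, mirroring the behavior described above; it is illustrative only, not DigitalOcean's implementation, and the IPs are RFC 5737 documentation addresses:

```python
from ipaddress import ip_address, ip_network

# Toy stateless allow-list: (low_port, high_port, allowed_source_networks).
RULES = [
    (80,   80,    [ip_network("0.0.0.0/0")]),
    (443,  443,   [ip_network("0.0.0.0/0")]),
    (1024, 65535, [ip_network("203.0.113.0/24")]),  # the rogue rule
]

def allowed(src_ip: str, dst_port: int) -> bool:
    """No connection tracking: each packet is evaluated in isolation
    against the allow-list; a packet matching no rule is dropped."""
    src = ip_address(src_ip)
    return any(
        lo <= dst_port <= hi and any(src in net for net in nets)
        for lo, hi, nets in RULES
    )

# The client's SYN to port 443 passes...
assert allowed("198.51.100.7", 443)
# ...but the reply, destined for the client's ephemeral port 44821,
# matches only the rogue range rule, whose source list does not cover
# the sender, so it is silently dropped.
assert not allowed("192.0.2.10", 44821)
```

With connection tracking, the second packet would be waved through as part of an established flow; without it, the firewall judges the reply as if it were a brand-new connection to a high port.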

  NORMAL CONNECTION (non-CGNAT client, source port 55000)
  ─────────────────────────────────────────────────────────────────

  Client              Firewall             Droplet / Nginx
  (port 55000)
      │                   │                     │
      │── SYN ──────────▶ │ port 443 ✓ ALLOW ──▶│
      │                   │                     │── SYN-ACK ──▶ return path
      │◀──────────────────┼─────────────────────│  (dst port 55000)
      │                   │                     │
      │  Connection established ✓               │


  CGNAT CLIENT (source port 44821 — within rule range 1024-65535)
  ─────────────────────────────────────────────────────────────────

  Client              Firewall             Droplet / Nginx
  (CGNAT port 44821)
      │                   │                     │
      │── SYN ──────────▶ │ port 443 ✓ ALLOW ──▶│
      │                   │                     │── SYN-ACK ──▶
      │                   │◀──── return packet ──│  (dst port 44821)
      │                   │                     │
      │                   │ port 44821 — matches rule 1024-65535
      │                   │ source: server IP — NOT in allowed range
      │                   │ DROPPED silently ✗  │
      │                   │                     │
      ×  Timeout after 30s │                     │
         (client never     │                     │
          gets SYN-ACK)    │                     │

Architecture Fix: Remove the Rule, Understand the Firewall Model

The fix was a single rule deletion. The rogue 1024–65535 rule had no legitimate purpose — it was added during a "security hardening" session where I misread DigitalOcean's documentation and confused their stateless Cloud Firewall with a stateful iptables setup.

We chose not to switch to a stateful firewall (iptables with conntrack inside the Droplet) because the Cloud Firewall is simpler to audit across multiple Droplets from a central place. Instead, we adopted a strict review process: every firewall change now requires a second engineer to read it against the DigitalOcean statefulness documentation before applying.

  FIREWALL RULE CHANGE PROCESS (after incident)
  ─────────────────────────────────────────────────────────────────

  Engineer proposes firewall change
          │
          ▼
  Document: What port/range? What source? What protocol?
          │
          ▼
  Ask: Is this Cloud Firewall (stateless) or iptables (stateful)?
          │
          ├── Cloud Firewall ──▶ Does this rule affect RETURN TRAFFIC
          │                      from legitimate connections?
          │                          │
          │                          ├── YES → redesign or use iptables
          │                          └── NO  → second engineer review → apply
          │
          └── iptables ──────▶ Standard review → apply


  CORRECTED INBOUND RULES (after fix)
  ─────────────────────────────────────────────────────────────────

  TCP  port 80    sources: All IPv4, All IPv6  ✓
  TCP  port 443   sources: All IPv4, All IPv6  ✓
  TCP  port 22    sources: [trusted IPs only]  ✓
  (no port-range rules — ever)
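The "no port-range rules" policy is mechanical enough to lint before review. A hypothetical pre-review check (the approved-port set and the checks are our policy, not anything DigitalOcean enforces):

```python
APPROVED_SINGLE_PORTS = {22, 80, 443}

def lint_inbound_rule(port_lo: int, port_hi: int) -> list:
    """Flag proposed inbound rules that violate the post-incident policy."""
    problems = []
    if port_lo != port_hi:
        problems.append("port ranges are banned; add single-port rules instead")
    elif port_lo not in APPROVED_SINGLE_PORTS:
        problems.append(f"port {port_lo} is not in the approved set "
                        f"{sorted(APPROVED_SINGLE_PORTS)}")
    # Any overlap with 1024-65535 collides with the ephemeral/CGNAT space.
    if max(port_lo, 1024) <= min(port_hi, 65535):
        problems.append("overlaps the ephemeral/CGNAT port space (1024-65535)")
    return problems

assert lint_inbound_rule(443, 443) == []
assert len(lint_inbound_rule(1024, 65535)) == 2  # the rule from this incident
```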

Lessons Learned

  • Stateless firewalls require a different mental model. iptables with conntrack tracks TCP state — an established connection's return traffic is automatically allowed. Cloud Firewalls evaluate every packet in isolation. A port-range rule that looks like "allow high ports inbound" is also a rule about return-path traffic to those ports.
  • CGNAT is the new normal for mobile. A large fraction of mobile users share IPs and use high ephemeral ports. Any firewall rule touching 1024–65535 will affect them disproportionately.
  • Silent drops are invisible to server-side monitoring. Packets killed before they reach the NIC never appear in Nginx logs, APM, or error tracking. Add client-side error monitoring (JS error boundaries, mobile crash reporting) that captures network-level failures.
  • tcpdump at the NIC is the last-resort oracle. When server logs show nothing and clients report timeouts, run tcpdump on the Droplet's public interface during a live failure — if the SYN arrives and no SYN-ACK leaves, the OS or firewall is the suspect, not the application.
It wasn't the server. It was never the server. It was the thing in front of the server.