How a DigitalOcean Firewall Rule Silently Dropped 23% of Production Traffic for 11 Days
For 11 days, roughly 1 in 4 users hitting our platform got a timeout instead of a response. CPU was at 18%. Memory at 34%. Nginx access logs showed nothing unusual. Error rate in our APM: 0.2%, well within normal. The failing connections never completed at all: their packets were being silently discarded by a firewall rule I had written myself, three weeks earlier, while setting up a new Droplet.
Production Failure
The symptom reports started on a Wednesday. Sporadic — some users would get a timeout, refresh, and it would load fine. Support received 14 tickets over two days, all variations of "the site is slow sometimes." Not slow enough to trigger our synthetic monitoring (which tested every 5 minutes from a fixed IP). Not consistent enough to reproduce on demand.
The affected users had one thing in common we didn't notice for a week: they were all on mobile carriers. Specifically, mobile carriers that use CGNAT (Carrier-Grade NAT) — a technique where thousands of users share a single public IP, using high-numbered ephemeral source ports (typically 32768–60999) to distinguish connections.
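The port-sharing mechanics are easy to picture with a toy translation table. This is a sketch: the carrier IP and subscriber names are placeholders, and only the ephemeral port range comes from the description above.

```python
# Toy CGNAT translation table: many subscribers share one public IP,
# and each flow is distinguished only by its translated high ephemeral
# source port. IP and subscriber names are illustrative placeholders.
PUBLIC_IP = "203.0.113.9"          # placeholder carrier-side address
EPHEMERAL_START = 32768            # bottom of the range cited above

nat_table = {}
for offset, subscriber in enumerate(["alice", "bob", "carol"]):
    # (public IP, translated source port) -> subscriber behind the NAT
    nat_table[(PUBLIC_IP, EPHEMERAL_START + offset)] = subscriber

# All flows share one IP; only the source port tells them apart, and
# that port becomes the DESTINATION port of every reply packet.
assert len({ip for ip, _ in nat_table}) == 1
assert sorted(port for _, port in nat_table) == [32768, 32769, 32770]
```

From the server's side, three different people look like one IP address with three different high ports, which is exactly why a rule written in terms of port ranges hits these users as a group.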
False Assumptions: Everything We Blamed First
Week one was a tour through every wrong answer:
- Nginx worker connections — checked worker_processes and worker_connections; both were correctly sized for our load. Active connections never exceeded 40% of capacity.
- Docker networking — suspected the bridge network between containers was dropping packets under load. Added inter-container latency metrics. Clean.
- Node.js keep-alive misconfiguration — the API server's keep-alive timeout was shorter than the load balancer's, a known source of premature connection resets. Corrected it. Timeouts continued.
- DigitalOcean Droplet network limits — checked bandwidth, PPS limits. Nothing close to the ceiling.
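The keep-alive item deserves one concrete note, since it is a real failure mode even though it wasn't ours: the hop closer to the client must give up on an idle connection first, or the proxy can forward a request into a connection the backend is closing at that same instant. A minimal sanity check of the ordering (the numbers are illustrative, not our production values):

```python
# Keep-alive ordering invariant: the backend must hold idle connections
# open longer than the proxy/load balancer in front of it.
# Timeout values below are illustrative placeholders.
lb_idle_timeout_s = 60        # load balancer idle timeout
backend_keepalive_s = 65      # API server keep-alive; must be larger

def keepalive_ordering_ok(proxy_timeout: float, backend_timeout: float) -> bool:
    """True when the backend outlives the proxy on idle connections."""
    return backend_timeout > proxy_timeout

# The corrected shape: backend strictly longer than the proxy
assert keepalive_ordering_ok(lb_idle_timeout_s, backend_keepalive_s)
# The misconfigured shape: backend shorter, inviting premature resets
assert not keepalive_ordering_ok(60, 30)
```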
Every metric we knew how to measure was normal. The problem was that all of those metrics lived on the server, and the failure was happening on the network path outside it, where none of our instrumentation could see.
Finding It: tcpdump at the Edge
The break came from running tcpdump directly on the Droplet's public interface while a user on a mobile connection reported a timeout in real time (a support call we orchestrated). The client's TCP SYN arrived at the Droplet's NIC and the kernel answered with a SYN-ACK, but the same SYN kept retransmitting every few seconds: the reply was never reaching the client. The connection was dying somewhere between the Droplet and the user, before Nginx ever saw an accepted connection.
That pointed squarely at the firewall. I opened the DigitalOcean Cloud Firewall rules for the Droplet and stared at something I had written three weeks earlier:
# What I intended: allow inbound HTTP and HTTPS from anywhere
Inbound Rules:
TCP port 80 sources: All IPv4, All IPv6 ✓
TCP port 443 sources: All IPv4, All IPv6 ✓
TCP port 22 sources: [my office IP] ✓
# What I had actually added while "cleaning up" the rules:
TCP port 1024-65535 sources: [specific IP range] ← this one
# What that rule does:
# DigitalOcean Cloud Firewalls are STATELESS for inbound rules.
# A TCP reply from the server back to a CGNAT client uses the client's
# ephemeral source port as the DESTINATION port on the return path.
# Ports 32768–60999 (CGNAT ephemeral range) fall within 1024–65535.
# The rule was RESTRICTING return traffic to a specific IP range,
# silently dropping SYN-ACK and data packets back to mobile clients.
Root Cause: Stateless Firewall + CGNAT Ephemeral Ports
DigitalOcean Cloud Firewalls evaluate each packet independently; they don't track TCP connection state. An inbound rule covering port range 1024–65535 therefore doesn't just govern new connections to those ports on the Droplet: because the firewall has no notion of "this packet belongs to an established connection," it also catches the return leg of any connection whose client chose a source port in that range.
CGNAT clients use ephemeral ports in the 32768–60999 range as their source port. When our server sent a TCP response, the destination port was the client's source port — a port in the range the firewall rule was restricting. The firewall dropped the response. The client saw a timeout. The server logged nothing, because as far as it was concerned, it had sent the packet successfully.
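The failure mode described above can be reduced to a toy model. This is an illustration of per-packet, stateless allow-list evaluation as this postmortem describes it, not DigitalOcean's actual implementation; the "allowed-range" token stands in for the rule's source restriction, whose real value is elided here.

```python
# Toy model of stateless, per-packet allow-list evaluation as described
# above. Not DigitalOcean's implementation; "allowed-range" is a stand-in
# for the rogue rule's source restriction.
RULES = [
    {"ports": range(80, 81),       "source": "any"},
    {"ports": range(443, 444),     "source": "any"},
    {"ports": range(1024, 65536),  "source": "allowed-range"},  # rogue rule
]

def allowed(dst_port: int, src: str) -> bool:
    """Judge one packet in isolation -- no connection tracking."""
    return any(
        dst_port in r["ports"] and r["source"] in ("any", src)
        for r in RULES
    )

# A client SYN to 443 passes no matter where it came from
assert allowed(443, src="cgnat-carrier")
# The server's reply toward the client's ephemeral port 44821 matches
# only the rogue rule, and the server is not in its allowed source
# range: the packet is dropped, and the client times out
assert not allowed(44821, src="server")
```

With connection tracking, the second packet would be waved through as part of an established flow; without it, every packet has to independently justify itself against the rule list.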
NORMAL CONNECTION (non-CGNAT client, source port 55000)
─────────────────────────────────────────────────────────────────
Client Firewall Droplet / Nginx
(port 55000)
│ │ │
│── SYN ──────────▶ │ port 443 ✓ ALLOW ──▶│
│ │ │── SYN-ACK ──▶ return path
│◀──────────────────┼─────────────────────│ (dst port 55000)
│ │ │
│ Connection established ✓ │
CGNAT CLIENT (source port 44821 — within rule range 1024-65535)
─────────────────────────────────────────────────────────────────
Client Firewall Droplet / Nginx
(CGNAT port 44821)
│ │ │
│── SYN ──────────▶ │ port 443 ✓ ALLOW ──▶│
│ │ │── SYN-ACK ──▶
│ │◀──── return packet ──│ (dst port 44821)
│ │ │
│ │ port 44821 — matches rule 1024-65535
│ │ source: server IP — NOT in allowed range
│ │ DROPPED silently ✗ │
│ │ │
× Timeout after 30s │ │
(client never │ │
gets SYN-ACK) │ │
Architecture Fix: Remove the Rule, Understand the Firewall Model
The fix was a single rule deletion. The rogue 1024–65535 rule had no legitimate
purpose — it was added during a "security hardening" session where I misread DigitalOcean's
documentation and confused their stateless Cloud Firewall with a stateful iptables setup.
We chose not to switch to a stateful firewall (iptables with
conntrack inside the Droplet) because the Cloud Firewall is simpler to
audit across multiple Droplets from a central place. Instead, we adopted a strict
review process: every firewall change now requires a second engineer to read it against
the DigitalOcean statefulness documentation before applying.
FIREWALL RULE CHANGE PROCESS (after incident)
─────────────────────────────────────────────────────────────────
Engineer proposes firewall change
│
▼
Document: What port/range? What source? What protocol?
│
▼
Ask: Is this Cloud Firewall (stateless) or iptables (stateful)?
│
├── Cloud Firewall ──▶ Does this rule affect RETURN TRAFFIC
│ from legitimate connections?
│ │
│ ├── YES → redesign or use iptables
│ └── NO → second engineer review → apply
│
└── iptables ──────▶ Standard review → apply
CORRECTED INBOUND RULES (after fix)
─────────────────────────────────────────────────────────────────
TCP port 80 sources: All IPv4, All IPv6 ✓
TCP port 443 sources: All IPv4, All IPv6 ✓
TCP port 22 sources: [trusted IPs only] ✓
(no port-range rules — ever)
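The "no port-range rules" policy is mechanically checkable, so it doesn't have to rely on reviewer vigilance alone. A hedged sketch of such a lint; the rule shape is illustrative, not doctl or API output:

```python
# Lint for the "no port-range rules" policy above. The rule dicts are an
# illustrative shape; real rules would come from your firewall tooling.
ALLOWED_INBOUND_PORTS = {22, 80, 443}

def violates_policy(rule: dict) -> bool:
    """Flag any inbound rule spanning more than one port, or allowing
    a single port outside the expected set."""
    lo, hi = rule["port_range"]
    return lo != hi or lo not in ALLOWED_INBOUND_PORTS

rules = [
    {"port_range": (443, 443)},      # fine
    {"port_range": (22, 22)},        # fine
    {"port_range": (1024, 65535)},   # the shape that caused this incident
]
assert [violates_policy(r) for r in rules] == [False, False, True]
```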
Lessons Learned
- Stateless firewalls require a different mental model. iptables with conntrack tracks TCP state — an established connection's return traffic is automatically allowed. Cloud Firewalls evaluate every packet in isolation. A port-range rule that looks like "allow high ports inbound" is also a rule about return-path traffic to those ports.
- CGNAT is the new normal for mobile. A large fraction of mobile users share public IPs and use high ephemeral source ports. Any firewall rule touching 1024–65535 will affect them disproportionately.
- Silent drops are invisible to server-side monitoring. Packets killed before they reach the NIC never appear in Nginx logs, APM, or error tracking. Add client-side error monitoring (JS error boundaries, mobile crash reporting) that captures network-level failures.
- tcpdump at the NIC is the last-resort oracle. When server logs show nothing and clients report timeouts, run tcpdump on the Droplet's public interface during a live failure. If the SYN arrives and no SYN-ACK leaves, the OS or a local firewall is the suspect; if the SYN-ACK leaves and the client still times out, something between the Droplet and the client is discarding it: in our case, the Cloud Firewall.