Our AI Rate Limiter Decided 0 Requests Per Second Was the Correct Limit at 2 AM
At 2:09 AM on a Saturday, our on-call phone started ringing. PagerDuty: API error rate 98%.
Within 30 seconds it was 100%. Every single request to our public API was returning 429.
Not throttled — blocked entirely. The rate limiter logs showed the active limit:
requests_per_second: 0. Our AI-powered adaptive rate limiter had decided
the correct number of requests to allow was zero. It had been wrong about what it was protecting us from.
The System We Built
Three months earlier we had shipped what felt like an elegant solution to a real problem. Static rate limits are too conservative during normal traffic and too permissive during genuine attacks. We wanted limits that adapted automatically: tighten when traffic looked anomalous, relax when it looked normal.
The system worked like this: every 60 seconds, a Lambda function pulled the last 10 minutes
of API traffic metrics — request volume, error rates, unique IPs, request distribution by
endpoint — and sent them to GPT-4 with a prompt that asked it to classify the traffic as
normal, elevated, suspicious, or attack,
and recommend a rate limit adjustment.
ADAPTIVE RATE LIMITER ARCHITECTURE
──────────────────────────────────────────────────────────────
API Gateway → CloudWatch metrics (rolling 10m window)
│
▼ (every 60s)
Lambda: pull metrics
│
▼
GPT-4 classification prompt
┌──────────────────────────────┐
│ Traffic data: { │
│ rps: 847, │
│ unique_ips: 12, │
│ top_endpoint: /sync (94%) │
│ error_rate: 0.3% │
│ } │
│ Classify and recommend limit │
└──────────────────────────────┘
│
▼
Apply limit to API Gateway usage plan
(via AWS SDK UpdateUsagePlan)
Normal: → maintain current limit (1000 rps)
Elevated: → reduce 20% (800 rps)
Suspicious: → reduce 50% (500 rps)
Attack: → reduce 90% (100 rps)
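The classification-to-limit mapping above can be sketched as follows. Names and rounding are illustrative, not our production code; note that the deployed Lambda also trusted the model's recommended_rps field directly rather than only this table.

```typescript
type Classification = 'normal' | 'elevated' | 'suspicious' | 'attack';

// Mirrors the table above: fraction of the current limit to cut.
const REDUCTION: Record<Classification, number> = {
  normal: 0.0,     // maintain current limit
  elevated: 0.2,   // reduce 20%
  suspicious: 0.5, // reduce 50%
  attack: 0.9,     // reduce 90%
};

function limitFor(classification: Classification, currentRps: number): number {
  return Math.round(currentRps * (1 - REDUCTION[classification]));
}
```

At a 1000 rps baseline this reproduces the table: 1000, 800, 500, 100.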
It worked well for 11 weeks: it correctly tightened limits during two genuine scraping attempts and loosened them after a product-launch traffic spike it identified as legitimate. We were proud of it. We stopped watching it closely.
What Happened at 2:07 AM
Our mobile app has a scheduled background sync that fires at 2:00 AM UTC on every device. It uploads user activity data, pulls down fresh content, and reconciles local state. With our user base concentrated between UTC+0 and UTC+5:30, that meant a burst of sync requests hitting our API between 2:00 AM and 2:15 AM UTC.
That night, we had pushed a mobile app update that expanded the sync payload — more endpoints hit, larger request bodies, higher request count per device. The burst was roughly 4x larger than the previous Saturday's 2 AM window. From the outside it looked like this:
{
"window": "2026-03-21T01:57:00Z to 2026-03-21T02:07:00Z",
"requests_per_second": 3847,
"vs_previous_saturday_same_window": "+412%",
"unique_ips": 8,
"unique_user_agents": 3,
"top_endpoint_pct": 94,
"top_endpoint": "/api/v2/sync",
"error_rate_pct": 0.2,
"avg_response_ms": 340,
"geographic_concentration": "high — 91% from 3 ASNs"
}
To a human with context, this is obviously a mobile sync burst: high volume, low unique IPs (they're NATed mobile users), concentrated on one endpoint, low error rate, consistent user agents. To GPT-4, given only the numbers and no knowledge of our mobile app's sync schedule, it looked like a textbook DDoS signature: massive volume spike, low IP diversity, single-endpoint concentration, coordinated timing.
{
"classification": "attack",
"confidence": 0.91,
"reasoning": "Traffic volume increased 412% versus baseline with high geographic and endpoint concentration. Low unique IP count with high request volume is consistent with botnet or amplification attack pattern. Recommend immediate aggressive rate limiting.",
"recommended_rps": 0,
"recommended_action": "Block all traffic until manual review"
}
The Lambda applied the recommendation. UpdateUsagePlan set the limit to 0.
Every request from every user — mobile sync, web dashboard, third-party integrations — started
returning 429. At 2:09 AM, PagerDuty fired.
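For reference, the shape of that call, as a sketch with an illustrative usage plan ID: API Gateway's UpdateUsagePlan takes JSON-patch operations, and the throttle rate lives at /throttle/rateLimit, with values sent as strings.

```typescript
// Builds the parameters the Lambda passed to UpdateUsagePlan.
// In AWS SDK v3 this object would go into an UpdateUsagePlanCommand
// from "@aws-sdk/client-api-gateway".
function buildRateLimitPatch(usagePlanId: string, rps: number) {
  return {
    usagePlanId,
    patchOperations: [
      { op: 'replace' as const, path: '/throttle/rateLimit', value: String(rps) },
    ],
  };
}
```

With rps set to 0, the string '0' goes out and the gateway rejects everything.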
The Response and the 23-Minute Outage
The on-call engineer woke up, looked at the rate limiter dashboard, saw rps: 0,
and immediately knew something was wrong. We had no manual override button — the assumption
had been that AI-managed limits wouldn't need one. Restoring the limit required finding the
Lambda, understanding the API Gateway usage plan structure, and manually calling
UpdateUsagePlan with a sane value.
That took 23 minutes at 2 AM. Full API outage for 23 minutes on a Saturday night. 1,847 mobile sync jobs failed and had to retry. 340 active web sessions got error screens. Three enterprise integrations logged failures and sent their on-call alerts.
What We Changed
We made four changes before re-enabling the system.
Hard floor on rate limit reductions. No single automated decision can reduce the limit by more than 70%. If the model recommends 0, it gets capped at 30% of current. Humans get paged for anything beyond that.
const MAX_SINGLE_REDUCTION = 0.70; // never cut more than 70% in one step

// `logger` and `notifyOnCall` are our existing observability helpers.
function applyRateLimitRecommendation(current: number, recommended: number): number {
  const floor = Math.ceil(current * (1 - MAX_SINGLE_REDUCTION)); // e.g. 1000 rps → floor of 300
  if (recommended < floor) {
    logger.warn({ current, recommended, floor }, 'Rate limit recommendation exceeded safety floor — capping');
    notifyOnCall('Rate limiter wanted to set RPS below safety floor. Capped. Manual review needed.');
    return floor;
  }
  return recommended;
}
Known traffic patterns as context. The prompt now includes a schedule of known high-volume events: mobile sync windows, nightly batch jobs, weekly report generation. GPT-4 can now reason about whether a spike matches a known pattern before classifying it.
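A minimal sketch of that schedule check, with illustrative event windows (the real schedule lives in config, and these names and multipliers are made up):

```typescript
interface KnownEvent {
  name: string;
  startUtcMinute: number; // minutes after 00:00 UTC
  endUtcMinute: number;
  expectedRpsMultiplier: number; // rough multiple of baseline traffic
}

// Illustrative entries only.
const KNOWN_EVENTS: KnownEvent[] = [
  { name: 'mobile background sync', startUtcMinute: 115, endUtcMinute: 140, expectedRpsMultiplier: 5 },
  { name: 'nightly batch export', startUtcMinute: 180, endUtcMinute: 210, expectedRpsMultiplier: 2 },
];

// Matches get serialized into the classification prompt so the model can
// weigh a spike against expected windows before labeling it an attack.
function matchingEvents(now: Date): KnownEvent[] {
  const minute = now.getUTCHours() * 60 + now.getUTCMinutes();
  return KNOWN_EVENTS.filter(e => minute >= e.startUtcMinute && minute <= e.endUtcMinute);
}
```

Had this existed on March 21, the 2:07 AM metrics would have arrived alongside "known event: mobile background sync, expected ~5x baseline."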
Manual override endpoint. A single authenticated API call can restore the rate limit to a safe default in under 2 seconds. The on-call runbook now starts with this step.
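The override itself is deliberately simple. A sketch, assuming a hypothetical applyLimit wrapper around UpdateUsagePlan and a hardcoded known-good default:

```typescript
const SAFE_DEFAULT_RPS = 1000; // our normal baseline limit

// One authenticated call, one hardcoded value: nothing for a tired
// engineer to figure out at 2 AM.
function handleOverride(isAuthorized: boolean, applyLimit: (rps: number) => void): number {
  if (!isAuthorized) {
    throw new Error('unauthorized');
  }
  applyLimit(SAFE_DEFAULT_RPS);
  return SAFE_DEFAULT_RPS;
}
```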
Staged rollout of limit changes. Reductions now happen in steps: 20% now, wait 2 minutes, assess again, reduce further if still anomalous. Zero can never be reached in a single step regardless of what the model recommends.
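The stepping logic is worth showing because its safety property is structural: each step is multiplicative, so no single step can produce zero no matter what target the model requests (names illustrative):

```typescript
const STEP_REDUCTION = 0.2;       // cut at most 20% per assessment cycle
const STEP_INTERVAL_MS = 120_000; // reassess every 2 minutes before cutting further

// Limit for the next cycle: move toward `target`, but never below 80%
// of `current`. Math.ceil keeps the result a positive integer.
function nextStagedLimit(current: number, target: number): number {
  const stepFloor = Math.ceil(current * (1 - STEP_REDUCTION));
  return Math.max(target, stepFloor);
}
```

Even if the model asks for 0 every cycle, the limit walks 1000 → 800 → 640 → … while humans get paged long before it matters.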
Lessons
1. AI systems acting on incomplete context will make confident wrong decisions. GPT-4 was not wrong about the traffic pattern. It was wrong about what was causing it — because we gave it metrics but not meaning. Business context (scheduled syncs, known patterns, recent deploys) is not optional for AI systems making real-time infrastructure decisions.
2. Every AI-driven system that can cause an outage needs a manual override reachable in under 60 seconds. We had no override because we assumed the AI would be right. That assumption cost us 23 minutes. Design for AI failure first.
3. Hard limits and floors are not an admission of failure — they are the architecture.
The model's recommendation is an input. Your safety rails are the system. An AI recommending
rps: 0 should never result in rps: 0 without a human in the loop.
Build that constraint in before you ship, not after your first 2 AM incident.