Our AI Rate Limiter Decided 0 Requests Per Second Was the Correct Limit at 2 AM
At 2:09 AM on a Saturday, our on-call phone started ringing. PagerDuty: API error rate 98%.
Within 30 seconds it was 100%. Every single request to our public API was returning 429.
Not throttled — blocked entirely. The rate limiter logs showed the active limit:
requests_per_second: 0. Our AI-powered adaptive rate limiter had decided
the correct number of requests to allow was zero. It had been wrong about what it was protecting us from.
The System We Built
Three months earlier we had shipped what felt like an elegant solution to a real problem. Static rate limits are too conservative during normal traffic and too permissive during genuine attacks. We wanted limits that adapted automatically: tighten when traffic looked anomalous, relax when it looked normal.
The system worked like this: every 60 seconds, a Lambda function pulled the last 10 minutes
of API traffic metrics — request volume, error rates, unique IPs, request distribution by
endpoint — and sent them to GPT-4 with a prompt that asked it to classify the traffic as
normal, elevated, suspicious, or attack,
and recommend a rate limit adjustment.
ADAPTIVE RATE LIMITER ARCHITECTURE
──────────────────────────────────────────────────────────────
API Gateway → CloudWatch metrics (rolling 10m window)
│
▼ (every 60s)
Lambda: pull metrics
│
▼
GPT-4 classification prompt
┌──────────────────────────────┐
│ Traffic data: { │
│ rps: 847, │
│ unique_ips: 12, │
│ top_endpoint: /sync (94%) │
│ error_rate: 0.3% │
│ } │
│ Classify and recommend limit │
└──────────────────────────────┘
│
▼
Apply limit to API Gateway usage plan
(via AWS SDK UpdateUsagePlan)
Normal: → maintain current limit (1000 rps)
Elevated: → reduce 20% (800 rps)
Suspicious: → reduce 50% (500 rps)
Attack: → reduce 90% (100 rps)
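The classification-to-limit mapping above can be sketched as follows. Names and rounding are illustrative, not our production code; note that the deployed Lambda also trusted the model's recommended_rps field directly rather than only this table.

```typescript
type Classification = 'normal' | 'elevated' | 'suspicious' | 'attack';

// Mirrors the table above: fraction of the current limit to cut.
const REDUCTION: Record<Classification, number> = {
  normal: 0.0,     // maintain current limit
  elevated: 0.2,   // reduce 20%
  suspicious: 0.5, // reduce 50%
  attack: 0.9,     // reduce 90%
};

function limitFor(classification: Classification, currentRps: number): number {
  return Math.round(currentRps * (1 - REDUCTION[classification]));
}
```

At a 1000 rps baseline this reproduces the table: 1000, 800, 500, 100.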
It worked well for 11 weeks: it correctly tightened limits during two genuine scraping attempts and loosened them after a product-launch traffic spike it identified as legitimate. We were proud of it. We stopped watching it closely.
What Happened at 2:07 AM
Our mobile app has a scheduled background sync that fires at 2:00 AM UTC on every device. It uploads user activity data, pulls down fresh content, and reconciles local state. With our user base concentrated between UTC+0 and UTC+5:30, that meant a burst of sync requests hitting our API between 2:00 AM and 2:15 AM UTC.
That night, we had pushed a mobile app update that expanded the sync payload — more endpoints hit, larger request bodies, higher request count per device. The burst was roughly 4x larger than the previous Saturday's 2 AM window. From the outside it looked like this:
{
"window": "2026-03-21T01:57:00Z to 2026-03-21T02:07:00Z",
"requests_per_second": 3847,
"vs_previous_saturday_same_window": "+412%",
"unique_ips": 8,
"unique_user_agents": 3,
"top_endpoint_pct": 94,
"top_endpoint": "/api/v2/sync",
"error_rate_pct": 0.2,
"avg_response_ms": 340,
"geographic_concentration": "high — 91% from 3 ASNs"
}
To a human with context, this is obviously a mobile sync burst: high volume, low unique IPs (they're NATed mobile users), concentrated on one endpoint, low error rate, consistent user agents. To GPT-4, given only the numbers and no knowledge of our mobile app's sync schedule, it looked like a textbook DDoS signature: massive volume spike, low IP diversity, single-endpoint concentration, coordinated timing.
{
"classification": "attack",
"confidence": 0.91,
"reasoning": "Traffic volume increased 412% versus baseline with high geographic and endpoint concentration. Low unique IP count with high request volume is consistent with botnet or amplification attack pattern. Recommend immediate aggressive rate limiting.",
"recommended_rps": 0,
"recommended_action": "Block all traffic until manual review"
}
The Lambda applied the recommendation. UpdateUsagePlan set the limit to 0.
Every request from every user — mobile sync, web dashboard, third-party integrations — started
returning 429. At 2:09 AM, PagerDuty fired.
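For reference, the shape of that call, as a sketch with an illustrative usage plan ID: API Gateway's UpdateUsagePlan takes JSON-patch operations, and the throttle rate lives at /throttle/rateLimit, with values sent as strings.

```typescript
// Builds the parameters the Lambda passed to UpdateUsagePlan.
// In AWS SDK v3 this object would go into an UpdateUsagePlanCommand
// from "@aws-sdk/client-api-gateway".
function buildRateLimitPatch(usagePlanId: string, rps: number) {
  return {
    usagePlanId,
    patchOperations: [
      { op: 'replace' as const, path: '/throttle/rateLimit', value: String(rps) },
    ],
  };
}
```

With rps set to 0, the string '0' goes out and the gateway rejects everything.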
The Response and the 23-Minute Outage
The on-call engineer woke up, looked at the rate limiter dashboard, saw rps: 0,
and immediately knew something was wrong. We had no manual override button — the assumption
had been that AI-managed limits wouldn't need one. Restoring the limit required finding the
Lambda, understanding the API Gateway usage plan structure, and manually calling
UpdateUsagePlan with a sane value.
That took 23 minutes at 2 AM. Full API outage for 23 minutes on a Saturday night. 1,847 mobile sync jobs failed and had to retry. 340 active web sessions got error screens. Three enterprise integrations logged failures and sent their on-call alerts.
What We Changed
We made four changes before re-enabling the system.
Hard floor on rate limit reductions. No single automated decision can reduce the limit by more than 70%. If the model recommends 0, it gets capped at 30% of current. Humans get paged for anything beyond that.
const MAX_SINGLE_REDUCTION = 0.70; // never cut more than 70% in one step

// `logger` and `notifyOnCall` are our existing observability helpers.
function applyRateLimitRecommendation(current: number, recommended: number): number {
  const floor = Math.ceil(current * (1 - MAX_SINGLE_REDUCTION)); // e.g. 1000 rps → floor of 300
  if (recommended < floor) {
    logger.warn({ current, recommended, floor }, 'Rate limit recommendation exceeded safety floor — capping');
    notifyOnCall('Rate limiter wanted to set RPS below safety floor. Capped. Manual review needed.');
    return floor;
  }
  return recommended;
}
Known traffic patterns as context. The prompt now includes a schedule of known high-volume events: mobile sync windows, nightly batch jobs, weekly report generation. GPT-4 can now reason about whether a spike matches a known pattern before classifying it.
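A minimal sketch of that schedule check, with illustrative event windows (the real schedule lives in config, and these names and multipliers are made up):

```typescript
interface KnownEvent {
  name: string;
  startUtcMinute: number; // minutes after 00:00 UTC
  endUtcMinute: number;
  expectedRpsMultiplier: number; // rough multiple of baseline traffic
}

// Illustrative entries only.
const KNOWN_EVENTS: KnownEvent[] = [
  { name: 'mobile background sync', startUtcMinute: 115, endUtcMinute: 140, expectedRpsMultiplier: 5 },
  { name: 'nightly batch export', startUtcMinute: 180, endUtcMinute: 210, expectedRpsMultiplier: 2 },
];

// Matches get serialized into the classification prompt so the model can
// weigh a spike against expected windows before labeling it an attack.
function matchingEvents(now: Date): KnownEvent[] {
  const minute = now.getUTCHours() * 60 + now.getUTCMinutes();
  return KNOWN_EVENTS.filter(e => minute >= e.startUtcMinute && minute <= e.endUtcMinute);
}
```

Had this existed on March 21, the 2:07 AM metrics would have arrived alongside "known event: mobile background sync, expected ~5x baseline."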
Manual override endpoint. A single authenticated API call can restore the rate limit to a safe default in under 2 seconds. The on-call runbook now starts with this step.
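The override itself is deliberately simple. A sketch, assuming a hypothetical applyLimit wrapper around UpdateUsagePlan and a hardcoded known-good default:

```typescript
const SAFE_DEFAULT_RPS = 1000; // our normal baseline limit

// One authenticated call, one hardcoded value: nothing for a tired
// engineer to figure out at 2 AM.
function handleOverride(isAuthorized: boolean, applyLimit: (rps: number) => void): number {
  if (!isAuthorized) {
    throw new Error('unauthorized');
  }
  applyLimit(SAFE_DEFAULT_RPS);
  return SAFE_DEFAULT_RPS;
}
```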
Staged rollout of limit changes. Reductions now happen in steps: 20% now, wait 2 minutes, assess again, reduce further if still anomalous. Zero can never be reached in a single step regardless of what the model recommends.
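The stepping logic is worth showing because its safety property is structural: each step is multiplicative, so no single step can produce zero no matter what target the model requests (names illustrative):

```typescript
const STEP_REDUCTION = 0.2;       // cut at most 20% per assessment cycle
const STEP_INTERVAL_MS = 120_000; // reassess every 2 minutes before cutting further

// Limit for the next cycle: move toward `target`, but never below 80%
// of `current`. Math.ceil keeps the result a positive integer.
function nextStagedLimit(current: number, target: number): number {
  const stepFloor = Math.ceil(current * (1 - STEP_REDUCTION));
  return Math.max(target, stepFloor);
}
```

Even if the model asks for 0 every cycle, the limit walks 1000 → 800 → 640 → … while humans get paged long before it matters.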
Lessons
1. AI systems acting on incomplete context will make confident wrong decisions. GPT-4 was not wrong about the traffic pattern. It was wrong about what was causing it — because we gave it metrics but not meaning. Business context (scheduled syncs, known patterns, recent deploys) is not optional for AI systems making real-time infrastructure decisions.
2. Every AI-driven system that can cause an outage needs a manual override reachable in under 60 seconds. We had no override because we assumed the AI would be right. That assumption cost us 23 minutes. Design for AI failure first.
3. Hard limits and floors are not an admission of failure — they are the architecture.
The model's recommendation is an input. Your safety rails are the system. An AI recommending
rps: 0 should never result in rps: 0 without a human in the loop.
Build that constraint in before you ship, not after your first 2 AM incident.