The AI Agent That Cleaned Up Our K8s Manifests and Crashed Production
March 15, 2026 · AI · 10 min read


At 2:47 AM my phone started screaming. PagerDuty alert — P1 — error rate spiked to 34%. P99 latency was 18 seconds. Pods were restarting every 90 seconds. We had 847 OOMKill events in the last four hours. The last deploy had been six hours ago and everything looked fine. I was about to learn the most expensive lesson of my career about trusting AI-generated infrastructure code.


The Tuesday Afternoon That Felt Like Progress

Our Kubernetes manifests were a mess. Four years of patches, rushed hotfixes, and copy-paste deployments had left us with 31 YAML files across 4 services — inconsistent label conventions, stale annotations, duplicate environment variables, and resource definitions that nobody could confidently explain anymore.

I decided to use Cursor's agent mode to clean it all up. The prompt felt reasonable:

Refactor these Kubernetes deployment manifests:
1. Standardize label conventions to app.kubernetes.io/* 
2. Remove duplicate and stale environment variables
3. Consolidate probe configurations to match our current health endpoints
4. Clean up any obviously wrong or overly conservative resource settings

The agent ran for about 12 minutes, touching 28 of the 31 files. The diff looked clean. Labels standardized. Dead env vars gone. Probe paths updated to /health. I did a quick review, opened a PR, got a thumbs-up from a teammate, and merged it.

CI passed. Staging looked fine. We deployed to production at 8:51 PM. Dashboards were nominal. I went to bed at midnight thinking we'd paid down some real tech debt.


False Assumptions at Every Layer

Here's what I believed going into this, and why every assumption was wrong:

Assumption 1: "The AI only touches what I asked it to touch."
The prompt said "clean up obviously wrong or overly conservative resource settings." The agent interpreted conservative as unnecessarily restrictive. It found our memory: "512Mi" limits on the API pods and — seeing that we'd had OOMKill events logged in comments inside the YAML — "helpfully" removed the limits entirely. No memory ceiling. No CPU throttle. Just vibes.

Assumption 2: "Staging would catch this."
Staging runs at 5% of production traffic. An OOM condition triggered by concurrent request spikes at 2 AM simply never manifested in staging's daytime load patterns.

Assumption 3: "A green CI run means the manifests are correct."
Our CI validates YAML syntax and runs kubectl diff against the cluster. Neither catches semantic regressions like missing resource limits. The manifests were syntactically perfect.

Assumption 4: "I reviewed the diff."
I reviewed 28 files worth of diff in about 8 minutes. At that rate, you're not reviewing — you're pattern-matching for obvious disasters. The resource limit removal was buried inside a larger block refactor that changed indentation and key ordering simultaneously. My eyes skipped right over it.


The 2 AM Investigation

I pulled up the cluster. The picture was grim:

CLUSTER STATE — 02:47 AM

api-service          847 OOMKill events / 4 hours
                     Memory: 2.1GB consumed (no ceiling)
                     Restart count: 134 pods
                     Status: CrashLoopBackOff on 9/12 replicas

worker-service       Memory: 1.8GB (no ceiling)  
                     Restart count: 89 pods
                     Throttled CPU: 0% (CPU limit also removed)

HPA (Horizontal Pod Autoscaler)
                     Trying to scale from 12 → 40 replicas
                     Pods starting, OOMKilling, terminating
                     Faster than readiness probes can pass

Node memory pressure: 3/5 worker nodes in MemoryPressure=True
                     kubelet evicting pods aggressively

First theory: a memory leak introduced in the same deploy. I pulled the application code diff — the code changes were minor, a two-line tweak to a database query timeout. No new allocations. I ruled it out in 8 minutes.

Second theory: traffic spike. I checked the ingress metrics. Traffic at 2 AM was actually lower than average — about 60% of daytime volume. Ruled out in 3 minutes.

Third theory: something in the manifest diff. I finally opened the full git diff and started reading carefully. Line 847 of the diff:

 resources:
-  limits:
-    memory: "512Mi"
-    cpu: "500m"
-  requests:
-    memory: "256Mi"
-    cpu: "250m"
+  requests:
+    memory: "256Mi"
+    cpu: "250m"

There it was. The limits block — gone. Four services, twelve deployments, all missing their memory and CPU ceilings. The agent had seen our old OOMKill comments (# TODO: OOMKilled twice in Nov — investigate memory usage) and concluded the limits were the problem. Classic AI reasoning failure: correlation treated as causation.


Root Cause: Unbounded Memory + Kubernetes Node Pressure Cascade

Without memory limits, each pod was free to consume as much node RAM as it wanted. Our application has a background job that processes webhooks in batches — normally bounded by the 512Mi limit to about 200 concurrent in-memory payloads. Without the limit, it processed all 1,400 queued webhooks simultaneously, allocating 2.1GB per pod instance.
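
Back-of-envelope, assuming the 2.1GB was mostly in-flight payload data, those numbers are self-consistent:

```python
# What the observed numbers imply about per-payload memory cost.
total_bytes = 2.1 * 1024**3       # ~2.1 GiB observed per unbounded pod
queued = 1_400                    # webhooks processed at once with no limit
per_payload_mib = total_bytes / queued / 1024**2
print(round(per_payload_mib, 2))  # → 1.54 (MiB per in-memory payload)

# Sanity check against the old bounded behaviour: ~200 concurrent payloads
# under the 512Mi limit.
bounded_mib = 200 * per_payload_mib
print(round(bounded_mib))         # → 307 (MiB — comfortably under 512Mi)
```

Which is why the 512Mi limit was never the bug: it was the backpressure mechanism keeping the batch size sane.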

When node memory pressure crossed the threshold, kubelet started evicting pods — not just our pods, but also system components on those nodes. The eviction triggered the HPA to spin up replacement pods. Those replacement pods hit the same webhook queue and immediately consumed 2GB of RAM each. The nodes hit MemoryPressure again. Eviction loop.

THE CASCADE

  Pod starts
      │
      ▼
  Webhook worker: no limit → 2.1GB RAM
      │
      ▼
  Node MemoryPressure threshold crossed
      │
      ▼
  kubelet evicts pod (OOMKill)
      │
      ▼
  HPA sees fewer replicas → spins new pod
      │
      └──────────────────────────────────┐
                                         ▼
                              New pod hits same queue
                              2.1GB RAM in 45 seconds
                              Node MemoryPressure again
                              (loop repeats, 134 times)

The 34% error rate wasn't random — it was exactly the percentage of pods that were in the CrashLoopBackOff state at any given moment, unable to serve traffic.


The Fix: Two Parts, 22 Minutes

The immediate fix was a targeted rollback of the resource limits — not a full deploy rollback, which would have re-broken the probe paths we'd actually fixed correctly:

# Patch resource limits back in-place without a full rollback
kubectl patch deployment api-service -n production --type=json -p='[
  {
    "op": "add",
    "path": "/spec/template/spec/containers/0/resources/limits",
    "value": {"memory": "512Mi", "cpu": "500m"}
  }
]'

# Repeat for all affected deployments
# Then force a rolling restart to kill the OOMing pods
kubectl rollout restart deployment/api-service -n production

Twelve minutes after applying patches to all four services, error rate dropped from 34% to 0.2%. P99 latency recovered from 18,000ms to 180ms. Total incident duration: 94 minutes from first alert.

The longer-term fix was adding a LimitRange to the namespace — a cluster-level safety net that enforces default limits on any pod that doesn't specify them:

apiVersion: v1
kind: LimitRange
metadata:
  name: default-limits
  namespace: production
spec:
  limits:
  - type: Container
    default:          # applied if limits not specified
      memory: "512Mi"
      cpu: "500m"
    defaultRequest:   # applied if requests not specified
      memory: "256Mi"
      cpu: "250m"
    max:
      memory: "2Gi"
      cpu: "2000m"

Now even if a future AI agent, intern, or sleep-deprived engineer removes resource limits, the namespace-level policy kicks in. Pods can't go fully unbounded.
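
The defaulting behaviour is easy to model. A simplified sketch of what admission-time defaulting does to a limit-less container (the real controller applies default and defaultRequest as separate fields and also enforces max; this collapses the defaults into one table for illustration):

```python
# Defaults mirroring the LimitRange above.
DEFAULTS = {
    "limits":   {"memory": "512Mi", "cpu": "500m"},
    "requests": {"memory": "256Mi", "cpu": "250m"},
}

def apply_limitrange(container: dict) -> dict:
    """Fill in missing resources fields the way LimitRange defaulting would."""
    resources = container.setdefault("resources", {})
    for field, default in DEFAULTS.items():
        resources.setdefault(field, dict(default))
    return container

# A container that shipped without limits — exactly our failure mode:
pod_container = {"name": "api",
                 "resources": {"requests": {"memory": "256Mi", "cpu": "250m"}}}
print(apply_limitrange(pod_container)["resources"]["limits"])
# → {'memory': '512Mi', 'cpu': '500m'}
```

The existing requests are left untouched; only the missing limits block gets filled in.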


What We Changed in the AI-Assisted Workflow

I didn't stop using AI for infrastructure refactoring. I stopped using it unsafely.

1. Explicit negative-space prompting. Every infra refactor prompt now includes a "do not touch" list: "Do not modify resource limits, requests, security contexts, or RBAC rules under any circumstances." Vague instructions produce vague boundaries.

2. Automated diff validation for critical fields. We added a CI check that diffs the YAML and exits non-zero if resources.limits or resources.requests blocks are removed from any deployment:

# .github/workflows/manifest-guard.yml
- name: Check resource limits not removed
  run: |
    git diff origin/main...HEAD -- '**/*.yaml' | grep '^-.*limits:' && {
      echo "ERROR: Resource limits removed from manifest. Review required."
      exit 1
    } || echo "Resource limits intact"

3. Scoped AI tasks. "Refactor all 31 files" is too broad for an AI agent touching production infrastructure. We now scope tasks to single files or single concern areas — never "clean up everything" in one pass.

4. LimitRange as the last line of defence. Applied namespace-wide now. If a manifest ever ships without limits again, the cluster itself enforces a sane ceiling.


Lessons Learned

Metrics from this incident:
— 94 minutes total downtime
— 847 OOMKill events over four hours
— 134 pod restarts in the crash loop
— P99 latency peak: 18,000ms (baseline: 180ms)
— Error rate peak: 34%
— Recovery after patch: 12 minutes
— Files touched by AI agent: 28 of 31
— Critical removals missed in review: 12 resources blocks across 12 deployments in 4 services

The failure mode here wasn't that AI is unreliable. It was that I gave an AI agent ambiguous authority over safety-critical configuration and then rubber-stamped a 28-file diff in 8 minutes. Those are human process failures.

The scariest part? The AI's reasoning was internally consistent. It saw OOMKill history, concluded limits were causing it, and removed them. A junior engineer following the same logic might have done exactly the same thing. The difference is a junior engineer would ask before deleting the safety rails. The agent just did it.

AI coding tools are force multipliers — which means they multiply both your good decisions and your bad prompts. Give them vague latitude on production infrastructure, and you will pay for it at 2:47 AM.
