The AI Agent That Cleaned Up Our K8s Manifests and Crashed Production
At 2:47 AM my phone started screaming. PagerDuty alert — P1 — error rate spiked to 34%. P99 latency was 18 seconds. Pods were restarting every 90 seconds. We had 847 OOMKill events in the last four hours. The last deploy had been six hours ago and everything looked fine. I was about to learn the most expensive lesson of my career about trusting AI-generated infrastructure code.
The Tuesday Afternoon That Felt Like Progress
Our Kubernetes manifests were a mess. Four years of patches, rushed hotfixes, and copy-paste deployments had left us with 31 YAML files across 4 services — inconsistent label conventions, stale annotations, duplicate environment variables, and resource definitions that nobody could confidently explain anymore.
I decided to use Cursor's agent mode to clean it all up. The prompt felt reasonable:
Refactor these Kubernetes deployment manifests:
1. Standardize label conventions to app.kubernetes.io/*
2. Remove duplicate and stale environment variables
3. Consolidate probe configurations to match our current health endpoints
4. Clean up any obviously wrong or overly conservative resource settings
The agent ran for about 12 minutes, touching 28 of the 31 files. The diff looked clean.
Labels standardized. Dead env vars gone. Probe paths updated to /health.
I did a quick review, opened a PR, got a thumbs-up from a teammate, and merged it.
CI passed. Staging looked fine. We deployed to production at 8:51 PM. Dashboards were nominal. I went to bed at midnight thinking we'd paid down some real tech debt.
False Assumptions at Every Layer
Here's what I believed going into this, and why every assumption was wrong:
Assumption 1: "The AI only touches what I asked it to touch."
The prompt said "clean up obviously wrong or overly conservative resource settings."
The agent interpreted conservative as unnecessarily restrictive.
It found our memory: "512Mi" limits on the API pods and — seeing that
we'd had OOMKill events logged in comments inside the YAML — "helpfully" removed
the limits entirely. No memory ceiling. No CPU throttle. Just vibes.
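Reconstructed from the git diff we read later that night, the container spec the agent left behind looked roughly like this:

```yaml
# After the agent's "cleanup": requests survive, but the limits block is gone,
# so nothing caps the container at 512Mi / 500m anymore — it can consume
# as much node memory and CPU as the scheduler's node has to give.
resources:
  requests:
    memory: "256Mi"
    cpu: "250m"
```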
Assumption 2: "Staging would catch this."
Staging runs at 5% of production traffic. An OOM condition triggered by concurrent
request spikes at 2 AM simply never manifested in staging's daytime load patterns.
Assumption 3: "A green CI run means the manifests are correct."
Our CI validates YAML syntax and runs kubectl diff as a dry-run check against the cluster.
Neither catches semantic regressions like missing resource limits.
The manifests were syntactically perfect.
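A CI step that checks for the presence of limits, not just valid syntax, closes this gap. Here's a minimal sketch against a sample manifest — it assumes one container per file and uses plain grep; a real check should use a YAML-aware tool:

```shell
# Sketch: flag any manifest that defines a resources block but no limits.
# Plain grep, not a YAML parser — assumes one container per file.
cat > /tmp/sample-deploy.yaml <<'EOF'
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api-service
spec:
  template:
    spec:
      containers:
      - name: api
        resources:
          requests:
            memory: "256Mi"
            cpu: "250m"
EOF

if grep -q 'resources:' /tmp/sample-deploy.yaml && ! grep -q 'limits:' /tmp/sample-deploy.yaml; then
  echo "FAIL: container has resources but no limits block"
fi
```

In CI you'd `exit 1` instead of echoing, so the missing-limits manifest fails the pipeline the same way a syntax error would.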
Assumption 4: "I reviewed the diff."
I reviewed 28 files worth of diff in about 8 minutes. At that rate, you're not reviewing —
you're pattern-matching for obvious disasters. The resource limit removal was buried inside
a larger block refactor that changed indentation and key ordering simultaneously.
My eyes skipped right over it.
The 2 AM Investigation
I pulled up the cluster. The picture was grim:
CLUSTER STATE — 02:47 AM

api-service       847 OOMKill events / 4 hours
                  Memory: 2.1GB consumed (no ceiling)
                  Restart count: 134 pods
                  Status: CrashLoopBackOff on 9/12 replicas

worker-service    Memory: 1.8GB (no ceiling)
                  Restart count: 89 pods
                  Throttled CPU: 0% (CPU limit also removed)

HPA (Horizontal Pod Autoscaler)
                  Trying to scale from 12 → 40 replicas
                  Pods starting, OOMKilling, terminating
                  faster than readiness probes can pass

Nodes             3/5 worker nodes in MemoryPressure=True
                  kubelet evicting pods aggressively
First theory: a memory leak introduced in the same deploy. I pulled the application code diff — the code changes were minor, a two-line tweak to a database query timeout. No new allocations. I ruled it out in 8 minutes.
Second theory: traffic spike. I checked the ingress metrics. Traffic at 2 AM was actually lower than average — about 60% of daytime volume. Ruled out in 3 minutes.
Third theory: something in the manifest diff. I finally opened the full git diff and started reading carefully. Line 847 of the diff:
   resources:
-    limits:
-      memory: "512Mi"
-      cpu: "500m"
-    requests:
-      memory: "256Mi"
-      cpu: "250m"
+    requests:
+      memory: "256Mi"
+      cpu: "250m"
There it was. The limits block — gone. Four services, twelve deployments, all missing
their memory and CPU ceilings. The agent had seen our old OOMKill comments
(# TODO: OOMKilled twice in Nov — investigate memory usage) and concluded
the limits were the problem. Classic AI reasoning failure: correlation treated as causation.
Root Cause: Unbounded Memory + Kubernetes Node Pressure Cascade
Without memory limits, each pod was free to consume as much node RAM as it wanted. Our application has a background job that processes webhooks in batches — normally bounded by the 512Mi limit to about 200 concurrent in-memory payloads. Without the limit, it processed all 1,400 queued webhooks simultaneously, allocating 2.1GB per pod instance.
When node memory pressure crossed the threshold, kubelet started evicting pods — not just our pods, but also system components on those nodes. The eviction triggered the HPA to spin up replacement pods. Those replacement pods hit the same webhook queue and immediately consumed 2GB of RAM each. The nodes hit MemoryPressure again. Eviction loop.
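The MemoryPressure threshold that drove the eviction loop isn't in our manifests at all — it's kubelet configuration. Assuming the defaults are unchanged (ours were), the relevant settings look like this as a KubeletConfiguration fragment:

```yaml
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
# Default hard-eviction thresholds: once available node memory drops below
# this line, the node reports MemoryPressure and kubelet starts evicting pods.
evictionHard:
  memory.available: "100Mi"
  nodefs.available: "10%"
```

With unbounded 2.1GB pods, every node was permanently camped just above that 100Mi line, so each new pod the HPA scheduled tipped its node straight back into eviction.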
THE CASCADE
Pod starts
│
▼
Webhook worker: no limit → 2.1GB RAM
│
▼
Node MemoryPressure threshold crossed
│
▼
kubelet evicts pod (OOMKill)
│
▼
HPA sees fewer replicas → spins new pod
│
└──────────────────────────────────┐
▼
New pod hits same queue
2.1GB RAM in 45 seconds
Node MemoryPressure again
(loop repeats, 134 times)
The 34% error rate wasn't random — it tracked the fraction of pods sitting in CrashLoopBackOff at any given moment, unable to serve traffic.
The Fix: Two Parts, 22 Minutes
The immediate fix was a targeted rollback of the resource limits — not a full deploy rollback, which would have re-broken the probe paths we'd actually fixed correctly:
# Patch resource limits back in-place without a full rollback
kubectl patch deployment api-service -n production --type=json -p='[
  {
    "op": "add",
    "path": "/spec/template/spec/containers/0/resources/limits",
    "value": {"memory": "512Mi", "cpu": "500m"}
  }
]'

# Repeat for all affected deployments
# Then force a rolling restart to kill the OOMing pods
kubectl rollout restart deployment/api-service -n production
Twelve minutes after applying patches to all four services, error rate dropped from 34% to 0.2%. P99 latency recovered from 18,000ms to 180ms. Total incident duration: 94 minutes from first alert.
The longer-term fix was adding a LimitRange to the namespace — a cluster-level
safety net that enforces default limits on any pod that doesn't specify them:
apiVersion: v1
kind: LimitRange
metadata:
  name: default-limits
  namespace: production
spec:
  limits:
  - type: Container
    default:            # applied if limits not specified
      memory: "512Mi"
      cpu: "500m"
    defaultRequest:     # applied if requests not specified
      memory: "256Mi"
      cpu: "250m"
    max:
      memory: "2Gi"
      cpu: "2000m"
Now even if a future AI agent, intern, or sleep-deprived engineer removes resource limits, the namespace-level policy kicks in. Pods can't go fully unbounded.
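The defaulting happens at admission time: submit a Deployment whose containers have no resources stanza at all, and what the API server stores for each pod looks roughly like this — the old 512Mi ceiling is the worst case, not unbounded:

```yaml
# Container spec as stored when the manifest omitted resources entirely —
# the LimitRange defaults are injected by the admission controller
resources:
  limits:
    memory: "512Mi"
    cpu: "500m"
  requests:
    memory: "256Mi"
    cpu: "250m"
```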
What We Changed in the AI-Assisted Workflow
I didn't stop using AI for infrastructure refactoring. I stopped using it unsafely.
1. Explicit negative-space prompting. Every infra refactor prompt now includes a "do not touch" list: "Do not modify resource limits, requests, security contexts, or RBAC rules under any circumstances." Vague instructions produce vague boundaries.
2. Automated diff validation for critical fields. We added a CI check that
diffs the YAML and exits non-zero if resources.limits or resources.requests
blocks are removed from any deployment:
# .github/workflows/manifest-guard.yml
- name: Check resource limits not removed
  # Assumes the checkout step fetched enough history to diff against
  # origin/main (e.g. actions/checkout with fetch-depth: 0)
  run: |
    if git diff origin/main...HEAD -- '**/*.yaml' | grep -qE '^-[[:space:]]*(limits|requests):'; then
      echo "ERROR: Resource limits or requests removed from manifest. Review required."
      exit 1
    fi
    echo "Resource limits intact"
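The same check runs locally before a PR ever goes up. A minimal sketch against a simulated diff hunk — the regex is the assumption here; tune it to your indentation conventions:

```shell
# Simulate a diff hunk that strips a limits block, then scan it
# the way the CI guard does: any removed limits/requests line trips it.
sample_diff='   resources:
-    limits:
-      memory: "512Mi"
-      cpu: "500m"
     requests:'

if printf '%s\n' "$sample_diff" | grep -qE '^-[[:space:]]*(limits|requests):'; then
  echo "guard would fail this PR"
fi
```

Running it against your actual branch is just `git diff origin/main...HEAD -- '**/*.yaml'` piped into the same grep.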
3. Scoped AI tasks. "Refactor all 31 files" is too broad for an AI agent touching production infrastructure. We now scope tasks to single files or single concern areas — never "clean up everything" in one pass.
4. LimitRange as the last line of defence. Applied namespace-wide now. If a manifest ever ships without limits again, the cluster itself enforces a sane ceiling.
Lessons Learned
Metrics from this incident:
— 94 minutes total downtime
— 847 OOMKill events across 4 services
— 134 pod restarts in the crash loop
— P99 latency peak: 18,000ms (baseline: 180ms)
— Error rate peak: 34%
— Recovery after patch: 12 minutes
— Files touched by AI agent: 28 of 31
— Critical removals missed in review: 12 resources blocks across 4 deployments
The failure mode here wasn't that AI is unreliable. It was that I gave an AI agent ambiguous authority over safety-critical configuration and then rubber-stamped a 28-file diff in 8 minutes. Those are human process failures.
The scariest part? The AI's reasoning was internally consistent. It saw OOMKill history, concluded limits were causing it, and removed them. A junior engineer following the same logic might have done exactly the same thing. The difference is a junior engineer would ask before deleting the safety rails. The agent just did it.
AI coding tools are force multipliers — which means they multiply both your good decisions and your bad prompts. Give them vague latitude on production infrastructure, and you will pay for it at 2:47 AM.