ALL POSTS

25 posts so far.

March 16, 2026 · Database · 9 min read

ECS Autoscaling Fought Our Postgres max_connections at 2AM and Postgres Won

We scaled to 38 ECS tasks during a flash sale. Each task held 10 Postgres connections. Our RDS instance allowed 170. The math was never going to work.

March 15, 2026 · AI · 10 min read

The AI Agent That Cleaned Up Our K8s Manifests and Crashed Production

We let a Cursor AI agent refactor our Kubernetes deployment files to remove boilerplate. Six hours later, 34% of requests were failing as pods were OOMKilled faster than they could restart.

March 15, 2026 · AI · 10 min read

We Upgraded Our Embedding Model and Our RAG Pipeline Returned Wrong Results for 6 Days

We upgraded from text-embedding-ada-002 to text-embedding-3-large without re-embedding our 2.3M documents. Cosine similarity searches silently returned wrong content for six days — valid JSON, HTTP 200, completely wrong answers.

March 16, 2026 · Security · 9 min read

We Found Our .env File in 47 Public Forks After a Junior Dev's First Open Source PR

A junior developer forked our private repo to submit a bug fix, unknowingly committed our .env file, and GitHub indexed it. We had production credentials exposed in 47 public forks before anyone noticed.

March 18, 2026 · AI · 10 min read

We Set temperature=0 and GPT-4 Still Gave Different Answers — Our Entire CI Pipeline Broke

We built an automated code review pipeline that used GPT-4 with temperature=0 to enforce consistent output. After OpenAI silently updated the model behind the same API endpoint, our determinism assumption collapsed — tests started flipping between pass and fail on identical inputs, and we couldn't reproduce failures locally.

March 17, 2026 · AI · 9 min read

Our OpenAI Bill Went From $23 to $4,200 in 48 Hours — A Missing Stop Sequence Did It

We built a feedback-processing pipeline that used GPT-4 to categorise and summarise user feedback. A single missing stop sequence caused the model to loop indefinitely, generating 40 million tokens of circular output over a holiday weekend while our alerts stayed quiet.

March 16, 2026 · AI · 11 min read

Our AI Documentation Bot Invented 14 API Routes That Never Existed — 6,000 Users Integrated Against Them

We shipped an LLM-powered documentation assistant trained on our API docs. Within three weeks, it had confidently hallucinated 14 non-existent endpoints. Developers built integrations against them. Support tickets arrived. We had to choose between breaking those integrations or actually building the routes the AI had promised.

March 15, 2026 · CI/CD · 9 min read

How a GitHub Actions Cache Hit Skipped Our Tests and Shipped a Regression to 12,000 Users

Our CI pipeline showed green for six consecutive deploys while never running the new test files we added. A cache key tied only to package-lock.json silently restored stale compiled test artifacts — new tests never compiled, never ran, and a broken discount-code checkout reached production for 6 hours.

March 15, 2026 · Database · 10 min read

How a NOT NULL Column Migration Locked Our Users Table for 14 Minutes

A routine schema migration to add a NOT NULL column with a default value triggered a full table rewrite in Postgres, holding an exclusive lock on 2.4 million rows and taking our entire platform offline for 14 minutes at 10 AM on a Monday.

March 15, 2026 · Python · 10 min read

How SQLAlchemy's Identity Map Served Stale Data to 23,000 API Requests

We managed SQLAlchemy sessions manually in Flask, skipping Flask-SQLAlchemy. Forgetting one line — Session.remove() — turned the ORM's per-thread identity map into a stale-data cache that silently returned outdated records for six hours.

March 14, 2026 · Architecture · 10 min read

How AWS SQS Visibility Timeout Caused the Same Order to Be Processed 847 Times

A production war story about how a 30-second SQS visibility timeout turned a slow order processor into a duplicate-charge machine — and how we fixed it with heartbeats and a distributed lock.

March 14, 2026 · Architecture · 10 min read

How a Race Condition in Our Cron Job Sent 2.3 Million Duplicate Emails in One Night

A nightly email digest cron job was running on two servers simultaneously without a distributed lock — what started as a minor scheduling overlap turned into a 2.3 million email catastrophe that got our domain blacklisted before sunrise.

March 14, 2026 · Architecture · 9 min read

How Next.js 15's Full Route Cache Served Stale Prices at Checkout for 3 Hours

After migrating a SaaS checkout flow to Next.js 15 App Router, our price display layer silently served cached values — not the live database prices — costing us 3 hours of confused customers and 19 manual refunds.

March 14, 2026 · Mobile · 9 min read

How a Single Power User's Post Triggered 45,000 DB Queries and Crashed Our Mobile API

A synchronous push notification fanout loop for a user with 45,000 followers exhausted our Flask database connection pool in 90 seconds, failing 62% of mobile requests for 3 hours.

March 13, 2026 · Security · 10 min read

How Rotating a JWT Secret Logged Out 34,000 Users and Exposed a Session Design Flaw

A routine security rotation invalidated every active session simultaneously, triggered a support flood, and revealed that our JWT architecture had no graceful degradation path whatsoever.

March 13, 2026 · Docker · 9 min read

How a DigitalOcean Firewall Rule Silently Dropped 23% of Production Traffic for 11 Days

Intermittent user timeouts, normal server metrics, and zero firewall logs — how a stateless firewall rule was killing TCP connections before they reached Nginx, and why it took eleven days to find it.

March 12, 2026 · Architecture · 10 min read

How a Redis Cache Key Missing One Field Leaked Client Data Across Tenants for 72 Hours

A SaaS platform cached API responses by resource ID alone — when two tenants happened to share the same integer ID, one client spent three days reading another's confidential records.

March 12, 2026 · Docker · 8 min read

How a Redis Connection Leak Crashed Our AWS ECS Cluster at 3AM

A Redis client spawned inside getServerSideProps accumulated 8,847 open connections over six hours, OOM-killed every ECS task, and took the service down for 47 minutes before we found the root cause.

March 11, 2026 · React · 9 min read

How a Missing useCallback Triggered 10,000 API Requests Per Minute in Production

A React search component's unstable function reference created an infinite useEffect loop that sent 10,400 req/min to our backend until the rate-limiter started blacklisting our own users.

March 11, 2026 · Python · 9 min read

The Shared State Trap: How a FastAPI 'Optimisation' Leaked User Data

We replaced Flask's request-scoped g with a plain dict during migration. Under async concurrency, that dict silently served one tenant's data to a completely different user.

March 10, 2026 · NodeJS · 11 min read

The Invisible Bottleneck: How One Sync Call Froze Our Node.js API

A single fs.readFileSync buried in a utility function seemed harmless in development — in production under real traffic, it silently froze every request in the system for 700ms at a time.

March 9, 2026 · Architecture · 13 min read

We Killed the PHP Monolith. It Took 18 Months and One Client's Data.

What looked like a clean strangler-fig migration turned into 18 months of session bridges, soft-delete mismatches, and hard lessons about the implicit contracts hiding inside every legacy codebase.

March 7, 2026 · Database · 12 min read

The Friday Deploy That Taught Me to Respect PostgreSQL

A four-line SQL query worked perfectly in development — in production, with 8 million rows, it held the database hostage for 47 minutes and took down an entire SaaS platform on a Friday afternoon.

March 5, 2026 · CI/CD · 9 min read

We Deployed on a Friday. Here's What Happened Next.

A production deploy at 4:30 PM on a Friday turned a routine release into a 6-hour incident — and permanently changed how I think about automation and discipline.

March 3, 2026 · ElasticSearch · 14 min read

The Night the Cluster Went Silent

It was 11:47 PM. Search was down. 40,000 users couldn't find anything. This is the story of how a single shard misconfiguration quietly ate our cluster — and what we learned after rebuilding it from scratch.