Postgres Internals 106: Scaling Writes

March 26, 2026 · 11 min read · Part 13 / 19

Read replicas buy you a lot of headroom on the read side. Writes are different — every INSERT, UPDATE, and DELETE goes to one primary. That's the bottleneck. This is Part 6 of the Postgres Internals series — how to scale write throughput with partitioning, how to stop doing 10,000 round trips when one would do, and how to handle concurrent upserts without race conditions.

The write scaling mental model

Before touching any pattern, it helps to know which tier of problem you're solving:

Write volume low     → Simple INSERT, no batching needed
Write volume medium  → Batch inserts, UPSERT for idempotency
Write volume high    → Partition by time/key, batch COPY, async writes
Write volume extreme → Queue in front (Kafka/SQS), Postgres as sink not source

Most applications live in the medium tier and never need anything beyond batching and UPSERT. But when you hit high volume — event tracking, audit logs, IoT ingestion, analytics writes — partitioning is what keeps Postgres from becoming the thing you rewrite your architecture around.

Table partitioning

Partitioning splits one logical table into physical sub-tables. Postgres routes inserts and queries automatically. The most common split is by time — range partitioning on a date column.

CREATE TABLE events (
  id         BIGSERIAL,
  created_at TIMESTAMPTZ NOT NULL,
  user_id    BIGINT,
  type       TEXT,
  payload    JSONB
) PARTITION BY RANGE (created_at);

-- Create partition for March 2026
CREATE TABLE events_2026_03
  PARTITION OF events
  FOR VALUES FROM ('2026-03-01') TO ('2026-04-01');

-- Create partition for April 2026
CREATE TABLE events_2026_04
  PARTITION OF events
  FOR VALUES FROM ('2026-04-01') TO ('2026-05-01');

Every INSERT into events automatically lands in the correct partition. Queries with a date filter only scan the relevant partition — not the full table.

events (partitioned by created_at)
├── events_2026_01  (Jan — 40M rows, archived)
├── events_2026_02  (Feb — 38M rows, archived)
├── events_2026_03  (Mar — 12M rows, active)
└── events_2026_04  (Apr — empty, ready)

Why partitioning helps writes

The main write benefit is vacuum efficiency. Postgres autovacuum runs per-partition, not on the full table. Old partitions with no writes get vacuumed once and never touched again. On an unpartitioned 100M-row table, every vacuum pass has the whole table as its working set. On a partitioned table, nearly all the work lands on the active partition.

For archival, you can detach old partitions without a DELETE scan:

-- Detach January — instant, no table scan, no bloat
ALTER TABLE events DETACH PARTITION events_2026_01;

-- Move to cold storage, drop, or archive
DROP TABLE events_2026_01;

The partition timing trap

Partitions don't create themselves. If you forget to create next month's partition, inserts start failing at midnight on the 1st with a "no partition of relation found for row" error. Either script partition creation in advance (a cron job that runs on the 25th of each month), or use pg_partman, which handles this automatically.

-- pg_partman: create 3 months of future partitions automatically
SELECT partman.create_parent(
  p_parent_table := 'public.events',
  p_control      := 'created_at',
  p_type         := 'range',
  p_interval     := '1 month',
  p_premake      := 3  -- keep 3 future partitions ready
);
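If you'd rather script it than adopt the extension, the cron job is a few lines. A minimal sketch that emits the DDL for the next few monthly partitions (the naming scheme is assumed to match the examples above; actually executing the statements against the database is left out):

```python
from datetime import date

def add_months(d: date, n: int) -> date:
    # Advance a first-of-month date by n months, handling year rollover.
    total = d.year * 12 + (d.month - 1) + n
    return date(total // 12, total % 12 + 1, 1)

def partition_ddl(parent: str, today: date, premake: int = 3) -> list[str]:
    """Generate CREATE TABLE ... PARTITION OF statements for the next
    `premake` months. IF NOT EXISTS makes reruns idempotent."""
    start = today.replace(day=1)
    stmts = []
    for i in range(1, premake + 1):
        lo, hi = add_months(start, i), add_months(start, i + 1)
        name = f"{parent}_{lo.year}_{lo.month:02d}"
        stmts.append(
            f"CREATE TABLE IF NOT EXISTS {name} "
            f"PARTITION OF {parent} "
            f"FOR VALUES FROM ('{lo}') TO ('{hi}');"
        )
    return stmts

# Run from cron on the 25th; execute each statement against Postgres.
for stmt in partition_ddl("events", date(2026, 3, 25)):
    print(stmt)
```

Each run keeps three future partitions ready, mirroring pg_partman's p_premake behaviour.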

Batch inserts

Never insert one row at a time in a loop. Every INSERT is a network round trip, a transaction open/commit, and a WAL write. At 10,000 rows that's 10,000 of each.

# ❌ Slow — 10,000 round trips
for row in data:
    await conn.execute("INSERT INTO events (user_id, type, payload) VALUES ($1, $2, $3)", *row)

# ✅ Fast — one round trip per 1000-row batch
BATCH_SIZE = 1000
for i in range(0, len(data), BATCH_SIZE):
    batch = data[i:i + BATCH_SIZE]
    await conn.executemany(
        "INSERT INTO events (user_id, type, payload) VALUES ($1, $2, $3)",
        batch
    )
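asyncpg's executemany pipelines the whole batch, but not every driver does; collapsing a batch into one multi-row INSERT statement works with any driver that supports positional parameters. A sketch (the helper name is mine, not a library API):

```python
def multirow_insert(table: str, columns: list[str],
                    rows: list[tuple]) -> tuple[str, list]:
    """Build one INSERT ... VALUES ($1, $2, ...), ($4, $5, ...) statement
    plus the flattened argument list to bind against it."""
    ncols = len(columns)
    groups = []
    for i in range(len(rows)):
        base = i * ncols
        placeholders = ", ".join(f"${base + j + 1}" for j in range(ncols))
        groups.append(f"({placeholders})")
    sql = (
        f"INSERT INTO {table} ({', '.join(columns)}) "
        f"VALUES {', '.join(groups)}"
    )
    args = [value for row in rows for value in row]
    return sql, args

sql, args = multirow_insert("events", ["user_id", "type", "payload"],
                            [(1, "click", "{}"), (2, "view", "{}")])
# sql:  INSERT INTO events (user_id, type, payload)
#       VALUES ($1, $2, $3), ($4, $5, $6)
# then: await conn.execute(sql, *args)
```

Keep each statement under a few thousand placeholders; very wide parameter lists hit protocol limits before they hit performance ones.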

For bulk loads (hundreds of thousands of rows), COPY is faster still. It bypasses the query planner, skips per-row overhead, and streams data directly into the table:

import io

# Build CSV in memory
csv_data = io.StringIO()
for row in data:
    csv_data.write(f"{row['user_id']}\t{row['type']}\t{row['payload']}\n")
csv_data.seek(0)

# Stream into Postgres — no round trips, no per-row overhead
async with conn.transaction():
    await conn.copy_to_table(
        'events',
        source=csv_data,
        columns=['user_id', 'type', 'payload'],
        format='text'
    )

Batch size trade-offs

Batch too small  → Many round trips, overhead dominates
Batch too large  → Long transactions, large WAL, extended locks
Sweet spot       → 500–1000 rows per batch for most workloads

Start at 1000 and tune down if you see lock contention or replication lag spikes. The batch job generating WAL faster than replicas can consume it is a real failure mode.

UPSERT with ON CONFLICT

The pattern: insert if the row doesn't exist, update if it does. The naive approach is a SELECT then an INSERT-or-UPDATE — two queries, a round trip, and a race condition if two threads run concurrently. ON CONFLICT collapses this into one atomic statement.

INSERT INTO user_stats (user_id, total_clicks)
VALUES (42, 1)
ON CONFLICT (user_id)
DO UPDATE SET
  total_clicks = user_stats.total_clicks + EXCLUDED.total_clicks,
  updated_at   = NOW();

EXCLUDED refers to the row that would have been inserted. This reads as: if user 42 already exists, add the new click count to their existing total. If they don't exist, insert them with total_clicks = 1. It's atomic — no race condition, no duplicate key errors.

The two variants

-- DO NOTHING: idempotent event logging (duplicate events silently ignored)
INSERT INTO processed_events (event_id, processed_at)
VALUES ($1, NOW())
ON CONFLICT (event_id) DO NOTHING;

-- DO UPDATE: merge / increment / overwrite
INSERT INTO user_stats (user_id, total_clicks)
VALUES ($1, $2)
ON CONFLICT (user_id)
DO UPDATE SET total_clicks = EXCLUDED.total_clicks;

The high-concurrency trap

Under extreme write concurrency, ON CONFLICT DO UPDATE can produce deadlocks. A single-row upsert against one key is safe — contenders simply queue on the row lock. The deadlock scenario is transactions that each upsert several keys, in different orders: each acquires a lock on its first row, then blocks waiting for a lock another transaction already holds. Postgres detects the cycle and kills one transaction.

If you see this pattern in your logs:

ERROR: deadlock detected
DETAIL: Process 1234 waits for ShareLock on transaction 5678;
        Process 5678 waits for ShareLock on transaction 1234.

The mitigation for counter increments at high throughput: don't upsert in realtime. Write events to a queue or a staging table, then flush-aggregate periodically:

-- Flush staging table into stats every 10 seconds
INSERT INTO user_stats (user_id, total_clicks)
SELECT user_id, COUNT(*) FROM click_staging
GROUP BY user_id
ON CONFLICT (user_id)
DO UPDATE SET total_clicks = user_stats.total_clicks + EXCLUDED.total_clicks;

-- Then truncate staging
TRUNCATE click_staging;
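The same collapse works application-side before anything reaches Postgres: fold a window of raw events into one row per key, then issue a single batched upsert. A sketch of the aggregation step (the final write is assumed to be the ON CONFLICT statement above):

```python
from collections import Counter

def aggregate_clicks(events: list[dict]) -> list[tuple[int, int]]:
    """Collapse raw click events into (user_id, click_count) pairs:
    one upsert per user instead of one per click."""
    counts = Counter(e["user_id"] for e in events)
    # Sorted output gives every flush a consistent lock order,
    # which prevents deadlock cycles between concurrent flushers.
    return sorted(counts.items())

rows = aggregate_clicks([
    {"user_id": 42}, {"user_id": 7}, {"user_id": 42}, {"user_id": 42},
])
# rows == [(7, 1), (42, 3)] — feed into the batched ON CONFLICT statement
```

A flush every few seconds turns thousands of contended single-row upserts into one ordered batch.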

Putting it together

Most write scaling problems follow the same escalation path:

  1. Single row inserts in a loop → batch them. Immediate 10–50× throughput improvement with one code change.
  2. Race conditions on concurrent writes → use ON CONFLICT. Eliminate the SELECT-then-INSERT pattern entirely.
  3. Table bloat, slow vacuum, large scans → partition by time. Let old data age out cleanly.
  4. Write throughput still not enough → queue writes (Kafka, SQS), let Postgres consume asynchronously. At this point you've left typical web app territory.
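The tier-four shape — a queue in front, Postgres as sink — boils down to a drain loop: pull up to one batch's worth of messages, write them with one statement, repeat. A minimal in-memory sketch of the drain policy (a real deployment would read from Kafka or SQS, and the flush would be the COPY or batched INSERT shown earlier):

```python
from queue import Queue, Empty

def drain(q: Queue, max_batch: int = 1000) -> list:
    """Pull up to max_batch items without blocking. The consumer then
    flushes each batch as one write instead of one write per message."""
    batch = []
    while len(batch) < max_batch:
        try:
            batch.append(q.get_nowait())
        except Empty:
            break
    return batch

q = Queue()
for i in range(2500):
    q.put({"event": i})

batches = []
while True:
    b = drain(q)
    if not b:
        break
    batches.append(b)
# 2500 messages drain as batches of 1000, 1000, 500
```

The batch cap bounds transaction size and WAL burst per flush; an empty drain is the signal to sleep or block on the real queue.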

Quiz

Q1. You partition an events table by month. A developer runs SELECT * FROM events WHERE user_id = 99 — no date filter. What happens?

Postgres scans every partition. Without a filter on the partition key (created_at), the planner cannot eliminate any partition, so it appends a scan of each one — potentially worse than an unpartitioned table, because the per-partition planning and executor overhead adds up.

This is called partition pruning failure. Fix: create an index on user_id on the partitioned table (Postgres propagates it to every partition), or redesign the query to include a date constraint. If the access pattern is always by user_id with no time bound, hash partitioning on user_id would be a better choice.

Q2. You're importing 500,000 rows from a CSV. A junior dev writes a loop with one INSERT per row. Name two faster approaches and why each is faster.

Batch INSERT (executemany / multi-row VALUES): Reduces 500,000 round trips to ~500. Each batch is one network call, one transaction, one WAL flush. The per-row overhead (parsing, planning, committing) drops from 500,000× to 500×.

COPY: Streams rows in a single protocol stream, bypassing per-statement parsing and planning entirely. Constraints and indexes still apply per row, but the statement-level overhead disappears and WAL is written far more efficiently in bulk. Typically 3–10× faster than batched INSERT for bulk loads.

Q3. Two servers simultaneously run an UPSERT for page_id = 5 where views = 10. Is this safe?

Yes, it's safe — and the result is deterministic. ON CONFLICT DO UPDATE is atomic at the row level. The first server to execute acquires a row lock. The second server waits. When the first commits, the second sees the updated row and applies its own increment.

The result is always views = 12 — not 11 (lost update) and not a duplicate key error. This is precisely what makes UPSERT valuable over SELECT-then-INSERT.

The caveat: at extreme concurrency with many processes all trying to update the same row simultaneously, this serialisation causes lock queuing and throughput drops. That's when you move to the batch-flush pattern.

Q4. A payment service upserts a payment_status table. Under load, deadlocks appear in logs. The UPSERT looks correct individually. What's the likely cause?

Multiple rows being upserted in different orders across concurrent transactions. If transaction A upserts rows [payment_1, payment_2] and transaction B upserts [payment_2, payment_1], they each acquire a lock on their first row and then wait on the second — classic deadlock cycle.

The fix: ensure all transactions upsert rows in a consistent order (e.g., sorted by payment_id ascending). Consistent ordering eliminates the cycle — transaction B will always wait on payment_1 first, which transaction A holds and will release before moving to payment_2.

# Sort before batch upsert to prevent deadlock
payments.sort(key=lambda p: p['id'])
await upsert_batch(payments)

Part 6 done. Next up — Part 7: Transactions & Concurrency. Isolation levels, what they actually protect against, deadlocks, and the race condition hiding in your application right now.
