Postgres Internals 106: Scaling Writes
Read replicas buy you a lot of headroom on the read side. Writes are different — every INSERT, UPDATE, and DELETE goes to one primary. That's the bottleneck. This is Part 6 of the Postgres Internals series — how to scale write throughput with partitioning, how to stop doing 10,000 round trips when one would do, and how to handle concurrent upserts without race conditions.
The write scaling mental model
Before touching any pattern, it helps to know which tier of problem you're solving:
Write volume low → Simple INSERT, no batching needed
Write volume medium → Batch inserts, UPSERT for idempotency
Write volume high → Partition by time/key, batch COPY, async writes
Write volume extreme → Queue in front (Kafka/SQS), Postgres as sink not source
Most applications live in the medium tier and never need anything beyond batching and UPSERT. But when you hit high volume — event tracking, audit logs, IoT ingestion, analytics writes — partitioning is what keeps Postgres from becoming the thing you rewrite your architecture around.
Table partitioning
Partitioning splits one logical table into physical sub-tables. Postgres routes inserts and queries automatically. The most common split is by time — range partitioning on a date column.
CREATE TABLE events (
id BIGSERIAL,
created_at TIMESTAMPTZ NOT NULL,
user_id BIGINT,
type TEXT,
payload JSONB
) PARTITION BY RANGE (created_at);
-- Create partition for March 2026
CREATE TABLE events_2026_03
PARTITION OF events
FOR VALUES FROM ('2026-03-01') TO ('2026-04-01');
-- Create partition for April 2026
CREATE TABLE events_2026_04
PARTITION OF events
FOR VALUES FROM ('2026-04-01') TO ('2026-05-01');
Every INSERT into events automatically lands in the correct partition. Queries with a date filter
only scan the relevant partition — not the full table.
events (partitioned by created_at)
├── events_2026_01 (Jan — 40M rows, archived)
├── events_2026_02 (Feb — 38M rows, archived)
├── events_2026_03 (Mar — 12M rows, active)
└── events_2026_04 (Apr — empty, ready)
Why partitioning helps writes
The main write benefit is vacuum efficiency. Autovacuum runs per partition, not on the full table. Old partitions that no longer receive writes get vacuumed once and then skipped. On an unpartitioned table with 100M rows, every vacuum cycle has to consider the whole table; on a partitioned table, the work concentrates on the active partition.
For archival, you can detach old partitions without a DELETE scan:
-- Detach January — instant, no table scan, no bloat
ALTER TABLE events DETACH PARTITION events_2026_01;
-- Move to cold storage, drop, or archive
DROP TABLE events_2026_01;
The partition timing trap
Partitions don't create themselves. If you forget to create next month's partition, inserts start failing at midnight on the 1st — Postgres rejects each row with a "no partition of relation found for row" error. Either script partition creation in advance (a cron job that runs on the 25th of each month), or use pg_partman, which handles this automatically.
-- pg_partman: create 3 months of future partitions automatically
SELECT partman.create_parent(
    p_parent_table := 'public.events',
    p_control := 'created_at',
    p_type := 'native',  -- declarative partitioning (pg_partman v4; parameter removed in v5)
    p_interval := 'monthly',
    p_premake := 3       -- keep 3 future partitions ready
);
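If you'd rather script it yourself than install pg_partman, the cron job mentioned above only needs a little date arithmetic. A minimal sketch — the `events` table name follows the earlier examples, and executing the generated DDL (via psql or an asyncpg connection) is left out:

```python
from datetime import date

def next_month(d: date) -> date:
    # First day of the month after d (handles December rollover)
    return date(d.year + (d.month // 12), d.month % 12 + 1, 1)

def partition_ddl(parent: str, start: date) -> str:
    # DDL for one monthly range partition: [start, first of next month)
    end = next_month(start)
    name = f"{parent}_{start.year}_{start.month:02d}"
    return (
        f"CREATE TABLE IF NOT EXISTS {name} PARTITION OF {parent} "
        f"FOR VALUES FROM ('{start}') TO ('{end}')"
    )

# Run from cron on the 25th: create next month's partition ahead of time
print(partition_ddl("events", next_month(date.today())))
```

IF NOT EXISTS makes the job idempotent, so re-running the cron is harmless.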
Batch inserts
Never insert one row at a time in a loop. Every INSERT is a network round trip, a transaction open/commit, and a WAL write. At 10,000 rows that's 10,000 of each.
# ❌ Slow — 10,000 round trips
for row in data:
    await conn.execute(
        "INSERT INTO events (user_id, type, payload) VALUES ($1, $2, $3)",
        *row
    )
# ✅ Fast — 1 round trip, batched in chunks of 1000
BATCH_SIZE = 1000
for i in range(0, len(data), BATCH_SIZE):
    batch = data[i:i + BATCH_SIZE]
    await conn.executemany(
        "INSERT INTO events (user_id, type, payload) VALUES ($1, $2, $3)",
        batch
    )
For bulk loads (hundreds of thousands of rows), COPY is faster still. It bypasses the query
planner, skips per-row overhead, and streams data directly into the table:
import io
# Build tab-separated rows in memory (COPY text format uses tab delimiters)
buf = io.StringIO()
for row in data:
    buf.write(f"{row['user_id']}\t{row['type']}\t{row['payload']}\n")
buf.seek(0)
# Stream into Postgres — no round trips, no per-row overhead
async with conn.transaction():
    await conn.copy_to_table(
        'events',
        source=buf,
        columns=['user_id', 'type', 'payload'],
        format='text'
    )
Batch size trade-offs
Batch too small → Many round trips, overhead dominates
Batch too large → Long transactions, large WAL, extended locks
Sweet spot → 500–1000 rows per batch for most workloads
Start at 1000 and tune down if you see lock contention or replication lag spikes. The batch job generating WAL faster than replicas can consume it is a real failure mode.
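One way to watch for that lag on the primary (assuming physical streaming replication and Postgres 10+, where `pg_stat_replication` exposes LSN positions per replica):

```sql
-- Replication lag per replica, in bytes of WAL not yet replayed
SELECT client_addr,
       pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) AS lag_bytes
FROM pg_stat_replication;
```

If lag_bytes climbs during your batch job and doesn't recover between batches, shrink the batch size or throttle the job.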
UPSERT with ON CONFLICT
The pattern: insert if the row doesn't exist, update if it does. The naive approach is a SELECT then
an INSERT-or-UPDATE — two queries, a round trip, and a race condition if two threads run concurrently.
ON CONFLICT collapses this into one atomic statement.
INSERT INTO user_stats (user_id, total_clicks)
VALUES (42, 1)
ON CONFLICT (user_id)
DO UPDATE SET
total_clicks = user_stats.total_clicks + EXCLUDED.total_clicks,
updated_at = NOW();
EXCLUDED refers to the row that would have been inserted. This reads as: if user 42
already exists, add the new click count to their existing total. If they don't exist, insert them with
total_clicks = 1. It's atomic — no race condition, no duplicate key errors.
The two variants
-- DO NOTHING: idempotent event logging (duplicate events silently ignored)
INSERT INTO processed_events (event_id, processed_at)
VALUES ($1, NOW())
ON CONFLICT (event_id) DO NOTHING;
-- DO UPDATE: merge / increment / overwrite
INSERT INTO user_stats (user_id, total_clicks)
VALUES ($1, $2)
ON CONFLICT (user_id)
DO UPDATE SET total_clicks = EXCLUDED.total_clicks;
The high-concurrency trap
Under extreme write concurrency, ON CONFLICT DO UPDATE can produce deadlocks. The scenario: concurrent transactions each upsert several keys, but in different orders. Each one locks its first row, then blocks waiting on a row another transaction already holds. Postgres detects the cycle and kills one transaction.
If you see this pattern in your logs:
ERROR: deadlock detected
DETAIL: Process 1234 waits for ShareLock on transaction 5678;
Process 5678 waits for ShareLock on transaction 1234.
The mitigation for counter increments at high throughput: don't upsert in realtime. Write events to a queue or a staging table, then flush-aggregate periodically:
-- Flush staging table into stats every 10 seconds
INSERT INTO user_stats (user_id, total_clicks)
SELECT user_id, COUNT(*) FROM click_staging
GROUP BY user_id
ON CONFLICT (user_id)
DO UPDATE SET total_clicks = user_stats.total_clicks + EXCLUDED.total_clicks;
-- Then truncate staging
TRUNCATE click_staging;
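When a short loss window is acceptable, the same flush-aggregate idea works in application memory instead of a staging table: collapse the event stream to one row per key, then issue one batched upsert per flush interval. A sketch, assuming a stream of user_id click events and an asyncpg-style connection with the `user_stats` table from earlier:

```python
from collections import Counter

def aggregate_clicks(user_ids: list[int]) -> list[tuple[int, int]]:
    # Collapse a stream of user_id events into (user_id, click_count) rows,
    # so the flush runs one upsert per user instead of one per click.
    # Sorted output gives a consistent lock order across concurrent flushers.
    return sorted(Counter(user_ids).items())

async def flush(conn, user_ids: list[int]) -> None:
    # One batched UPSERT per flush interval
    await conn.executemany(
        """
        INSERT INTO user_stats (user_id, total_clicks)
        VALUES ($1, $2)
        ON CONFLICT (user_id)
        DO UPDATE SET total_clicks = user_stats.total_clicks + EXCLUDED.total_clicks
        """,
        aggregate_clicks(user_ids),
    )
```

The trade-off versus the staging table: events buffered in memory are lost if the process dies before the flush, so use the SQL staging version when every click must survive a crash.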
Putting it together
Most write scaling problems follow the same escalation path:
- Single row inserts in a loop → batch them. Immediate 10–50× throughput improvement with one code change.
- Race conditions on concurrent writes → use ON CONFLICT. Eliminate the SELECT-then-INSERT pattern entirely.
- Table bloat, slow vacuum, large scans → partition by time. Let old data age out cleanly.
- Write throughput still not enough → queue writes (Kafka, SQS), let Postgres consume asynchronously. At this point you've left typical web app territory.
Quiz
Q1. You partition an events table by month. A developer runs SELECT * FROM events WHERE user_id = 99 — no date filter. What happens?
Postgres scans every partition. Without a filter on the partition key (created_at), the planner cannot eliminate any partition, so it appends scans of all of them — potentially worse than an unpartitioned table, because the per-partition planning and scan setup overhead adds up. This is a partition pruning miss. Fix: create an index on user_id on the parent table (Postgres propagates it to every partition), or include a date bound in the query. If the access pattern is always by user_id with no time bound, hash partitioning on user_id would be a better fit.
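For reference, the hash-partitioned alternative might look like this — same schema as the earlier `events` example, with four buckets chosen arbitrarily for illustration:

```sql
CREATE TABLE events (
    id BIGSERIAL,
    created_at TIMESTAMPTZ NOT NULL,
    user_id BIGINT,
    type TEXT,
    payload JSONB
) PARTITION BY HASH (user_id);

CREATE TABLE events_p0 PARTITION OF events FOR VALUES WITH (MODULUS 4, REMAINDER 0);
CREATE TABLE events_p1 PARTITION OF events FOR VALUES WITH (MODULUS 4, REMAINDER 1);
CREATE TABLE events_p2 PARTITION OF events FOR VALUES WITH (MODULUS 4, REMAINDER 2);
CREATE TABLE events_p3 PARTITION OF events FOR VALUES WITH (MODULUS 4, REMAINDER 3);
```

A `WHERE user_id = 99` query now prunes down to exactly one partition — but you give up the instant DETACH-based archival that time-range partitioning provides.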
Q2. You're importing 500,000 rows from a CSV. A junior dev writes a loop with one INSERT per row. Name two faster approaches and why each is faster.
Batch INSERT (executemany / multi-row VALUES): Reduces 500,000 round trips to ~500. Each batch is one network call, one transaction, one WAL flush. The per-row overhead (parsing, planning, committing) drops from 500,000× to 500×.
COPY: Streams data in one continuous protocol stream — the statement is parsed and planned once, not per row. Indexes and constraints are still maintained, but the per-row statement overhead disappears and WAL is flushed once at commit rather than per statement. Typically 3–10× faster than batched INSERT for bulk loads.
Q3. Two servers simultaneously run an UPSERT that increments views by 1 for page_id = 5, whose current views count is 10. Is this safe?
Yes, it's safe — and the result is deterministic. ON CONFLICT DO UPDATE is atomic at the row level. The first server to execute acquires a row lock. The second server waits. When the first commits, the second sees the updated row and applies its own increment.
The result is always views = 12 — not 11 (lost update) and not a duplicate key error. This is precisely what makes UPSERT valuable over SELECT-then-INSERT.
The caveat: at extreme concurrency with many processes all trying to update the same row simultaneously, this serialisation causes lock queuing and throughput drops. That's when you move to the batch-flush pattern.
Q4. A payment service upserts a payment_status table. Under load, deadlocks appear in logs. The UPSERT looks correct individually. What's the likely cause?
Multiple rows being upserted in different orders across concurrent transactions. If transaction A upserts rows [payment_1, payment_2] and transaction B upserts [payment_2, payment_1], they each acquire a lock on their first row and then wait on the second — classic deadlock cycle.
The fix: ensure all transactions upsert rows in a consistent order (e.g., sorted by payment_id ascending). Consistent ordering eliminates the cycle — transaction B queues behind A on payment_1, and because A acquires its locks in the same order, A runs to commit, releases both locks, and B proceeds.
# Sort before batch upsert to prevent deadlock
payments.sort(key=lambda p: p['id'])
await upsert_batch(payments)
Part 6 done. Next up — Part 7: Transactions & Concurrency. Isolation levels, what they actually protect against, deadlocks, and the race condition hiding in your application right now.