Staff Prep 05: Scaling Postgres Writes — Batching, UPSERT & Partitioning
Back to Part 04: Scaling Reads. Reads scale horizontally with replicas. Writes do not — they all go to the primary. So when your primary is write-saturated, you have fewer levers: batch more, write less, and partition to distribute I/O. This post covers every write-side technique worth knowing at the Staff level.
The cost of single-row inserts
Every INSERT in Postgres involves: acquiring a row lock, writing to the WAL, writing to the heap, updating all indexes, and flushing WAL to disk (on commit). For a single row insert followed by an immediate commit, you pay that flush cost — typically 1–5ms per commit on a fast SSD.
At 1,000 events/second, single-row inserts mean 1,000 WAL flushes/second. At roughly 1 ms per flush, that is the ceiling — at 1,001 events/second, you start queueing.
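The arithmetic above is worth internalizing. A back-of-envelope sketch (the flush latency is an assumed figure, not a measurement):

```python
# Commit throughput is bounded by WAL flush latency; batching amortizes
# the flush across many rows per commit.
def max_rows_per_second(flush_ms: float, rows_per_commit: int) -> int:
    commits_per_second = 1000 / flush_ms
    return int(commits_per_second * rows_per_commit)

# Single-row commits on a disk with a 2 ms flush:
print(max_rows_per_second(2.0, 1))    # 500
# Batching 100 rows per commit on the same disk:
print(max_rows_per_second(2.0, 100))  # 50000
```

The same hardware goes from 500 rows/second to 50,000 purely by changing how often you pay the flush.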
Batch inserts: the correct implementation
import asyncpg

# Slow: 1000 round trips, 1000 commits
async def insert_events_slow(conn: asyncpg.Connection, events: list[dict]):
    for event in events:
        await conn.execute(
            "INSERT INTO events (user_id, type, payload) VALUES ($1, $2, $3)",
            event["user_id"], event["type"], event["payload"]
        )
# Fast: one batched round trip, one implicit transaction
async def insert_events_fast(conn: asyncpg.Connection, events: list[dict]):
    # executemany pipelines a single prepared statement over many parameter sets
    await conn.executemany(
        "INSERT INTO events (user_id, type, payload) VALUES ($1, $2, $3)",
        [(e["user_id"], e["type"], e["payload"]) for e in events]
    )
# Fastest: COPY protocol (asyncpg uses the binary format)
async def insert_events_copy(conn: asyncpg.Connection, events: list[dict]):
    await conn.copy_records_to_table(
        "events",
        records=[(e["user_id"], e["type"], e["payload"]) for e in events],
        columns=["user_id", "type", "payload"]
    )
# COPY streams rows past per-statement parse/plan overhead, but it still
# fires triggers and checks foreign keys — it is not a constraint bypass
Benchmark results on a typical cloud Postgres (db.t3.medium):
- Single-row inserts: ~500 rows/second
- executemany with 1000 rows: ~50,000 rows/second
- COPY with 1000 rows: ~200,000 rows/second
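For very large event lists, it also helps to cap how many records a single executemany or COPY call carries, so memory use and transaction size stay bounded. A minimal sketch with a hypothetical chunked helper:

```python
def chunked(items: list, size: int):
    """Yield successive fixed-size slices of items."""
    for i in range(0, len(items), size):
        yield items[i:i + size]

# Usage sketch (insert_events_copy as defined above):
#   for batch in chunked(events, 10_000):
#       await insert_events_copy(conn, batch)
print(list(chunked([1, 2, 3, 4, 5], 2)))  # [[1, 2], [3, 4], [5]]
```

The right chunk size is workload-dependent; the point is only that one call should never be handed an unbounded list.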
UPSERT semantics: INSERT ON CONFLICT
UPSERT (INSERT with ON CONFLICT) is deceptively subtle. The wrong implementation causes silent lost updates and phantom conflicts.
-- Table with a unique constraint on (user_id, date)
CREATE TABLE daily_stats (
    user_id BIGINT NOT NULL,
    date DATE NOT NULL,
    event_count INT NOT NULL DEFAULT 0,
    PRIMARY KEY (user_id, date)
);
-- Correct UPSERT: increment counter atomically
INSERT INTO daily_stats (user_id, date, event_count)
VALUES (42, '2026-04-04', 1)
ON CONFLICT (user_id, date)
DO UPDATE SET
event_count = daily_stats.event_count + EXCLUDED.event_count;
-- EXCLUDED refers to the row that was proposed but conflicted
-- Wrong: SET event_count = EXCLUDED.event_count
-- This would overwrite the existing count with the new value
-- instead of incrementing it — silent data loss
-- DO NOTHING variant: idempotent insert
INSERT INTO processed_events (event_id, processed_at)
VALUES ('abc123', NOW())
ON CONFLICT (event_id) DO NOTHING;
-- Safe to call multiple times — only inserts once
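One more batching caveat: a single INSERT ... ON CONFLICT DO UPDATE statement fails with "cannot affect row a second time" if two rows in the batch hit the same key. Pre-aggregate in the application first — a sketch assuming a simplified event shape with user_id and date fields:

```python
from collections import Counter

def aggregate_daily_counts(events: list[dict]) -> list[tuple]:
    # Collapse duplicates so each (user_id, date) appears at most once
    # per batch; Postgres rejects a batched UPSERT that would update
    # the same row twice in one statement.
    counts = Counter((e["user_id"], e["date"]) for e in events)
    return sorted((uid, day, n) for (uid, day), n in counts.items())

rows = aggregate_daily_counts([
    {"user_id": 42, "date": "2026-04-04"},
    {"user_id": 42, "date": "2026-04-04"},
    {"user_id": 7,  "date": "2026-04-04"},
])
print(rows)  # [(7, '2026-04-04', 1), (42, '2026-04-04', 2)]
```

Each tuple then becomes one parameter set for the increment UPSERT shown above.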
Deferred constraints and bulk writes
When bulk-inserting rows that reference each other (parent-child in the same batch), foreign key constraints block you. Deferred constraints let you insert the batch first and check constraints at commit time.
-- Define the FK as deferrable
ALTER TABLE order_items
ADD CONSTRAINT fk_order_items_order
FOREIGN KEY (order_id) REFERENCES orders(id)
DEFERRABLE INITIALLY DEFERRED;
-- Now you can insert items before the parent order in the same transaction
BEGIN;
SET CONSTRAINTS fk_order_items_order DEFERRED;
INSERT INTO order_items (order_id, product_id, qty) VALUES (999, 1, 2); -- order 999 doesn't exist yet
INSERT INTO orders (id, user_id, total) VALUES (999, 42, 29.99); -- now it does
COMMIT; -- FK check happens here — passes
Write buffering with background workers
For high-throughput event streams (clickstream, metrics, audit logs), writing synchronously to Postgres on every request is unnecessary. Buffer writes in memory or Redis, flush in batches.
import asyncio
import json

import redis.asyncio as redis

r = redis.Redis(host="redis", port=6379)

# API: push event to Redis list (non-blocking, sub-millisecond)
async def record_event(user_id: int, event_type: str, payload: dict):
    event = {"user_id": user_id, "type": event_type, "payload": payload}
    await r.rpush("events:buffer", json.dumps(event))

# Background worker: flush buffer to Postgres every 5 seconds
async def flush_events_worker(db_pool):
    while True:
        await asyncio.sleep(5)
        # Atomically pop up to 1000 events (pipeline wraps both commands
        # in MULTI/EXEC, so no event is read twice or skipped)
        pipe = r.pipeline()
        pipe.lrange("events:buffer", 0, 999)
        pipe.ltrim("events:buffer", 1000, -1)
        results = await pipe.execute()
        events = [json.loads(e) for e in results[0]]
        if events:
            async with db_pool.acquire() as conn:
                await conn.copy_records_to_table(
                    "events",
                    records=[(e["user_id"], e["type"], json.dumps(e["payload"]))
                             for e in events],
                    columns=["user_id", "type", "payload"]
                )
Trade-off: if the process crashes between the Redis push and the Postgres flush, those buffered events are lost. Acceptable for analytics events; unacceptable for order creation. Know your data loss tolerance before applying this pattern.
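The buffering mechanics themselves are independent of Redis. A stripped-down in-process sketch — a hypothetical WriteBuffer class with only the size-triggered flush; a production version would add the time-based flush and concurrency control:

```python
class WriteBuffer:
    """Accumulate events; hand full batches to a flush callback."""

    def __init__(self, flush_fn, max_size: int = 1000):
        self._flush_fn = flush_fn
        self._max_size = max_size
        self._buf: list = []

    def add(self, event) -> None:
        self._buf.append(event)
        if len(self._buf) >= self._max_size:
            self.flush()

    def flush(self) -> None:
        if self._buf:
            batch, self._buf = self._buf, []
            self._flush_fn(batch)  # e.g. a COPY into Postgres

batches = []
buf = WriteBuffer(batches.append, max_size=3)
for i in range(7):
    buf.add(i)
buf.flush()  # drain the remainder
print(batches)  # [[0, 1, 2], [3, 4, 5], [6]]
```

The same crash-safety caveat applies: anything sitting in _buf when the process dies is gone.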
Partitioning for write distribution
Table partitioning does not directly distribute writes across servers — it distributes them across physical storage segments. The benefit: inserts go into the current partition only, autovacuum works on smaller tables, and old partitions can be dropped (not deleted) — a near-instant operation.
-- Range partition by month (common for time-series data)
CREATE TABLE events (
    id BIGSERIAL,
    user_id BIGINT NOT NULL,
    type TEXT NOT NULL,
    created_at TIMESTAMPTZ NOT NULL DEFAULT NOW()
) PARTITION BY RANGE (created_at);
-- Create monthly partitions
CREATE TABLE events_2026_04 PARTITION OF events
FOR VALUES FROM ('2026-04-01') TO ('2026-05-01');
CREATE TABLE events_2026_05 PARTITION OF events
FOR VALUES FROM ('2026-05-01') TO ('2026-06-01');
-- Drop old data in O(1) — no DELETE scan needed
DROP TABLE events_2024_01; -- instant, regardless of row count
-- Automate with pg_partman extension or a cron job
-- Attach a new partition before the current one fills up
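The cron-job route boils down to a date computation plus a DDL string. A sketch with a hypothetical partition_ddl helper, reusing the events table and monthly naming scheme from above:

```python
from datetime import date

def partition_ddl(month_start: date) -> str:
    # The first day of the following month is the exclusive upper bound.
    if month_start.month == 12:
        upper = date(month_start.year + 1, 1, 1)
    else:
        upper = date(month_start.year, month_start.month + 1, 1)
    name = f"events_{month_start:%Y_%m}"
    return (
        f"CREATE TABLE IF NOT EXISTS {name} PARTITION OF events "
        f"FOR VALUES FROM ('{month_start}') TO ('{upper}')"
    )

print(partition_ddl(date(2026, 6, 1)))
# CREATE TABLE IF NOT EXISTS events_2026_06 PARTITION OF events
#   FOR VALUES FROM ('2026-06-01') TO ('2026-07-01')
```

Run it from a scheduled job a month or two ahead of time; IF NOT EXISTS makes re-runs harmless. pg_partman does the same bookkeeping with retention policies included.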
Quiz: test your understanding
Before moving on, answer these in your head (or out loud):
- Why are single-row inserts with individual commits expensive in Postgres? What is the specific bottleneck?
- In ON CONFLICT DO UPDATE SET count = EXCLUDED.count, what does EXCLUDED refer to? What is wrong with this approach for incrementing a counter?
- You need to insert 10,000 rows per second into an events table. Walk through your options from slowest to fastest, with trade-offs for each.
- What is the difference between deleting 5 million old events with DELETE WHERE created_at < '2024-01-01' vs dropping a partition? Why does one cause table bloat and the other does not?
- Your write-buffering worker crashes. The Redis buffer has 50,000 events. What happens? How would you make this crash-safe?
Next up — Part 06: Transactions & Concurrency. Isolation levels, deadlocks, FOR UPDATE, and the race conditions that haunt production systems.