Staff Prep 27: Distributed Systems — CAP Theorem, Eventual Consistency & Conflict Resolution
Every engineer has heard of the CAP theorem. Most recite it wrong. The real version: during a network partition you must choose between Consistency and Availability. You do not get both. The third property, Partition Tolerance, is not optional — you either handle partitions or your system goes down when the network misbehaves. Here is how to reason about distributed consistency at Staff level.
The real CAP theorem
Eric Brewer's original claim: in the presence of a network partition, a distributed system can guarantee either Consistency (every read returns the most recent write or an error) or Availability (every request receives a non-error response, though it may be stale). Not both.
The key phrase is "in the presence of a partition." Without a partition, you can have both. Most of the time your network is fine and this is not relevant. CAP only forces a trade-off when nodes cannot communicate.
What this means practically:
- CP systems (ZooKeeper, etcd, HBase; Postgres with synchronous replication) — during a partition, they refuse writes or return errors rather than serve stale data
- AP systems (Cassandra, DynamoDB, CouchDB) — during a partition, they serve potentially stale data rather than return errors
Consistency models (there is a spectrum)
CAP's "consistency" means linearisability — the strongest model. But there is a whole spectrum:
Strongest → Weakest
Linearisability — reads always see latest write, globally ordered
Sequential — operations appear in some global order (not necessarily real-time)
Causal — causally related operations are seen in order
Read-your-writes — you always see your own writes
Eventual — given no new writes, all replicas converge eventually
Most real systems sit somewhere in the middle. Postgres with synchronous replication (synchronous_commit=on and all reads served by the primary) is linearisable within one shard. Postgres with async replicas is read-your-writes on the primary, eventual on the replicas.
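Read-your-writes can also be approximated at the application layer: pin a session's reads to the primary for a short window after it writes, and let everything else hit the lagging replicas. A minimal sketch — the 5-second pin window is an assumed bound on replica lag, not a universal constant:

```python
import time

class SessionRouter:
    """Route a session's reads to the primary for a short window after
    it writes, so users always see their own writes despite replica lag."""

    PIN_SECONDS = 5.0  # assumed worst-case replica lag

    def __init__(self, primary, replicas):
        self.primary = primary      # dict-like store standing in for the primary
        self.replicas = replicas    # list of dict-like replica stores
        self.last_write = {}        # session_id -> time of last write

    def write(self, session_id, key, value):
        self.primary[key] = value
        self.last_write[session_id] = time.monotonic()

    def read(self, session_id, key):
        # Sessions that wrote recently are "pinned" to the primary;
        # everyone else reads a (possibly stale) replica.
        pinned = (time.monotonic()
                  - self.last_write.get(session_id, float("-inf"))
                  < self.PIN_SECONDS)
        store = self.primary if pinned else self.replicas[0]
        return store.get(key)
```

The same idea underlies "sticky" read routing in connection poolers: the guarantee is per-session, not global, which is exactly what read-your-writes promises.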
Eventual consistency in practice
Eventual consistency means: if you stop writing, all replicas will eventually agree. It says nothing about how long that takes or what you see in the meantime.
The classic problem: two users update the same record on different nodes during a partition. When the partition heals, both updates exist. Who wins?
Three common resolution strategies:
1. Last Write Wins (LWW)
— Use wall-clock timestamp. Highest timestamp wins.
— Problem: clocks drift. You can lose writes.
— Used by: Cassandra (default), DynamoDB global tables
2. Version Vectors (Vector Clocks)
— Each node tracks a counter for each other node.
— [nodeA: 3, nodeB: 2] vs [nodeA: 2, nodeB: 3] = concurrent conflict
— Surface the conflict to the application to resolve.
— Used by: Riak; the original Amazon Dynamo paper used vector clocks
3. CRDTs (Conflict-free Replicated Data Types)
— Data structures mathematically guaranteed to merge without conflicts.
— Examples: G-Counter (increment only), OR-Set (add/remove set), LWW-Register
— Used by: Redis (some types), collaborative editors (Figma, Notion)
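The three strategies boil down to three different merge functions. A minimal sketch of each, using the [nodeA: 3, nodeB: 2] example from above (the version vectors are plain dicts; this is illustrative, not any particular database's implementation):

```python
# 1. Last Write Wins: highest timestamp survives; the losing write is
#    silently discarded — and clock skew can make the "wrong" one lose.
def lww_merge(a, b):
    # a, b are (value, wall_clock_timestamp) pairs
    return a if a[1] >= b[1] else b

# 2. Version vector comparison: detect whether two writes are ordered
#    or concurrent. Concurrent writes are surfaced to the application.
def compare(vv_a, vv_b):
    nodes = set(vv_a) | set(vv_b)
    a_ge = all(vv_a.get(n, 0) >= vv_b.get(n, 0) for n in nodes)
    b_ge = all(vv_b.get(n, 0) >= vv_a.get(n, 0) for n in nodes)
    if a_ge and b_ge:
        return "equal"
    if a_ge:
        return "a_newer"
    if b_ge:
        return "b_newer"
    return "concurrent"  # neither dominates: a true conflict

# 3. G-Counter CRDT: one counter slot per node, merged by element-wise
#    max — replicas converge no matter the order merges happen in.
def gcounter_merge(c_a, c_b):
    return {n: max(c_a.get(n, 0), c_b.get(n, 0)) for n in set(c_a) | set(c_b)}

def gcounter_value(c):
    return sum(c.values())
```

Note the trade-off in code: LWW never reports a conflict (it just loses data), version vectors report conflicts but make them your problem, and CRDTs avoid conflicts entirely but only for data types with a merge function like this one.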
The PACELC extension
CAP only covers partition scenarios. PACELC adds: even when there is no partition (E), there is a trade-off between Latency (L) and Consistency (C).
Example: Postgres synchronous replication gives you consistency but adds write latency (must wait for replica ack). Async replication is faster but risks data loss if the primary fails before replication.
-- Synchronous replication: consistent but slower
synchronous_commit = on -- wait for replica WAL write
synchronous_standby_names = '*'
-- Asynchronous replication: faster but eventual
synchronous_commit = off -- return success immediately
-- Risk: up to ~3x wal_writer_delay of recent commits lost on crash
Practical patterns for distributed state
Sagas for distributed transactions: When you need multi-step operations across services with no distributed transaction support, use a saga. Each step has a compensating action. On failure, roll back by executing compensating actions in reverse.
# Choreography saga — services react to events
# Step 1: Order service creates order (PENDING)
# Step 2: Payment service charges card → emits PAYMENT_SUCCESS
# Step 3: Inventory service reserves stock → emits INVENTORY_RESERVED
# Step 4: Order service marks CONFIRMED
# On PAYMENT_FAILED: order service cancels → emits ORDER_CANCELLED
# On INVENTORY_FAILED: payment service refunds → emits PAYMENT_REFUNDED
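The same flow can also be run as an orchestrated saga, where a single coordinator invokes each step and unwinds completed steps on failure. A minimal sketch — the order/payment/inventory callables are hypothetical stand-ins for real service calls:

```python
class SagaFailed(Exception):
    pass

def run_saga(steps):
    """steps: list of (action, compensation) callables.
    Runs actions in order; if one raises, runs the compensations of the
    steps that completed, in reverse order, then re-raises."""
    completed = []
    try:
        for action, compensate in steps:
            action()
            completed.append(compensate)
    except Exception as exc:
        for compensate in reversed(completed):
            compensate()  # best-effort; real systems retry and log these
        raise SagaFailed(str(exc)) from exc

# Hypothetical usage — names are illustrative, not a real API:
# run_saga([
#     (create_order,  cancel_order),
#     (charge_card,   refund_card),
#     (reserve_stock, release_stock),
# ])
```

The key design constraint: compensations must themselves be retryable, because the coordinator may crash mid-rollback and replay them.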
Idempotency keys: When retrying operations across unreliable networks, use idempotency keys to deduplicate. The server stores the key and result — if the same key arrives again, return the stored result without re-executing.
import json
from fastapi import FastAPI, Header

app = FastAPI()

@app.post("/payments")
async def create_payment(
    request: PaymentRequest,
    idempotency_key: str = Header(...),
):
    # Check if we've already processed this key
    existing = await redis.get(f"idem:{idempotency_key}")
    if existing:
        return json.loads(existing)
    # Note: this check-then-set is not atomic — two concurrent retries can
    # both miss the cache and charge twice. Production code should reserve
    # the key first (e.g. Redis SET with NX) before processing.
    result = await process_payment(request)
    await redis.setex(
        f"idem:{idempotency_key}",
        86400,  # 24h TTL
        json.dumps(result),
    )
    return result
When to choose CP vs AP
Use CP (strong consistency) when correctness is non-negotiable:
- Financial transactions, inventory counts, seat booking
- Auth tokens, permissions (never serve stale access control)
- Leader election, distributed locks
Use AP (eventual consistency) when availability matters more than perfect accuracy:
- Shopping cart (losing a cart item is bad; cart being unavailable is worse)
- Social feed, like counts, view counters
- Search indexes, recommendations, analytics
The Staff engineer move: design your system so that the components that need strong consistency (payments, inventory) use CP storage, while everything else uses AP storage. Do not pay the consistency tax everywhere.
Quiz: test your understanding
Before moving on, answer these in your head (or out loud):
- Your e-commerce site uses DynamoDB (AP). A customer adds an item to their cart on the US-East node. A network partition occurs. They immediately check their cart on US-West — it is empty. Is this expected? How would you mitigate this?
- You are designing a distributed counter for tracking API rate limits. You need the count to be accurate within 5%. Would you choose a CP or AP approach? What data structure would you use?
- Explain the difference between Last Write Wins and vector clocks. When does LWW silently lose data? Give a concrete example.
- Your payment service calls three downstream services in sequence. The third call fails after the first two succeed. How do you ensure the system returns to a consistent state? What information do you need to store, and where?
- A colleague proposes using Redis for session storage, saying "it is fast and available." You know Redis uses async replication by default. What specific failure scenario should you warn them about, and how would you mitigate it?
Next up — Part 28: Observability. Metrics, logs, traces — and how to actually use them to find production problems.