Staff Prep 25: Message Queues — Kafka vs SQS vs Redis Streams
Back to Part 24: Redis Patterns. Message queues decouple producers from consumers and let you process work asynchronously. Kafka, SQS and Redis Streams all solve that same problem with very different trade-offs in durability, throughput, operational overhead and replay. Picking the wrong one is the kind of mistake you pay for twelve months later when you're rewriting your event pipeline on a Saturday. Ask me how I know.
Core concepts: topics, partitions, consumer groups
Topic: A named stream of messages. Producers write to topics; consumers read from topics.
Partition: A topic is split into N partitions. Each partition is an ordered, append-only log. Partitions enable parallelism: each consumer in a group is assigned one or more partitions exclusively.
Consumer Group: Multiple consumers sharing the work of processing a topic. Each message is delivered to exactly one consumer in the group. Different groups each receive all messages independently.
Kafka: high-throughput, durable, replayable
from aiokafka import AIOKafkaProducer, AIOKafkaConsumer
import asyncio
import json
# Producer
async def produce_order_event(order_id: int, event_type: str, data: dict):
producer = AIOKafkaProducer(
bootstrap_servers="kafka:9092",
value_serializer=lambda v: json.dumps(v).encode(),
)
await producer.start()
try:
# Partition by order_id for ordering guarantees within an order
await producer.send_and_wait(
"order-events",
value={"order_id": order_id, "type": event_type, "data": data},
key=str(order_id).encode(), # same key = same partition = ordered
)
finally:
await producer.stop()
# Consumer with manual offset management
async def consume_order_events():
consumer = AIOKafkaConsumer(
"order-events",
bootstrap_servers="kafka:9092",
group_id="order-processor",
auto_offset_reset="earliest",
enable_auto_commit=False, # manual commit for exactly-once processing
value_deserializer=lambda v: json.loads(v.decode()),
)
await consumer.start()
try:
async for message in consumer:
try:
await process_order_event(message.value)
await consumer.commit() # commit AFTER processing (at-least-once)
except ProcessingError:
# Don't commit — message will be redelivered
pass
finally:
await consumer.stop()
Kafka delivery semantics
from aiokafka import AIOKafkaProducer
# At-most-once: commit before processing (message may be lost on crash)
async def at_most_once(consumer, message):
await consumer.commit() # commit first
await process(message) # if this crashes, message is lost
# At-least-once: commit after processing (message may be processed twice)
async def at_least_once(consumer, message):
await process(message) # process first
await consumer.commit() # if crash between process and commit: redelivered
# Exactly-once: producer transactions + idempotent consumer
producer = AIOKafkaProducer(
bootstrap_servers="kafka:9092",
transactional_id="my-producer-1", # enables transactions
enable_idempotence=True, # deduplication at producer level
)
await producer.start()
async with producer.transaction():
await producer.send("output-topic", value=processed_data)
# Only committed if the transaction succeeds
# Consumer must use isolation_level="read_committed" to see only committed messages
SQS: fully managed, simpler, limited
import boto3
import json
from typing import Optional
sqs = boto3.client("sqs", region_name="us-east-1")
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789/order-events"
DLQ_URL = "https://sqs.us-east-1.amazonaws.com/123456789/order-events-dlq"
# Producer
def publish_event(event: dict):
sqs.send_message(
QueueUrl=QUEUE_URL,
MessageBody=json.dumps(event),
MessageGroupId=str(event["order_id"]), # FIFO queue: ordering per group
MessageDeduplicationId=event["idempotency_key"], # exactly-once delivery in FIFO
)
# Consumer
def consume_messages(max_messages: int = 10):
response = sqs.receive_message(
QueueUrl=QUEUE_URL,
MaxNumberOfMessages=max_messages,
WaitTimeSeconds=20, # long polling (reduces empty receives)
VisibilityTimeout=60, # message hidden from other consumers for 60s
)
for message in response.get("Messages", []):
try:
body = json.loads(message["Body"])
process_event(body)
# Delete after successful processing
sqs.delete_message(
QueueUrl=QUEUE_URL,
ReceiptHandle=message["ReceiptHandle"]
)
except Exception as e:
# Do NOT delete — message becomes visible again after VisibilityTimeout
# After maxReceiveCount failures, SQS moves to DLQ automatically
print(f"Processing failed: {e}")
Dead letter queues: handling persistent failures
import boto3
sqs = boto3.client("sqs")
# Configure DLQ: after 5 failed processing attempts, move to DLQ
sqs.set_queue_attributes(
QueueUrl=QUEUE_URL,
Attributes={
"RedrivePolicy": json.dumps({
"deadLetterTargetArn": "arn:aws:sqs:us-east-1:123456789:order-events-dlq",
"maxReceiveCount": "5",
})
}
)
# Monitor DLQ for alerts
def get_dlq_depth() -> int:
attrs = sqs.get_queue_attributes(
QueueUrl=DLQ_URL,
AttributeNames=["ApproximateNumberOfMessages"]
)
return int(attrs["Attributes"]["ApproximateNumberOfMessages"])
# Replay DLQ messages (after fixing the bug)
def redrive_dlq():
sqs.start_message_move_task(
SourceArn="arn:aws:sqs:us-east-1:123456789:order-events-dlq",
DestinationArn="arn:aws:sqs:us-east-1:123456789:order-events",
MaxNumberOfMessagesPerSecond=10,
)
Comparison: Kafka vs SQS vs Redis streams
Kafka does 1M+ msg/s, retains messages for replay and has mature consumer group semantics. It is also genuinely painful to operate yourself. Use it when you need event sourcing, analytics pipelines or cross-team event buses, and preferably on a managed offering like MSK or Confluent Cloud unless you have a dedicated infra team.
SQS handles up to 3,000 msg/s standard (higher with FIFO), is fully managed, has no replay, caps retention at 14 days and gives you dead letter queues for free. The ops story is the simplest of the three, and for most web apps that's all you need.
Redis Streams sits around 100k msg/s with configurable retention and consumer groups. If Redis is already in your stack, it's a very low-friction choice for moderate throughput. I tend to start here and graduate only when the numbers force me to.
Quiz: test your understanding
Before moving on, answer these in your head (or out loud):
- Kafka retains messages for days or weeks. SQS retains for up to 14 days. What capability does retention enable that a traditional queue does not?
- You have a Kafka consumer processing payment events. The consumer processes the message but crashes before committing the offset. What happens on restart? Is this safe?
- What is a Dead Letter Queue? When should a message be sent to the DLQ instead of retried? How do you handle DLQ messages after fixing the underlying bug?
- Your order processing pipeline has 5 consumers in a Kafka consumer group reading from a topic with 3 partitions. How many consumers are actually working at any given time? How many are idle?
- You need ordering guarantees for all events related to the same order. How do you achieve this in Kafka? In SQS?
Next up: Part 26: Load Balancing Strategies. L4 vs L7, consistent hashing, health checks, and sticky session trade-offs.