SQS

Fully managed message queues for decoupling and scaling distributed systems

Amazon Simple Queue Service (SQS) is a fully managed message queuing service that enables decoupling of application components, allowing them to communicate asynchronously and scale independently. It provides a reliable buffer between producers and consumers, absorbing traffic spikes without losing messages. SQS is one of the oldest and most fundamental AWS services, forming the backbone of resilient distributed architectures.

How SQS Works: Queues, Visibility, and Polling

SQS is a pull-based queue. Consumers poll the queue to receive messages, process them, and then explicitly delete them. Until a consumer deletes the message, it remains in the queue and can be redelivered.

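The visibility timeout is the core mechanism for limiting duplicate processing. When a consumer receives a message, the message becomes invisible to other consumers for the visibility timeout duration. If the consumer fails to delete it before the timeout expires (for example, because it crashed mid-processing), the message becomes visible again and another consumer can pick it up. The timeout narrows the window for duplicates but does not eliminate them, so consumers of Standard queues should still be idempotent.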
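The receive, process, delete cycle described above can be sketched in Python. This is a minimal illustration, not a production consumer: `drain_once` and `handle` are hypothetical names, the 60-second visibility timeout is an arbitrary example value, and `sqs` is assumed to be a boto3 SQS client (or any object exposing the same `receive_message` and `delete_message` methods).

```python
from typing import Any, Callable


def drain_once(sqs: Any, queue_url: str, handle: Callable[[str], None]) -> int:
    """Receive one batch, process each message, delete only the successes.

    Returns the number of messages deleted.
    """
    response = sqs.receive_message(
        QueueUrl=queue_url,
        MaxNumberOfMessages=10,  # API maximum per call
        WaitTimeSeconds=20,      # long polling
        VisibilityTimeout=60,    # illustrative; size to worst-case processing time
    )
    deleted = 0
    for message in response.get("Messages", []):
        try:
            handle(message["Body"])
        except Exception:
            # No delete: the message becomes visible again once the
            # visibility timeout expires and will be redelivered.
            continue
        sqs.delete_message(
            QueueUrl=queue_url,
            ReceiptHandle=message["ReceiptHandle"],
        )
        deleted += 1
    return deleted
```

Deleting only after `handle` returns is what makes a crash safe: an undeleted message simply reappears for another consumer.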

Parameter	Default	Max	Notes
Visibility timeout	30 seconds	12 hours	Set above your worst-case processing time (for Lambda consumers, at least 6x the function timeout)
Message retention	4 days	14 days	Messages deleted after retention period
Max message size	-	256 KB	Use S3 + SQS Extended Client for larger payloads
Max receives before DLQ	-	1000	Set maxReceiveCount on redrive policy
Long polling wait time	0 seconds	20 seconds	Use 20s for cost savings
Delay queue	0 seconds	15 minutes	Delays message visibility to all consumers

Standard vs FIFO Queues

Choosing between Standard and FIFO queues is one of the most common SQS architectural decisions. The tradeoffs are significant.

Feature	Standard Queue	FIFO Queue
Message order	Best-effort (generally in order)	Strict FIFO per message group ID
Delivery guarantee	At-least-once (duplicates possible)	Exactly-once processing
Throughput	Nearly unlimited	300 API calls/sec per action (3,000 messages/sec with batching); higher with high throughput mode
Deduplication	Not supported	Content-based or deduplication ID (5 min window)
Naming	Any name	Must end in .fifo
SNS fan-out	Any SNS topic type	Only from SNS FIFO topics
Use cases	High-volume jobs, decoupling, buffering	Order processing, financial transactions, inventory

⚠️

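Within a message group, FIFO queues deliver messages strictly in order: while one consumer holds in-flight messages for a group, no other consumer can receive messages from that group. If you have a single message group, scaling consumers does not increase throughput - only adding more message groups helps. Design your message group ID strategy carefully.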
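One possible message group strategy, sketched in Python under stated assumptions: the order payload, the `customer_id` field, and the `fifo_send_params` helper are all hypothetical. The returned keys match the keyword arguments of boto3's `send_message`. Grouping by customer keeps each customer's messages in order while letting different customers be processed in parallel.

```python
import hashlib
import json
from typing import Any, Dict


def fifo_send_params(queue_url: str, order: Dict[str, Any]) -> Dict[str, str]:
    """Build send_message keyword arguments for a FIFO queue."""
    # Sorting keys makes the body (and its hash) stable for equal payloads.
    body = json.dumps(order, sort_keys=True)
    return {
        "QueueUrl": queue_url,
        "MessageBody": body,
        # Ordering is guaranteed only within a group, so one group per
        # customer allows parallelism across customers.
        "MessageGroupId": f"customer-{order['customer_id']}",
        # Explicit deduplication ID: identical payloads sent within the
        # 5-minute deduplication window are accepted only once.
        "MessageDeduplicationId": hashlib.sha256(body.encode("utf-8")).hexdigest(),
    }
```

With a boto3 client this would be used as `sqs.send_message(**fifo_send_params(queue_url, order))`. A finer-grained group ID (for example per order) gives more parallelism at the cost of weaker ordering.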

Dead-Letter Queues and Error Handling

A dead-letter queue (DLQ) is a separate SQS queue to which SQS moves messages after they have been received a configurable number of times (maxReceiveCount) without being deleted. DLQs are essential for debugging and for preventing poison pill messages from blocking a queue.

bash

# Create a DLQ
aws sqs create-queue --queue-name my-service-dlq

# Get the DLQ ARN
DLQ_ARN=$(aws sqs get-queue-attributes \
  --queue-url https://sqs.us-east-1.amazonaws.com/123/my-service-dlq \
  --attribute-names QueueArn \
  --query Attributes.QueueArn --output text)

# Set redrive policy on main queue (move to DLQ after 3 receives without a delete)
# $DLQ_ARN is double-quoted so the shell never word-splits the value
aws sqs set-queue-attributes \
  --queue-url https://sqs.us-east-1.amazonaws.com/123/my-service \
  --attributes '{
    "RedrivePolicy": "{\"deadLetterTargetArn\":\"'"$DLQ_ARN"'\",\"maxReceiveCount\":\"3\"}"
  }'
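The nested JSON escaping in the redrive policy is easy to get wrong by hand. As a sketch, the same attribute can be built in Python with `json.dumps`; `redrive_attributes` is a hypothetical helper, and its return value is shaped for the `Attributes` argument of boto3's `set_queue_attributes`.

```python
import json
from typing import Dict


def redrive_attributes(dlq_arn: str, max_receive_count: int = 3) -> Dict[str, str]:
    """Build the queue attributes that attach a DLQ to a source queue."""
    if not 1 <= max_receive_count <= 1000:
        raise ValueError("maxReceiveCount must be between 1 and 1000")
    policy = {
        "deadLetterTargetArn": dlq_arn,
        "maxReceiveCount": str(max_receive_count),
    }
    # RedrivePolicy is itself a JSON string nested inside the attribute map.
    return {"RedrivePolicy": json.dumps(policy)}
```

Usage would look like `sqs.set_queue_attributes(QueueUrl=queue_url, Attributes=redrive_attributes(dlq_arn))`.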

💡

Set up CloudWatch alarms on DLQ depth (ApproximateNumberOfMessagesVisible > 0). A non-empty DLQ always means something needs attention. Use SQS DLQ redrive to replay messages back to the source queue after fixing the bug.
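A DLQ depth alarm along these lines could be defined as follows. This is a sketch: `dlq_alarm_params`, the alarm name format, the 5-minute period, and the SNS topic used for notifications are assumptions. The keys correspond to keyword arguments of boto3's CloudWatch `put_metric_alarm`.

```python
from typing import Any, Dict


def dlq_alarm_params(dlq_name: str, sns_topic_arn: str) -> Dict[str, Any]:
    """Build put_metric_alarm arguments that fire when the DLQ is non-empty."""
    return {
        "AlarmName": f"{dlq_name}-not-empty",  # hypothetical naming scheme
        "Namespace": "AWS/SQS",
        "MetricName": "ApproximateNumberOfMessagesVisible",
        "Dimensions": [{"Name": "QueueName", "Value": dlq_name}],
        "Statistic": "Maximum",
        "Period": 300,                # seconds; illustrative
        "EvaluationPeriods": 1,
        "Threshold": 0,
        "ComparisonOperator": "GreaterThanThreshold",
        # An idle queue may report no data; treat that as healthy.
        "TreatMissingData": "notBreaching",
        "AlarmActions": [sns_topic_arn],
    }
```

With a CloudWatch client this would be applied as `cloudwatch.put_metric_alarm(**dlq_alarm_params("my-service-dlq", topic_arn))`.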

DLQ redrive (re-processing messages from a DLQ) was added natively to the SQS console in late 2021, and to the API and CLI (StartMessageMoveTask) in 2023. Before that, you had to write scripts to move messages manually.

bash

# Start DLQ redrive (replay messages from DLQ back to source queue)
aws sqs start-message-move-task \
  --source-arn arn:aws:sqs:us-east-1:123:my-service-dlq \
  --destination-arn arn:aws:sqs:us-east-1:123:my-service \
  --max-number-of-messages-per-second 10

SQS as Lambda Event Source

Lambda can poll SQS as an event source mapping. Lambda manages the polling, batching, and concurrency scaling automatically. Understanding how this works is critical for building reliable serverless pipelines.

Setting	Description	Recommendation
Batch size	Messages delivered per Lambda invocation (1-10,000 for Standard queues, 1-10 for FIFO)	Start with 10, tune based on processing time; sizes above 10 require a batch window of at least 1 second
Batch window	Time to wait to fill a batch (0-300s)	Use for low-volume queues to reduce invocations
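Concurrency scaling	Lambda adds concurrent invocations as the backlog grows, up to roughly 1,000 for Standard queues (the ramp rate has changed over time; check the current Lambda documentation)	Cap concurrency with the event source mapping's maximum concurrency setting, or with reserved concurrency, to avoid overwhelming downstream systems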
Error handling	Failed batches become visible again after the visibility timeout and are retried until they succeed, reach maxReceiveCount, or exceed the retention period	Enable ReportBatchItemFailures for partial failures
Visibility timeout	Must be at least the Lambda function timeout	Set queue visibility to at least 6x the function timeout, plus the batch window

⚠️

By default, if any message in a batch fails, the entire batch is returned to the queue and retried. This can cause already-processed messages to be processed again. Enable partial batch failure reporting (ReportBatchItemFailures in the event source mapping's function response types) so that only the failed messages are retried.

With partial batch failure reporting enabled, the Lambda function returns the IDs of the failed messages in its response:

json

{
  "batchItemFailures": [
    {"itemIdentifier": "message-id-that-failed"},
    {"itemIdentifier": "another-failed-message-id"}
  ]
}
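A handler that produces this response shape might look like the sketch below. The `process` function and its `order_id` check are hypothetical stand-ins for real business logic; the `Records`, `messageId`, and `body` fields follow the SQS event structure that Lambda passes to the function.

```python
import json
from typing import Any, Dict, List


def process(body: Dict[str, Any]) -> None:
    """Hypothetical business logic; raises on failure."""
    if "order_id" not in body:
        raise ValueError("missing order_id")


def handler(event: Dict[str, Any], context: Any) -> Dict[str, List[Dict[str, str]]]:
    """Process each SQS record and report only the ones that failed."""
    failures: List[Dict[str, str]] = []
    for record in event["Records"]:
        try:
            process(json.loads(record["body"]))
        except Exception:
            # Only this message is retried; successful ones are deleted.
            failures.append({"itemIdentifier": record["messageId"]})
    return {"batchItemFailures": failures}
```

An empty `batchItemFailures` list signals that the whole batch succeeded. This only takes effect when ReportBatchItemFailures is enabled on the event source mapping; otherwise the return value is ignored and a raised exception retries the whole batch.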

SQS Pricing and Optimization

Item	Cost	Notes
First 1 million requests/month	Free	Applies to Standard and FIFO
Standard queue requests	$0.40 per million	Each API call = one request per 64 KB chunk of payload
FIFO queue requests	$0.50 per million	25% more than Standard
Data transfer	Free within same region	Cross-region transfer charged at EC2 rates

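Long polling is the most important cost optimization. With short polling (the default, WaitTimeSeconds=0), ReceiveMessage returns immediately, and every call is billed even when it returns no messages. With long polling (WaitTimeSeconds=20), the request waits up to 20 seconds for a message to arrive, which sharply reduces the number of empty responses.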

bash

# Receive messages with long polling (20 second wait)
aws sqs receive-message \
  --queue-url https://sqs.us-east-1.amazonaws.com/123/my-queue \
  --max-number-of-messages 10 \
  --wait-time-seconds 20

# Set long polling at queue level (applies to all consumers)
aws sqs set-queue-attributes \
  --queue-url https://sqs.us-east-1.amazonaws.com/123/my-queue \
  --attributes '{"ReceiveMessageWaitTimeSeconds":"20"}'
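To put rough numbers on the saving, the sketch below estimates the monthly cost of polling an empty Standard queue. The assumptions are loud ones: a 30-day month, the $0.40 per million request price, no free tier, and a short-polling loop that issues one request per second (a tight loop would issue far more). `empty_poll_cost` is a hypothetical helper.

```python
SECONDS_PER_MONTH = 30 * 24 * 60 * 60  # 2,592,000 seconds in a 30-day month
PRICE_PER_MILLION = 0.40               # USD, Standard queue requests


def empty_poll_cost(seconds_between_requests: float, consumers: int = 1) -> float:
    """Monthly USD cost of ReceiveMessage calls against an always-empty queue."""
    requests = consumers * SECONDS_PER_MONTH / seconds_between_requests
    return requests / 1_000_000 * PRICE_PER_MILLION


short = empty_poll_cost(1)    # one short poll per second
long_ = empty_poll_cost(20)   # one 20-second long poll at a time
print(f"short polling: ${short:.4f}/month, long polling: ${long_:.4f}/month")
```

Under these assumptions a single idle consumer costs about $1.04 per month with short polling and about $0.05 with long polling, a 20x difference that scales linearly with the number of consumers.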

💡

Batch operations (SendMessageBatch, DeleteMessageBatch) process up to 10 messages per API call but are billed as a single request, as long as the total payload fits within one 64 KB billing chunk. Always batch when possible - it can reduce request costs by up to 10x and improves throughput.
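Batching can be sketched as a small helper that groups message bodies into sets of 10 and sends each set with one call. `batches_of_ten` and `send_all` are hypothetical names, and `sqs` is assumed to be a boto3 SQS client (or any object with a compatible `send_message_batch` method). Batch calls can partially fail, so the sketch collects the entry IDs the service reports as failed.

```python
from typing import Any, Dict, Iterator, List


def batches_of_ten(bodies: List[str]) -> Iterator[List[Dict[str, str]]]:
    """Yield SendMessageBatch entry lists of at most 10 messages each."""
    for start in range(0, len(bodies), 10):
        chunk = bodies[start:start + 10]
        # Entry Ids only need to be unique within a batch; the global index
        # is used here so failures can be traced back to the input list.
        yield [
            {"Id": str(start + offset), "MessageBody": body}
            for offset, body in enumerate(chunk)
        ]


def send_all(sqs: Any, queue_url: str, bodies: List[str]) -> List[str]:
    """Send all bodies in batches; return the Ids reported as failed."""
    failed: List[str] = []
    for entries in batches_of_ten(bodies):
        response = sqs.send_message_batch(QueueUrl=queue_url, Entries=entries)
        failed.extend(item["Id"] for item in response.get("Failed", []))
    return failed
```

Sending 25 messages this way takes 3 API calls instead of 25. The returned IDs index into the original list, so a caller can retry just those messages.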

🎯

Interview Focus Points

1. Explain the SQS visibility timeout and what happens if a consumer crashes before deleting the message.
2. What is a dead-letter queue and how would you use it in a production system?
3. When would you choose FIFO over Standard SQS, and what are the throughput tradeoffs?
4. How does SQS as a Lambda event source work? What happens when a batch fails?
5. What is the difference between a message delay and a visibility timeout?
6. How would you handle a poison pill message - a message that always fails processing?
7. Explain the SNS fan-out to multiple SQS queues pattern and why it is used.
8. What is long polling and why should you use it instead of short polling?
9. How would you implement exactly-once processing with SQS Standard queues?
10. What is the maximum SQS message size and how do you handle payloads larger than that?