AWS Messaging & Integration
SQS
Fully managed message queues for decoupling and scaling distributed systems
Amazon Simple Queue Service (SQS) is a fully managed message queuing service that enables decoupling of application components, allowing them to communicate asynchronously and scale independently. It provides a reliable buffer between producers and consumers, absorbing traffic spikes without losing messages. SQS is one of the oldest and most fundamental AWS services, forming the backbone of resilient distributed architectures.
How SQS Works: Queues, Visibility, and Polling
SQS is a pull-based queue. Consumers poll the queue to receive messages, process them, and then explicitly delete them. Until a consumer deletes the message, it remains in the queue and can be redelivered.
The visibility timeout is the core mechanism for preventing duplicate processing. When a consumer receives a message, the message becomes invisible to other consumers for the visibility timeout duration. If the consumer fails to delete it before the timeout expires, the message becomes visible again and another consumer can pick it up.
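The receive, process, delete cycle can be sketched as follows. This is a minimal illustration, assuming `sqs` is a boto3 SQS client (`boto3.client("sqs")`) passed in by the caller; `poll_once` and `handler` are hypothetical names, not part of any AWS API.

```python
def poll_once(sqs, queue_url, handler, wait_seconds=20):
    """Receive one batch, process each message, and delete only on success.

    `sqs` is assumed to be a boto3 SQS client; `handler` is a hypothetical
    callable that raises an exception when processing fails.
    Returns the number of messages deleted.
    """
    response = sqs.receive_message(
        QueueUrl=queue_url,
        MaxNumberOfMessages=10,
        WaitTimeSeconds=wait_seconds,
    )
    deleted = 0
    for message in response.get("Messages", []):
        try:
            handler(message["Body"])
        except Exception:
            # No delete call: the message becomes visible again once the
            # visibility timeout expires and can be redelivered.
            continue
        sqs.delete_message(
            QueueUrl=queue_url,
            ReceiptHandle=message["ReceiptHandle"],
        )
        deleted += 1
    return deleted
```

Deleting only after the handler returns is what makes a crashed consumer safe: a message that was never deleted simply reappears for another consumer.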
| Parameter | Default | Max | Notes |
|---|---|---|---|
| Visibility timeout | 30 seconds | 12 hours | Set to at least 6x your expected processing time |
| Message retention | 4 days | 14 days | Messages deleted after retention period |
| Max message size | - | 256 KB | Use S3 + SQS Extended Client for larger payloads |
| Max receives before DLQ | - | 1000 | Set maxReceiveCount on redrive policy |
| Long polling wait time | 0 seconds | 20 seconds | Use 20s for cost savings |
| Delay queue | 0 seconds | 15 minutes | Delays message visibility to all consumers |
Standard vs FIFO Queues
Choosing between Standard and FIFO queues is one of the most common SQS architectural decisions. The tradeoffs are significant.
| Feature | Standard Queue | FIFO Queue |
|---|---|---|
| Message order | Best-effort (generally in order) | Strict FIFO per message group ID |
| Delivery guarantee | At-least-once (duplicates possible) | Exactly-once processing |
| Throughput | Nearly unlimited | 300 API calls/sec per action, or 3,000 messages/sec with batching (higher with high throughput mode) |
| Deduplication | Not supported | Content-based or deduplication ID (5 min window) |
| Naming | Any name | Must end in .fifo |
| SNS fan-out | Any SNS topic type | Only from SNS FIFO topics |
| Use cases | High-volume jobs, decoupling, buffering | Order processing, financial transactions, inventory |
FIFO queues deliver the messages of a message group strictly in order: while a message from a group is in flight, no other message from that group is handed to any consumer. If you have a single message group, adding consumers does not increase throughput; only adding more message groups helps. Design your message group ID strategy carefully.
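One possible message group strategy is a group per business entity. The sketch below builds the keyword arguments for a boto3 `send_message` call on a FIFO queue; the `order` dict with `customer_id` and `order_id` keys and the helper name `fifo_send_params` are assumptions made for illustration.

```python
import hashlib
import json


def fifo_send_params(queue_url, order):
    """Build keyword arguments for a boto3 `send_message` call on a FIFO queue.

    `order` is a hypothetical dict with "customer_id" and "order_id" keys.
    """
    body = json.dumps(order, sort_keys=True)
    return {
        "QueueUrl": queue_url,
        "MessageBody": body,
        # One group per customer: ordering is preserved per customer, while
        # messages for different customers can be consumed in parallel.
        "MessageGroupId": f"customer-{order['customer_id']}",
        # Explicit deduplication ID derived from the body: an identical
        # message sent again within the deduplication window is dropped.
        "MessageDeduplicationId": hashlib.sha256(body.encode("utf-8")).hexdigest(),
    }
```

With this choice, throughput scales with the number of active customers rather than with the number of consumers alone.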
Dead-Letter Queues and Error Handling
A dead-letter queue (DLQ) is a separate SQS queue where messages are moved after they have been received a configurable number of times (maxReceiveCount) without being successfully processed and deleted. DLQs are essential for debugging and for preventing poison pill messages from blocking a queue.
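To make the maxReceiveCount behavior concrete, here is a toy in-memory model. It is not SQS code and ignores timing entirely; `drain` and `handler` are hypothetical names used only for this illustration.

```python
from collections import deque


def drain(messages, handler, max_receive_count=3):
    """Toy model of a source queue with a redrive policy.

    Each message is retried until it succeeds or has been received
    `max_receive_count` times, after which it goes to the dead-letter list.
    Returns (processed, dead_lettered).
    """
    queue = deque((body, 0) for body in messages)
    processed, dead_lettered = [], []
    while queue:
        body, receives = queue.popleft()
        receives += 1
        try:
            handler(body)
            processed.append(body)
        except Exception:
            if receives >= max_receive_count:
                # Poison pill: stop retrying and park it for inspection.
                dead_lettered.append(body)
            else:
                # Back to the queue, as if the visibility timeout expired.
                queue.append((body, receives))
    return processed, dead_lettered
```

A message that always fails is attempted a bounded number of times and then set aside, so healthy messages behind it keep flowing.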
# Create a DLQ
aws sqs create-queue --queue-name my-service-dlq
# Get the DLQ ARN
DLQ_ARN=$(aws sqs get-queue-attributes \
  --queue-url https://sqs.us-east-1.amazonaws.com/123/my-service-dlq \
  --attribute-names QueueArn \
  --query Attributes.QueueArn --output text)
# Set redrive policy on main queue (send to DLQ after 3 failures)
aws sqs set-queue-attributes \
  --queue-url https://sqs.us-east-1.amazonaws.com/123/my-service \
  --attributes '{
    "RedrivePolicy": "{\"deadLetterTargetArn\":\"'$DLQ_ARN'\",\"maxReceiveCount\":\"3\"}"
  }'
Set up CloudWatch alarms on DLQ depth (ApproximateNumberOfMessagesVisible > 0). A non-empty DLQ always means something needs attention. Use SQS DLQ redrive to replay messages back to the source queue after fixing the bug.
DLQ redrive (re-processing messages from a DLQ) was added natively to the SQS console in 2021; API and CLI support (StartMessageMoveTask) followed in 2023. Before that, you had to write scripts to move messages manually.
# Start DLQ redrive (replay messages from DLQ back to source queue)
aws sqs start-message-move-task \
  --source-arn arn:aws:sqs:us-east-1:123:my-service-dlq \
  --destination-arn arn:aws:sqs:us-east-1:123:my-service \
  --max-number-of-messages-per-second 10
SQS as Lambda Event Source
Lambda can poll SQS as an event source mapping. Lambda manages the polling, batching, and concurrency scaling automatically. Understanding how this works is critical for building reliable serverless pipelines.
| Setting | Description | Recommendation |
|---|---|---|
| Batch size | Messages delivered per Lambda invocation (1-10,000 for Standard queues, 1-10 for FIFO queues) | Start with 10, tune based on processing time |
| Batch window | Time to wait to fill a batch (0-300s) | Use for low-volume queues to reduce invocations |
| Concurrency scaling | Lambda scales up to 1000 concurrent executions (60/min ramp) | Set maximum concurrency on the event source mapping, or reserved concurrency on the function, to avoid overloading downstream systems |
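| Error handling | Failed batches become visible again after the visibility timeout and are retried until maxReceiveCount is reached or retention expires | Use ReportBatchItemFailures for partial failures |
| Visibility timeout | Should be at least 6x the Lambda function timeout | Set queue visibility to 6x function timeout, plus the batch window if one is configured |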
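The settings in this table map onto parameters of the Lambda `create_event_source_mapping` API. The sketch below only builds the keyword arguments for a boto3 Lambda client call and computes a visibility timeout from the 6x rule; the helper names are hypothetical, and adding the batch window to the 6x figure is my reading of AWS guidance, so treat it as an assumption.

```python
def recommended_visibility_timeout(function_timeout_s, batch_window_s=0):
    """Apply the 6x rule of thumb (batch window added as an assumption)."""
    return 6 * function_timeout_s + batch_window_s


def event_source_mapping_params(queue_arn, function_name, batch_size=10,
                                batch_window_s=0, max_concurrency=None):
    """Build kwargs for `lambda_client.create_event_source_mapping(**params)`."""
    params = {
        "EventSourceArn": queue_arn,
        "FunctionName": function_name,
        "BatchSize": batch_size,
        "MaximumBatchingWindowInSeconds": batch_window_s,
        # Enables partial batch failure reporting for this mapping.
        "FunctionResponseTypes": ["ReportBatchItemFailures"],
    }
    if max_concurrency is not None:
        # Caps how many concurrent invocations this queue can drive.
        params["ScalingConfig"] = {"MaximumConcurrency": max_concurrency}
    return params
```

For example, a function with a 30 second timeout and a 5 second batch window would suggest a queue visibility timeout of 185 seconds under these assumptions.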
By default, if any message in a batch fails, the entire batch is returned to the queue and retried. This can cause already-processed messages to be processed again. Implement partial batch failure reporting (the ReportBatchItemFailures response type) so that only the failed messages are retried.
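A handler that reports partial failures might look like the sketch below. It assumes the event source mapping has ReportBatchItemFailures enabled; `process` is a hypothetical stand-in for real business logic, and the JSON message body with an `order_id` field is an assumption for illustration.

```python
import json


def process(payload):
    """Hypothetical business logic; raises on input it cannot handle."""
    if "order_id" not in payload:
        raise ValueError("missing order_id")


def handler(event, context):
    """Process each SQS record and report only the ones that failed."""
    failures = []
    for record in event["Records"]:
        try:
            process(json.loads(record["body"]))
        except Exception:
            # Only this message is retried; the rest of the batch is
            # treated as successfully processed and deleted by Lambda.
            failures.append({"itemIdentifier": record["messageId"]})
    return {"batchItemFailures": failures}
```

Returning an empty `batchItemFailures` list signals that the whole batch succeeded.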
# Lambda handler with partial batch failure reporting
# In your Lambda function response, return failed message IDs:
{
  "batchItemFailures": [
    {"itemIdentifier": "message-id-that-failed"},
    {"itemIdentifier": "another-failed-message-id"}
  ]
}
SQS Pricing and Optimization
| Item | Cost | Notes |
|---|---|---|
| First 1 million requests/month | Free | Applies to Standard and FIFO |
| Standard queue requests | $0.40 per million | Each API call = one request; payloads over 64 KB are billed as one request per 64 KB chunk |
| FIFO queue requests | $0.50 per million | 25% more than Standard |
| Data transfer | Free within same region | Cross-region transfer charged at EC2 rates |
Long polling is the most important cost optimization. With short polling (the default), every ReceiveMessage call is billed even when it returns no messages. With long polling (WaitTimeSeconds=20), a request waits up to 20 seconds for a message to arrive, which reduces empty responses dramatically.
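A rough back-of-the-envelope comparison for one consumer on an idle queue, using the Standard request price from the pricing table. The one-poll-per-second rate for short polling and the 30-day month are assumptions, and the free tier is ignored.

```python
SECONDS_PER_MONTH = 30 * 24 * 3600  # assumes a 30-day month
PRICE_PER_MILLION = 0.40            # Standard queue request price (USD)


def monthly_empty_receive_cost(seconds_between_polls):
    """Return (requests, cost_usd) for one consumer polling an empty queue."""
    requests = SECONDS_PER_MONTH // seconds_between_polls
    return requests, requests / 1_000_000 * PRICE_PER_MILLION


# Short polling, assumed at one ReceiveMessage call per second.
short_requests, short_cost = monthly_empty_receive_cost(1)
# Long polling: each call waits the full 20 seconds on an empty queue.
long_requests, long_cost = monthly_empty_receive_cost(20)
```

Under these assumptions the idle consumer makes 2,592,000 requests per month with short polling versus 129,600 with long polling, a 20x reduction that multiplies across every consumer and queue.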
# Receive messages with long polling (20 second wait)
aws sqs receive-message \
  --queue-url https://sqs.us-east-1.amazonaws.com/123/my-queue \
  --max-number-of-messages 10 \
  --wait-time-seconds 20
# Set long polling at queue level (applies to all consumers)
aws sqs set-queue-attributes \
  --queue-url https://sqs.us-east-1.amazonaws.com/123/my-queue \
  --attributes '{"ReceiveMessageWaitTimeSeconds":"20"}'
Batch operations (SendMessageBatch, DeleteMessageBatch) process up to 10 messages per API call but count as a single request. Always batch when possible - it reduces costs by up to 10x and improves throughput.
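Batching on the producer side can be sketched like this, assuming `sqs` is a boto3 SQS client passed in by the caller. `send_in_batches` is a hypothetical helper; a production version would also need to retry the entries reported as failed.

```python
def send_in_batches(sqs, queue_url, bodies, batch_size=10):
    """Send message bodies in chunks of up to 10 per SendMessageBatch call.

    Returns (api_calls, failed_ids), where failed_ids are the entry IDs
    that SQS reported as failed in the batch responses.
    """
    api_calls, failed_ids = 0, []
    for start in range(0, len(bodies), batch_size):
        chunk = bodies[start:start + batch_size]
        entries = [
            # Entry IDs only need to be unique within a single batch.
            {"Id": str(start + offset), "MessageBody": body}
            for offset, body in enumerate(chunk)
        ]
        response = sqs.send_message_batch(QueueUrl=queue_url, Entries=entries)
        api_calls += 1
        failed_ids.extend(item["Id"] for item in response.get("Failed", []))
    return api_calls, failed_ids
```

Sending 25 messages this way takes 3 API calls instead of 25.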
Interview Focus Points
1. Explain the SQS visibility timeout and what happens if a consumer crashes before deleting the message.
2. What is a dead-letter queue and how would you use it in a production system?
3. When would you choose FIFO over Standard SQS, and what are the throughput tradeoffs?
4. How does SQS as a Lambda event source work? What happens when a batch fails?
5. What is the difference between a message delay and a visibility timeout?
6. How would you handle a poison pill message - a message that always fails processing?
7. Explain the SNS fan-out to multiple SQS queues pattern and why it is used.
8. What is long polling and why should you use it instead of short polling?
9. How would you implement exactly-once processing with SQS Standard queues?
10. What is the maximum SQS message size and how do you handle payloads larger than that?