AWS Messaging & Integration
Step Functions
Visual workflow orchestration for microservices and distributed applications
AWS Step Functions is a serverless visual workflow service that orchestrates AWS services and Lambda functions into multi-step, stateful workflows. It handles the state management, error handling, retry logic, and branching that would otherwise require complex custom code. Step Functions is essential for building reliable distributed systems, data processing pipelines, and microservice orchestration on AWS.
State Machines, States, and Amazon States Language
Step Functions workflows are defined as state machines using Amazon States Language (ASL), a JSON-based language. Each step in the workflow is a state, and states are linked by transitions.
| State Type | Purpose | Example Use Case |
|---|---|---|
| Task | Invoke a Lambda, ECS task, DynamoDB operation, or 200+ integrations | Process payment, validate input |
| Choice | Branch execution based on input data (if/else) | Route based on order type |
| Parallel | Execute multiple branches concurrently | Run credit check + inventory check simultaneously |
| Map | Iterate over an array, running states for each element | Process each item in an order |
| Wait | Pause execution for a fixed time or until timestamp | Wait 5 minutes before retry |
| Pass | Pass input to output with optional transformation | Shape data between steps |
| Succeed | Terminate execution as success | End of a branch |
| Fail | Terminate execution as failure with error and cause | Unrecoverable error path |
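Because every state either names a follow-up state or ends the execution, a definition can be sanity-checked before it is deployed. The following is a minimal Python sketch of such a check, not an official validator: find_dangling_transitions and the EXAMPLE definition are hypothetical, and the helper only inspects top-level StartAt, Next, Default, Choices, and Catch targets (it does not descend into Parallel branches or Map iterators).

```python
def find_dangling_transitions(definition: dict) -> list:
    """Return transition targets that are referenced but not defined in "States"."""
    states = definition.get("States", {})
    targets = [definition.get("StartAt")]
    for state in states.values():
        targets.append(state.get("Next"))
        targets.append(state.get("Default"))  # fallback branch of a Choice state
        for rule in state.get("Choices", []):
            targets.append(rule.get("Next"))
        for catcher in state.get("Catch", []):
            targets.append(catcher.get("Next"))
    return sorted({t for t in targets if t is not None and t not in states})


# Hypothetical definition: the Choice state's Default points at a missing state.
EXAMPLE = {
    "StartAt": "RouteOrder",
    "States": {
        "RouteOrder": {
            "Type": "Choice",
            "Choices": [
                {"Variable": "$.orderType", "StringEquals": "digital", "Next": "Deliver"}
            ],
            "Default": "Ship",
        },
        "Deliver": {"Type": "Succeed"},
    },
}

print(find_dangling_transitions(EXAMPLE))  # ['Ship']
```

A check like this catches typos in state names early; Step Functions itself also rejects definitions with unknown targets when the state machine is created.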
// Minimal state machine definition (ASL)
{
  "Comment": "Order processing workflow",
  "StartAt": "ValidateOrder",
  "States": {
    "ValidateOrder": {
      "Type": "Task",
      "Resource": "arn:aws:lambda:us-east-1:123456789012:function:validate-order",
      "Next": "CheckInventory",
      "Retry": [{
        "ErrorEquals": ["States.TaskFailed"],
        "IntervalSeconds": 2,
        "MaxAttempts": 3,
        "BackoffRate": 2
      }],
      "Catch": [{
        "ErrorEquals": ["ValidationError"],
        "Next": "OrderFailed"
      }]
    },
    "CheckInventory": {
      "Type": "Task",
      "Resource": "arn:aws:states:::dynamodb:getItem",
      "Parameters": {
        "TableName": "inventory",
        "Key": {"productId": {"S.$": "$.productId"}}
      },
      "Next": "ProcessPayment"
    },
    "ProcessPayment": {
      "Type": "Task",
      "Resource": "arn:aws:lambda:us-east-1:123456789012:function:process-payment",
      "End": true
    },
    "OrderFailed": {"Type": "Fail", "Error": "OrderFailed", "Cause": "Validation failed"}
  }
}
Standard vs Express Workflows
Step Functions offers two workflow types with fundamentally different execution models, pricing, and use cases.
| Feature | Standard Workflow | Express Workflow |
|---|---|---|
| Max duration | 1 year | 5 minutes |
| Execution model | Exactly-once execution | At-least-once (async) or at-most-once (sync) |
| Execution history | Stored in Step Functions (90 days) | Must push to CloudWatch Logs |
| Pricing | $0.025 per 1,000 state transitions | $1 per million executions + duration |
| Execution start rate | Over 2,000 per second (quota is increasable) | Over 100,000 per second (designed for high volume, but not unlimited) |
| Use cases | Long-running, auditable workflows | High-volume, short-duration workflows |
| Async Express | N/A | Starts the execution and returns immediately; results are available only through CloudWatch Logs, with no built-in callback |
| Sync Express | N/A | Start and wait for result (up to 5 min) |
Standard workflows charge per state transition, which adds up quickly for Map states iterating over thousands of items. A Map state with 10,000 items running 5 states each produces roughly 50,000 transitions, which at $0.025 per 1,000 costs about $1.25 per execution. For high-volume iteration where each run finishes within the 5-minute limit, Express workflows are far cheaper.
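The arithmetic above can be reproduced with a small Python sketch. The prices are the ones quoted in this document's tables. The Express figures (one Express execution per item, 1 second each, 64 MB of memory) are assumptions chosen purely for illustration, and the estimate ignores billing granularity and tiered duration rates.

```python
STANDARD_PRICE_PER_TRANSITION = 0.025 / 1000     # $0.025 per 1,000 transitions
EXPRESS_PRICE_PER_REQUEST = 1.00 / 1_000_000     # $1.00 per million requests
EXPRESS_PRICE_PER_GB_SECOND = 0.00001667         # duration charge


def standard_cost(items: int, states_per_item: int) -> float:
    """Cost of a Standard workflow that runs states_per_item states per item."""
    return items * states_per_item * STANDARD_PRICE_PER_TRANSITION


def express_cost(executions: int, avg_seconds: float, memory_gb: float) -> float:
    """Simplified Express estimate: request charge plus duration charge."""
    request_cost = executions * EXPRESS_PRICE_PER_REQUEST
    duration_cost = executions * avg_seconds * memory_gb * EXPRESS_PRICE_PER_GB_SECOND
    return request_cost + duration_cost


standard = standard_cost(items=10_000, states_per_item=5)
# Assumed workload: one Express execution per item, 1 s each, 64 MB (0.0625 GB).
express = express_cost(executions=10_000, avg_seconds=1.0, memory_gb=0.0625)

print(f"Standard: ${standard:.2f}")
print(f"Express:  ${express:.4f}")
```

Under these assumptions the Express estimate comes out well below the Standard figure, but the gap depends heavily on execution duration and memory, so it is worth plugging in measured values.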
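SDK Integrations and Optimized Integrations
Step Functions can call over 200 AWS services directly, without Lambda functions as glue code. These calls come in two flavors: AWS SDK integrations, which expose service API actions directly, and optimized integrations, which Step Functions has customized for a smaller set of services such as Lambda, ECS, and DynamoDB.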
| Integration Pattern | Behavior | Use When |
|---|---|---|
| Run a Job (.sync) | Calls the service and waits for the job to complete | ECS tasks, Glue jobs, SageMaker training |
| Request-Response (default) | Calls the service and moves to the next state as soon as the API call returns, without waiting for the work to finish | Fire-and-forget, async work |
| Wait for callback (.waitForTaskToken) | Pauses until callback with taskToken is sent | Human approval, async third-party API |
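In a Task state, the integration pattern is selected by a suffix on the Resource ARN. The Python sketch below shows how the three patterns map to ARNs; integration_resource and the pattern keys are hypothetical names used only for illustration, and not every service supports every pattern.

```python
# Suffix appended to the Resource ARN for each integration pattern.
PATTERN_SUFFIXES = {
    "request_response": "",                       # default: continue once the API call returns
    "run_a_job": ".sync",                         # wait for the job to finish
    "wait_for_callback": ".waitForTaskToken",     # pause until the task token is returned
}


def integration_resource(service: str, action: str, pattern: str = "request_response") -> str:
    """Build the Resource ARN for a Step Functions service integration."""
    return f"arn:aws:states:::{service}:{action}{PATTERN_SUFFIXES[pattern]}"


print(integration_resource("sqs", "sendMessage"))
print(integration_resource("ecs", "runTask", "run_a_job"))
print(integration_resource("lambda", "invoke", "wait_for_callback"))
```

The last call yields the same Resource value that the human approval example uses.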
# Wait for a human approval callback pattern
# In your state machine:
"WaitForApproval": {
  "Type": "Task",
  "Resource": "arn:aws:states:::lambda:invoke.waitForTaskToken",
  "Parameters": {
    "FunctionName": "send-approval-email",
    "Payload": {
      "taskToken.$": "$$.Task.Token",
      "orderId.$": "$.orderId"
    }
  },
  "HeartbeatSeconds": 86400
}
# In your approval Lambda or HTTP handler:
aws stepfunctions send-task-success \
  --task-token "AQCEAAAAKgAAAAMAAAAAAAAAATVm..." \
  --task-output '{"approved": true, "approvedBy": "manager@co.com"}'
The waitForTaskToken pattern is well suited to human-in-the-loop workflows. The task token can be delivered through any channel; a common choice is an SES email with approve/reject links that carry the token. The state stays paused until the callback arrives, a heartbeat or timeout limit on the task expires, or the execution reaches the 1-year maximum for Standard workflows. With HeartbeatSeconds set to 86400 as above, the task times out unless a heartbeat or the final callback arrives within 24 hours. Express workflows do not support this pattern.
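The callback side can also be written in Python. The sketch below assumes a client object with the same send_task_success and send_task_failure methods as boto3.client("stepfunctions"). RecordingClient is a local stand-in so the example runs without AWS credentials, and complete_approval, the ApprovalRejected error name, and the payload fields are illustrative choices rather than anything prescribed by Step Functions.

```python
import json


class RecordingClient:
    """Stand-in for boto3.client("stepfunctions") that records calls locally."""

    def __init__(self):
        self.calls = []

    def send_task_success(self, **kwargs):
        self.calls.append(("send_task_success", kwargs))

    def send_task_failure(self, **kwargs):
        self.calls.append(("send_task_failure", kwargs))


def complete_approval(sfn_client, task_token: str, approved: bool, approver: str) -> None:
    """Resume the paused execution with either a success or a failure callback."""
    if approved:
        sfn_client.send_task_success(
            taskToken=task_token,
            output=json.dumps({"approved": True, "approvedBy": approver}),
        )
    else:
        # The error name can be matched by a Catch block in the state machine.
        sfn_client.send_task_failure(
            taskToken=task_token,
            error="ApprovalRejected",
            cause=f"Rejected by {approver}",
        )


client = RecordingClient()
complete_approval(client, "example-token", True, "manager@co.com")
print(client.calls[0][0])  # send_task_success
```

In a real handler, the RecordingClient instance would be replaced by the boto3 Step Functions client and the token would come from the approve/reject link.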
Error Handling, Retries, and Compensation
Step Functions has built-in retry and catch mechanisms at the state level. This is one of its most powerful features - you define retry behavior declaratively instead of in code.
| Error Type | Description | Example |
|---|---|---|
| States.ALL | Wildcard that matches any error name (except States.DataLimitExceeded and States.Runtime errors) | Catch-all fallback |
| States.TaskFailed | Task threw an exception | Lambda threw an error |
| States.Timeout | Task exceeded TimeoutSeconds | Lambda took too long |
| States.HeartbeatTimeout | No heartbeat received | ECS task stopped sending heartbeats |
| Custom errors | Errors thrown by your Lambda code | 'InsufficientFundsError' |
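For a Python Lambda function, the error name that Retry and Catch blocks match against is the name of the raised exception class. The sketch below is a hypothetical handler; the balance and amount event fields are assumptions made for illustration.

```python
class InsufficientFundsError(Exception):
    """Raised when the account balance cannot cover the charge."""


def handler(event, context):
    """Hypothetical Lambda handler; 'balance' and 'amount' are assumed event fields."""
    balance = event["balance"]
    amount = event["amount"]
    if amount > balance:
        # Step Functions sees this as the error name "InsufficientFundsError",
        # so "ErrorEquals": ["InsufficientFundsError"] will match it.
        raise InsufficientFundsError(f"balance {balance} is below charge {amount}")
    return {"charged": amount, "remaining": balance - amount}


print(handler({"balance": 100, "amount": 30}, None))  # {'charged': 30, 'remaining': 70}
```

Keeping one exception class per business failure lets the state machine route each case to its own recovery path.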
// Retry with exponential backoff + catch for specific errors
"ChargeCard": {
  "Type": "Task",
  "Resource": "arn:aws:lambda:us-east-1:123456789012:function:charge-card",
  "TimeoutSeconds": 30,
  "Retry": [
    {
      "ErrorEquals": ["TransientError", "States.Timeout"],
      "IntervalSeconds": 1,
      "MaxAttempts": 3,
      "BackoffRate": 2,
      "JitterStrategy": "FULL"
    }
  ],
  "Catch": [
    {
      "ErrorEquals": ["InsufficientFundsError"],
      "Next": "NotifyCustomerInsufficient",
      "ResultPath": "$.error"
    },
    {
      "ErrorEquals": ["States.ALL"],
      "Next": "HandleUnexpectedError"
    }
  ],
  "Next": "SendConfirmation"
}
Use JitterStrategy: FULL on retry blocks. It randomizes each retry delay, which prevents thundering herd problems when many parallel executions fail at the same moment and would otherwise retry in lockstep.
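The retry schedule that these fields produce can be sketched in a few lines of Python. This is an illustration of the documented behavior, not Step Functions' actual implementation: retry_delays is a hypothetical helper, and it ignores MaxDelaySeconds.

```python
import random


def retry_delays(interval_seconds, max_attempts, backoff_rate, jitter="NONE", rng=random.random):
    """Return the wait before each retry attempt, in seconds.

    Without jitter, the n-th retry waits interval_seconds * backoff_rate ** n.
    With jitter="FULL", each wait is drawn uniformly between 0 and that value.
    """
    delays = []
    for attempt in range(max_attempts):
        delay = interval_seconds * backoff_rate ** attempt
        if jitter == "FULL":
            delay = rng() * delay
        delays.append(delay)
    return delays


# Same settings as the ChargeCard retrier: 1 s interval, 3 attempts, rate 2.
print(retry_delays(1, 3, 2))                 # [1, 2, 4]
print(retry_delays(1, 3, 2, jitter="FULL"))  # random values bounded by 1, 2, 4
```

Without jitter, every failed execution retries after exactly 1, 2, and 4 seconds; with full jitter the retries spread out across each window.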
Pricing and When to Use Step Functions
| Workflow Type | Pricing Component | Price |
|---|---|---|
| Standard | State transitions | $0.025 per 1,000 transitions |
| Standard | Free tier | 4,000 transitions/month free |
| Express | Requests | $1.00 per million requests |
| Express | Duration | $0.00001667 per GB-second |
| Express | Free tier | None; the free tier covers Standard state transitions only |
Step Functions adds cost but eliminates complex orchestration code, sidesteps Lambda's 15-minute timeout for long workflows, provides built-in retry logic, and gives you a visual debugger. The value is in operational simplicity, not raw compute cost.
Do not use Step Functions for simple sequential Lambda chains where SQS + Lambda would work. The overhead - both in complexity and cost - is only justified when you need branching logic, parallel execution, wait states, error handling with retries, or workflows longer than 15 minutes.
Interview Focus Points
1. When would you use Step Functions instead of a Lambda function calling other Lambdas directly?
2. Explain the difference between Standard and Express workflows and when you'd choose each.
3. How would you implement a human approval step in a Step Functions workflow?
4. What is the waitForTaskToken pattern and how does it work?
5. How would you handle a situation where one step in a 10-step workflow fails? Explain retry and catch.
6. How does the Map state work and what are the cost implications for large arrays?
7. What are SDK integrations in Step Functions and why do they reduce the need for Lambda?
8. How would you implement a saga pattern (distributed transaction with compensation) using Step Functions?