Step Functions

Visual workflow orchestration for microservices and distributed applications

AWS Step Functions is a serverless visual workflow service that orchestrates AWS services and Lambda functions into multi-step, stateful workflows. It handles the state management, error handling, retry logic, and branching that would otherwise require complex custom code. Step Functions is essential for building reliable distributed systems, data processing pipelines, and microservice orchestration on AWS.

State Machines, States, and Amazon States Language

Step Functions workflows are defined as state machines using Amazon States Language (ASL), a JSON-based language. Each step in the workflow is a state, and states are linked by transitions.

State Type	Purpose	Example Use Case
Task	Invoke a Lambda function, ECS task, DynamoDB operation, or one of 200+ AWS service integrations	Process payment, validate input
Choice	Branch execution based on input data (if/else)	Route based on order type
Parallel	Execute multiple branches concurrently	Run credit check + inventory check simultaneously
Map	Iterate over an array, running states for each element	Process each item in an order
Wait	Pause execution for a fixed time or until timestamp	Wait 5 minutes before retry
Pass	Pass input to output with optional transformation	Shape data between steps
Succeed	Terminate execution as success	End of a branch
Fail	Terminate execution as failure with error and cause	Unrecoverable error path

json

// Minimal state machine definition (ASL)
{
  "Comment": "Order processing workflow",
  "StartAt": "ValidateOrder",
  "States": {
    "ValidateOrder": {
      "Type": "Task",
      "Resource": "arn:aws:lambda:us-east-1:123456789012:function:validate-order",
      "Next": "CheckInventory",
      "Retry": [{
        "ErrorEquals": ["Lambda.ServiceException", "Lambda.TooManyRequestsException"],
        "IntervalSeconds": 2,
        "MaxAttempts": 3,
        "BackoffRate": 2
      }],
      "Catch": [{
        "ErrorEquals": ["ValidationError"],
        "Next": "OrderFailed"
      }]
    },
    "CheckInventory": {
      "Type": "Task",
      "Resource": "arn:aws:states:::dynamodb:getItem",
      "Parameters": {
        "TableName": "inventory",
        "Key": {"productId": {"S.$": "$.productId"}}
      },
      "ResultPath": "$.inventory",
      "Next": "ProcessPayment"
    },
    "ProcessPayment": {"Type": "Task", "Resource": "arn:aws:lambda:us-east-1:123456789012:function:process-payment", "End": true},
    "OrderFailed": {"Type": "Fail", "Error": "OrderFailed", "Cause": "Validation failed"}
  }
}
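Because states are linked only by name, a typo in a `Next` target is an easy mistake to make. The following is a minimal Python sketch of a transition check, not a full ASL validator: the function name `find_dangling_transitions` is hypothetical, and it looks only at top-level states (it does not descend into Parallel branches or Map iterators).

```python
def find_dangling_transitions(definition: dict) -> list:
    """Return readable problems for transitions that point at unknown states."""
    states = definition.get("States", {})
    problems = []

    start = definition.get("StartAt")
    if start not in states:
        problems.append(f"StartAt -> {start!r} is not a defined state")

    for name, state in states.items():
        targets = []
        if "Next" in state:
            targets.append(state["Next"])
        # Catch handlers and Choice rules carry their own Next targets.
        for handler in state.get("Catch", []):
            targets.append(handler.get("Next"))
        for rule in state.get("Choices", []):
            targets.append(rule.get("Next"))
        if "Default" in state:
            targets.append(state["Default"])

        for target in targets:
            if target not in states:
                problems.append(f"{name} -> {target!r} is not a defined state")

        # Succeed, Fail, and Choice states do not use a top-level Next/End.
        exempt = state.get("Type") in ("Succeed", "Fail", "Choice")
        if not exempt and "Next" not in state and not state.get("End"):
            problems.append(f"{name} has neither Next nor End")

    return problems


# A trimmed copy of the order workflow (Retry and Parameters omitted).
definition = {
    "StartAt": "ValidateOrder",
    "States": {
        "ValidateOrder": {
            "Type": "Task",
            "Resource": "arn:aws:lambda:us-east-1:123456789012:function:validate-order",
            "Next": "CheckInventory",
            "Catch": [{"ErrorEquals": ["ValidationError"], "Next": "OrderFailed"}],
        },
        "CheckInventory": {
            "Type": "Task",
            "Resource": "arn:aws:states:::dynamodb:getItem",
            "Next": "ProcessPayment",
        },
        "ProcessPayment": {
            "Type": "Task",
            "Resource": "arn:aws:lambda:us-east-1:123456789012:function:process-payment",
            "End": True,
        },
        "OrderFailed": {"Type": "Fail", "Error": "OrderFailed", "Cause": "Validation failed"},
    },
}

print(find_dangling_transitions(definition))  # prints []
```

A check like this can run in CI before deployment; Step Functions itself also rejects definitions with unreachable or missing states when the state machine is created.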

Standard vs Express Workflows

Step Functions offers two workflow types with fundamentally different execution models, pricing, and use cases.

Feature	Standard Workflow	Express Workflow
Max duration	1 year	5 minutes
Execution model	Exactly-once workflow execution	At-least-once (asynchronous) or at-most-once (synchronous)
Execution history	Stored in Step Functions (90 days)	Must push to CloudWatch Logs
Pricing	$0.025 per 1,000 state transitions	$1 per million executions + duration
Execution start rate	Over 2,000 per second	Over 100,000 per second
Use cases	Long-running, auditable workflows	High-volume, short-duration workflows
Async Express	N/A	Start and return immediately; results go to CloudWatch Logs, with no built-in callback
Sync Express	N/A	Start and wait for result (up to 5 min)

⚠️

Standard workflows charge per state transition, which adds up quickly for Map states iterating over thousands of items. A Map state with 10,000 items running 5 states each incurs roughly 50,000 transitions, or about $1.25 per execution. For high-volume iteration, Express workflows are usually far cheaper.
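The arithmetic above can be sketched in Python to compare the two pricing models. The prices are the ones quoted in this section; the Express scenario (one execution per item, 2 seconds each, billed at 64 MB) is a hypothetical assumption, and the sketch ignores the free tier and any billing-granularity rounding.

```python
# Prices as quoted in this section (USD).
STANDARD_PRICE_PER_TRANSITION = 0.025 / 1_000
EXPRESS_PRICE_PER_REQUEST = 1.00 / 1_000_000
EXPRESS_PRICE_PER_GB_SECOND = 0.00001667


def standard_cost(items: int, states_per_item: int) -> float:
    """Cost of a Standard workflow that runs states_per_item states for each item."""
    return items * states_per_item * STANDARD_PRICE_PER_TRANSITION


def express_cost(executions: int, avg_duration_seconds: float, memory_mb: int = 64) -> float:
    """Cost of Express executions: a per-request charge plus a duration charge."""
    gb_seconds = executions * avg_duration_seconds * (memory_mb / 1024)
    return executions * EXPRESS_PRICE_PER_REQUEST + gb_seconds * EXPRESS_PRICE_PER_GB_SECOND


# 10,000 items, 5 states each, as in the example above.
print(f"Standard: ${standard_cost(10_000, 5):.2f}")   # Standard: $1.25

# Assumed scenario: each item handled by one 2-second Express execution.
print(f"Express:  ${express_cost(10_000, 2.0):.4f}")  # Express:  $0.0308
```

The gap narrows as Express executions get longer or use more memory, so it is worth plugging in measured durations before choosing.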

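SDK Integrations and Optimized Integrations

Step Functions can call over 200 AWS services directly, without Lambda functions as glue code. These direct calls come in two flavors: AWS SDK integrations, which expose service API actions, and optimized integrations, which add conveniences such as waiting for a job to finish.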

Integration Pattern	Behavior	Use When
Run a Job (.sync)	Calls service, waits for job to complete	ECS tasks, Glue jobs, SageMaker training
Request-Response (default)	Calls service, moves to the next state once the API call returns (does not wait for the job itself)	Fire-and-forget, async work
Wait for callback (.waitForTaskToken)	Pauses until callback with taskToken is sent	Human approval, async third-party API
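The integration pattern is selected by a suffix on the Task state's `Resource` ARN. As a small illustration, the Python sketch below assembles such ARNs; the helper name `build_resource_arn` and the pattern keys are hypothetical, and not every service supports every pattern (some also use a versioned suffix such as `.sync:2`).

```python
# Resource ARN suffix for each integration pattern.
PATTERN_SUFFIX = {
    "request_response": "",          # default: continue once the API call returns
    "run_job": ".sync",              # wait for the job to complete
    "callback": ".waitForTaskToken", # pause until the task token is returned
}


def build_resource_arn(service: str, action: str, pattern: str = "request_response") -> str:
    """Build a Step Functions service-integration Resource ARN."""
    if pattern not in PATTERN_SUFFIX:
        raise ValueError(f"unknown integration pattern: {pattern!r}")
    return f"arn:aws:states:::{service}:{action}{PATTERN_SUFFIX[pattern]}"


print(build_resource_arn("dynamodb", "getItem"))
# arn:aws:states:::dynamodb:getItem
print(build_resource_arn("ecs", "runTask", "run_job"))
# arn:aws:states:::ecs:runTask.sync
print(build_resource_arn("lambda", "invoke", "callback"))
# arn:aws:states:::lambda:invoke.waitForTaskToken
```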

bash

# Wait for a human approval callback pattern
# In your state machine:
"WaitForApproval": {
  "Type": "Task",
  "Resource": "arn:aws:states:::lambda:invoke.waitForTaskToken",
  "Parameters": {
    "FunctionName": "send-approval-email",
    "Payload": {
      "taskToken.$": "$$.Task.Token",
      "orderId.$": "$.orderId"
    }
  },
  "TimeoutSeconds": 86400,
  "End": true
}

# In your approval Lambda or HTTP handler:
aws stepfunctions send-task-success \
  --task-token "AQCEAAAAKgAAAAMAAAAAAAAAATVm..." \
  --task-output '{"approved": true, "approvedBy": "manager@co.com"}'

💡

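The waitForTaskToken pattern is powerful for human-in-the-loop workflows and is available only in Standard workflows. One common approach is to send the task token via SES email, embedded in approve/reject links. Without a TimeoutSeconds value, the workflow stays paused until the callback arrives, up to the 1-year limit of a Standard workflow; setting TimeoutSeconds bounds the wait.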
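The callback side of this pattern can be sketched in Python. The function `complete_approval`, the `ApprovalRejected` error name, and the `RecordingClient` stand-in are hypothetical; in a real handler the client would be `boto3.client("stepfunctions")`, whose `send_task_success` and `send_task_failure` methods take the keyword arguments used here.

```python
import json


def complete_approval(sfn_client, task_token: str, approved: bool, approver: str) -> str:
    """Resume a paused execution and return the name of the API call that was made."""
    if approved:
        sfn_client.send_task_success(
            taskToken=task_token,
            output=json.dumps({"approved": True, "approvedBy": approver}),
        )
        return "send_task_success"

    # A rejection fails the task; a Catch block on the state can route it onward.
    sfn_client.send_task_failure(
        taskToken=task_token,
        error="ApprovalRejected",  # hypothetical custom error name
        cause=f"Rejected by {approver}",
    )
    return "send_task_failure"


class RecordingClient:
    """Stand-in for boto3.client("stepfunctions") so the sketch runs without AWS access."""

    def __init__(self):
        self.calls = []

    def send_task_success(self, **kwargs):
        self.calls.append(("send_task_success", kwargs))

    def send_task_failure(self, **kwargs):
        self.calls.append(("send_task_failure", kwargs))


client = RecordingClient()
complete_approval(client, "example-token", approved=True, approver="manager@co.com")
print(client.calls[0][0])  # send_task_success
```

An alternative design is to always call `send_task_success` with `"approved": false` and branch in a Choice state, which keeps a rejection out of the error path.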

Error Handling, Retries, and Compensation

Step Functions has built-in retry and catch mechanisms at the state level. This is one of its most powerful features - you define retry behavior declaratively instead of in code.

Error Type	Description	Example
States.ALL	Catch all errors	Catch-all fallback
States.TaskFailed	Task failed during execution; in Retry and Catch it matches any error except States.Timeout	Lambda threw an error
States.Timeout	Task exceeded TimeoutSeconds	Lambda took too long
States.HeartbeatTimeout	No heartbeat received	ECS task stopped sending heartbeats
Custom errors	Errors thrown by your Lambda code	'InsufficientFundsError'

json

// Retry with exponential backoff + catch for specific errors
"ChargeCard": {
  "Type": "Task",
  "Resource": "arn:aws:lambda:us-east-1:123456789012:function:charge-card",
  "TimeoutSeconds": 30,
  "Retry": [
    {
      "ErrorEquals": ["TransientError", "States.Timeout"],
      "IntervalSeconds": 1,
      "MaxAttempts": 3,
      "BackoffRate": 2,
      "JitterStrategy": "FULL"
    }
  ],
  "Catch": [
    {
      "ErrorEquals": ["InsufficientFundsError"],
      "Next": "NotifyCustomerInsufficient",
      "ResultPath": "$.error"
    },
    {
      "ErrorEquals": ["States.ALL"],
      "Next": "HandleUnexpectedError"
    }
  ],
  "Next": "SendConfirmation"
}

💡

Always use JitterStrategy: FULL on retry blocks to prevent thundering herd problems when many parallel executions fail simultaneously and retry at the same interval.
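To see what a retry block produces, the Python sketch below computes the wait before each retry from `IntervalSeconds`, `MaxAttempts`, and `BackoffRate`. The function name `retry_delays` is hypothetical, and the jitter model is an assumption: it follows the common "full jitter" scheme of drawing uniformly between 0 and the computed backoff, which may not match the service's exact behavior.

```python
import random


def retry_delays(interval_seconds, max_attempts, backoff_rate, full_jitter=False, rng=None):
    """Return the wait (in seconds) before each retry attempt."""
    rng = rng or random.Random()
    delays = []
    for attempt in range(max_attempts):
        # Exponential backoff: interval, interval * rate, interval * rate^2, ...
        base = interval_seconds * (backoff_rate ** attempt)
        # Assumed full jitter: pick a random wait between 0 and the backoff value.
        delays.append(rng.uniform(0, base) if full_jitter else base)
    return delays


# The ChargeCard retry block: IntervalSeconds=1, MaxAttempts=3, BackoffRate=2.
print(retry_delays(1, 3, 2))  # [1, 2, 4]

# With jitter, parallel executions no longer retry in lockstep.
print(retry_delays(1, 3, 2, full_jitter=True, rng=random.Random(42)))
```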

Pricing and When to Use Step Functions

Workflow Type	Pricing Component	Price
Standard	State transitions	$0.025 per 1,000 transitions
Standard	Free tier	4,000 transitions/month free
Express	Executions	$1.00 per million requests
Express	Duration	$0.00001667 per GB-second (first pricing tier)
Express	Free tier	Not included in the free tier (it covers Standard transitions only)

Step Functions adds cost but eliminates complex orchestration code, lets long workflows run beyond Lambda's 15-minute timeout, provides built-in retry logic, and gives you a visual debugger. The value is in operational simplicity, not raw compute cost.

⚠️

Do not use Step Functions for simple sequential Lambda chains where SQS + Lambda would work. The overhead - both in complexity and cost - is only justified when you need branching logic, parallel execution, wait states, error handling with retries, or workflows longer than 15 minutes.

🎯

Interview Focus Points

1. When would you use Step Functions instead of a Lambda function calling other Lambdas directly?
2. Explain the difference between Standard and Express workflows and when you'd choose each.
3. How would you implement a human approval step in a Step Functions workflow?
4. What is the waitForTaskToken pattern and how does it work?
5. How would you handle a situation where one step in a 10-step workflow fails? Explain retry and catch.
6. How does the Map state work and what are the cost implications for large arrays?
7. What are SDK integrations in Step Functions and why do they reduce the need for Lambda?
8. How would you implement a saga pattern (distributed transaction with compensation) using Step Functions?