Fault Injection Simulator
Chaos engineering service to run controlled fault injection experiments on AWS
AWS Fault Injection Simulator (FIS), since renamed AWS Fault Injection Service, is a managed chaos engineering service for running controlled fault injection experiments on AWS infrastructure: injecting CPU stress, terminating instances, throttling API calls, disrupting networks, and more. The goal is to validate that your systems are resilient before real failures occur. It provides safety mechanisms, including stop conditions and, for some actions, automatic rollback when the action ends, to keep experiments from cascading into actual outages.
Core Concepts: Experiments, Actions, Targets, and Stop Conditions
FIS experiments are defined as templates that describe what fault to inject, where to inject it, and when to stop automatically.
| Concept | Description | Example |
|---|---|---|
| Experiment template | Reusable definition of an experiment | Kill 50% of EC2 instances in ASG |
| Action | The fault to inject (AWS-provided or custom) | aws:ec2:terminate-instances |
| Target | Which resources to apply the action to | EC2 instances tagged with Environment=prod |
| Selection mode | How many resources to select from target | COUNT(2), PERCENT(50), ALL |
| Stop condition | CloudWatch alarm that halts the experiment if breached | Alarm: Error5xxRate > 5% |
| Duration | How long to maintain the fault | PT2M (2 minutes) in ISO 8601 format |
| Start after | Dependency between actions (run this action only after the listed actions complete) | Start network disruption after CPU stress completes |
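Stop conditions are critical safety mechanisms. Always define a CloudWatch alarm (e.g., on error rate, latency, or a custom metric) as a stop condition. If the alarm enters the ALARM state during the experiment, FIS stops the experiment and halts any running actions. Faults that FIS applied temporarily (such as a network disruption) are reverted, but irreversible actions such as instance termination are not undone. Running experiments without stop conditions is dangerous in production.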
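The concepts in the table can be put together in a minimal sketch. It builds a template as a plain Python dictionary and checks that a CloudWatch alarm stop condition is present before the template is used. The function names, the role ARN, and the alarm ARN are illustrative assumptions; the field names follow the FIS experiment template structure.

```python
def build_experiment_template(role_arn, alarm_arn):
    """Return a dict shaped like an FIS experiment template (illustrative values)."""
    return {
        "description": "Hypothesis: service stays healthy when 50% of tagged instances stop",
        "roleArn": role_arn,
        "targets": {
            "targetInstances": {
                "resourceType": "aws:ec2:instance",
                "resourceTags": {"FISTarget": "true"},
                "selectionMode": "PERCENT(50)",
            }
        },
        "actions": {
            "stop-instances": {
                "actionId": "aws:ec2:stop-instances",
                # Restart the instances after two minutes (ISO 8601 duration).
                "parameters": {"startInstancesAfterDuration": "PT2M"},
                "targets": {"Instances": "targetInstances"},
            }
        },
        "stopConditions": [
            {"source": "aws:cloudwatch:alarm", "value": alarm_arn}
        ],
    }


def has_alarm_stop_condition(template):
    """True if at least one stop condition points at a CloudWatch alarm."""
    return any(
        cond.get("source") == "aws:cloudwatch:alarm" and cond.get("value")
        for cond in template.get("stopConditions", [])
    )


template = build_experiment_template(
    "arn:aws:iam::123456789012:role/fis-role",  # hypothetical role
    "arn:aws:cloudwatch:us-east-1:123456789012:alarm:Error5xxRate",  # hypothetical alarm
)
if not has_alarm_stop_condition(template):
    raise SystemExit("Refusing to continue: no CloudWatch alarm stop condition")
```

A dictionary of this shape could then be passed to the `create_experiment_template` call of the boto3 `fis` client. The guard is deliberately a local check, so a template without a stop condition is rejected before any AWS call is made.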
Available Fault Injection Actions
FIS provides pre-built actions across multiple AWS services:
| Category | Action | What It Does |
|---|---|---|
| EC2 | aws:ec2:terminate-instances | Terminates selected EC2 instances |
| EC2 | aws:ec2:stop-instances | Stops (not terminates) instances |
| EC2 | aws:ec2:reboot-instances | Reboots instances |
| EC2 | aws:ec2:send-spot-instance-interruptions | Simulates spot interruption notice + termination |
| ECS | aws:ecs:drain-container-instances | Drains tasks from container instances |
| ECS | aws:ecs:stop-task | Stops ECS tasks |
| EKS | aws:eks:terminate-nodegroup-instances | Terminates nodes in a node group |
| RDS | aws:rds:failover-db-cluster | Triggers RDS failover to replica |
| RDS | aws:rds:reboot-db-instances | Reboots DB instances |
| Network | aws:network:disrupt-connectivity | Blocks network traffic (ACL injection) |
| Network | aws:network:route-table-disrupt-cross-region-connectivity | Blocks traffic from target subnets that is destined for a specified Region |
| SSM | aws:ssm:send-command | Run SSM document (CPU/memory stress, kill processes) |
| CloudWatch | aws:cloudwatch:assert-alarm-state | Assert an alarm is in expected state |
For OS-level faults (CPU stress, memory pressure, disk fill, process kill), FIS uses SSM Run Command to execute stress-ng or custom scripts on EC2 instances. This requires the SSM agent and appropriate IAM permissions.
Example: CPU stress via the FIS SSM action in an experiment template

```json
{
  "actions": {
    "cpu-stress": {
      "actionId": "aws:ssm:send-command",
      "parameters": {
        "documentArn": "arn:aws:ssm:us-east-1::document/AWSFIS-Run-CPU-Stress",
        "documentParameters": "{\"CPU\":\"0\",\"DurationSeconds\":\"120\",\"InstallDependencies\":\"True\"}",
        "duration": "PT3M"
      },
      "targets": { "Instances": "targetInstances" }
    }
  }
}
```

Designing Safe Chaos Experiments
Effective chaos engineering follows a structured methodology. Randomly injecting faults without a hypothesis and monitoring plan can cause real outages without generating useful data.
| Step | Activity | FIS Mechanism |
|---|---|---|
| 1. Define hypothesis | State expected behavior during fault | Document as experiment description |
| 2. Define steady state | Identify metrics that define normal operation | CloudWatch alarms and dashboards |
| 3. Set stop conditions | Define thresholds that halt experiment | Stop conditions (CloudWatch alarms) |
| 4. Start small | Run in dev/staging before production | Target filters by tag (Environment=dev) |
| 5. Increase blast radius | Gradually expand scope | Increase PERCENT from 10% to 25% to 50% |
| 6. Observe and record | Monitor system behavior during fault | CloudWatch, X-Ray, application logs |
| 7. Fix weaknesses | Improve system based on findings | Engineering work between experiments |
Use resource tags to scope experiments. Tag test-eligible resources with FISTarget=true and configure FIS targets to only select resources with that tag. This prevents experiments from accidentally touching resources that should not be disrupted.
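Steps 4 and 5 and the tag-scoping advice can be sketched together. The helper below produces a target definition restricted to resources tagged FISTarget=true, and a ramp of such targets with a growing PERCENT selection. The function names and the default percentages are assumptions chosen to match the table above, not an FIS API.

```python
def tagged_target(percent):
    """Target definition limited to opt-in instances, selecting `percent` of them."""
    if not 0 < percent <= 100:
        raise ValueError(f"percent must be in (0, 100], got {percent}")
    return {
        "resourceType": "aws:ec2:instance",
        # Only resources carrying this tag are eligible for selection.
        "resourceTags": {"FISTarget": "true"},
        "selectionMode": f"PERCENT({percent})",
    }


def blast_radius_ramp(percents=(10, 25, 50)):
    """One target definition per stage, smallest blast radius first."""
    return [tagged_target(p) for p in sorted(percents)]


for stage, target in enumerate(blast_radius_ramp(), start=1):
    print(stage, target["selectionMode"])
```

Each stage would typically become its own experiment run, with the next stage attempted only after the previous one confirmed the hypothesis.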
IAM Permissions and the FIS Service Role
FIS requires an IAM role that grants it permission to perform fault actions on your resources. The principle of least privilege is especially important here - the role should only have permissions for the specific faults you plan to run.
Example FIS service role permissions for EC2, RDS, and SSM experiments:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "ec2:TerminateInstances",
        "ec2:StopInstances",
        "ec2:RebootInstances"
      ],
      "Resource": "arn:aws:ec2:*:*:instance/*",
      "Condition": {
        "StringEquals": {"ec2:ResourceTag/FISTarget": "true"}
      }
    },
    {
      "Effect": "Allow",
      "Action": [
        "rds:FailoverDBCluster",
        "rds:RebootDBInstance"
      ],
      "Resource": "*"
    },
    {
      "Effect": "Allow",
      "Action": [
        "ssm:SendCommand",
        "ssm:ListCommands",
        "ssm:CancelCommand",
        "ssm:GetCommandInvocation"
      ],
      "Resource": "*"
    }
  ]
}
```

The trust policy for the FIS service role:
```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {"Service": "fis.amazonaws.com"},
      "Action": "sts:AssumeRole"
    }
  ]
}
```

Do not use overly broad FIS service roles in production environments. A role with ec2:TerminateInstances on "*" without resource tag conditions could terminate any instance in your account if the experiment targets are misconfigured. Always scope permissions with resource tag conditions. The RDS and SSM statements in the example permissions policy use "*" for brevity; in production, narrow them to specific cluster, instance, and document ARNs.
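The warning above can be turned into a simple pre-deployment check. The sketch below scans a policy document for Allow statements that grant disruptive EC2 actions without any resource tag condition. The function name and the set of actions treated as risky are assumptions for illustration. It matches action names literally, so wildcards such as ec2:* are not detected.

```python
# Actions treated as disruptive for this check (an illustrative, incomplete set).
DISRUPTIVE_ACTIONS = {
    "ec2:TerminateInstances",
    "ec2:StopInstances",
    "ec2:RebootInstances",
}


def unscoped_statements(policy):
    """Return the disruptive actions of each Allow statement lacking a tag condition."""
    findings = []
    for stmt in policy.get("Statement", []):
        if stmt.get("Effect") != "Allow":
            continue
        actions = stmt.get("Action", [])
        if isinstance(actions, str):
            actions = [actions]
        risky = DISRUPTIVE_ACTIONS.intersection(actions)
        if not risky:
            continue
        # Look for any condition key such as "ec2:ResourceTag/FISTarget".
        conditions = stmt.get("Condition", {})
        has_tag_condition = any(
            "ResourceTag/" in key
            for operator_block in conditions.values()
            for key in operator_block
        )
        if not has_tag_condition:
            findings.append(sorted(risky))
    return findings


broad_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {"Effect": "Allow", "Action": "ec2:TerminateInstances", "Resource": "*"}
    ],
}
print(unscoped_statements(broad_policy))
```

A check like this could run in CI against the role's policy document, failing the build when the returned list is non-empty.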
Common Chaos Engineering Scenarios
Commonly run FIS experiments include:
| Scenario | Actions Used | What You Learn |
|---|---|---|
| ASG instance failure | Terminate 50% of ASG instances | Does the ASG replace the lost instances? Does the ALB reroute? |
| AZ failure simulation | Terminate all instances in one AZ | Is the app truly multi-AZ resilient? |
| RDS primary failover | aws:rds:failover-db-cluster | How long does failover take? Does app reconnect? |
| Spot interruption | aws:ec2:send-spot-instance-interruptions | Does app handle 2-minute warning correctly? |
| CPU pressure | SSM AWSFIS-Run-CPU-Stress | Does auto-scaling trigger? Does app degrade gracefully? |
| Network disruption or latency | aws:network:disrupt-connectivity (blocks traffic) or SSM AWSFIS-Run-Network-Latency (adds latency) | Does circuit breaker trip? Do timeouts work? |
| ECS task failure | aws:ecs:stop-task | Does ECS restart tasks? Does service stay healthy? |
| Dependency throttling | aws:ssm:send-command + tc/iptables | Does app handle downstream throttling? |
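As an illustration of the AZ failure scenario in the table, the sketch below builds a target that selects every running, opted-in instance in a single Availability Zone, plus an action that stops them for a fixed period. The function names and the zone name are assumptions. Stopping is used here instead of terminating so that the fault is reversible, which is a substitution made for this example.

```python
def az_failure_target(availability_zone):
    """Select all running, opted-in EC2 instances in one Availability Zone."""
    return {
        "resourceType": "aws:ec2:instance",
        "resourceTags": {"FISTarget": "true"},
        "filters": [
            {"path": "Placement.AvailabilityZone", "values": [availability_zone]},
            {"path": "State.Name", "values": ["running"]},
        ],
        "selectionMode": "ALL",
    }


def az_failure_action(target_name, outage="PT10M"):
    """Stop the selected instances and restart them after `outage` (ISO 8601)."""
    return {
        "actionId": "aws:ec2:stop-instances",
        "parameters": {"startInstancesAfterDuration": outage},
        "targets": {"Instances": target_name},
    }


targets = {"azInstances": az_failure_target("us-east-1a")}  # hypothetical zone
actions = {"simulate-az-outage": az_failure_action("azInstances")}
```

These two dictionaries would form the targets and actions sections of an experiment template. A fuller AZ simulation would usually also disrupt subnet connectivity in the same zone.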
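FIS also supports multi-account experiments. An orchestrator account owns the experiment template, and each target account is registered with an IAM role that FIS can assume there. This is useful for multi-account environments (for example, those managed with AWS Organizations) where you want to test cross-account dependencies or run the same experiment across several member accounts in a controlled way.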
Interview Focus Points
1. What is chaos engineering and why is it important for production reliability?
2. What are stop conditions in FIS and why are they critical to configure?
3. How do you scope a FIS experiment to only affect specific resources?
4. What is the methodology for designing a safe chaos experiment from hypothesis to fix?
5. How would you simulate an Availability Zone failure using FIS?
6. What IAM permissions does FIS need and how do you scope them safely?
7. How does FIS integrate with SSM to inject OS-level faults?
8. How would you test that your application handles RDS failover correctly?
9. What is the difference between stopping an experiment and rolling back an action in FIS?
10. How do you prevent chaos experiments from accidentally affecting production resources they should not touch?