Fault Injection Simulator

Chaos engineering service to run controlled fault injection experiments on AWS

AWS Fault Injection Simulator (FIS), since renamed AWS Fault Injection Service, is a managed chaos engineering service for running controlled fault injection experiments on AWS infrastructure. It can inject CPU stress, terminate instances, throttle API calls, disrupt networks, and more, so you can validate that your systems are resilient before real failures occur. Its main safety mechanism is the stop condition, which halts an experiment when a CloudWatch alarm fires; some actions also restore the original state when they end. Together these help prevent experiments from cascading into actual outages.

Core Concepts: Experiments, Actions, Targets, and Stop Conditions

FIS experiments are defined as templates that describe what fault to inject, where to inject it, and when to stop automatically.

Concept	Description	Example
Experiment template	Reusable definition of an experiment	Terminate 50% of the EC2 instances in an Auto Scaling group
Action	The fault to inject (AWS-provided or custom)	aws:ec2:terminate-instances
Target	Which resources to apply the action to	EC2 instances tagged with Environment=prod
Selection mode	How many resources to select from target	COUNT(2), PERCENT(50), ALL
Stop condition	CloudWatch alarm that halts the experiment if breached	Alarm: Error5xxRate > 5%
Duration	How long to maintain the fault	PT2M (2 minutes) in ISO 8601 format
Start after	Dependency between actions (start this action only after the listed actions have completed)	Start network disruption after the CPU stress action completes
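
To make the table concrete, the following is a minimal Python sketch that assembles an experiment template as a plain dictionary, using the field names of the FIS template JSON. The account ID, role name, alarm name, and tag value are hypothetical placeholders, and nothing is sent to AWS; the dictionary is only printed.

```python
import json


def build_template(role_arn, alarm_arn):
    """Assemble an FIS experiment template as a plain dict (nothing is sent to AWS)."""
    return {
        "description": "Stop half of the tagged dev instances for two minutes",
        "roleArn": role_arn,
        # Target: which resources, and how many of them (selection mode).
        "targets": {
            "targetInstances": {
                "resourceType": "aws:ec2:instance",
                "resourceTags": {"Environment": "dev"},
                "selectionMode": "PERCENT(50)",
            }
        },
        # Action: the fault to inject, pointing at the target by name.
        "actions": {
            "stop-instances": {
                "actionId": "aws:ec2:stop-instances",
                # ISO 8601 duration: restart the instances after 2 minutes.
                "parameters": {"startInstancesAfterDuration": "PT2M"},
                "targets": {"Instances": "targetInstances"},
            }
        },
        # Stop condition: a CloudWatch alarm that halts the experiment.
        "stopConditions": [
            {"source": "aws:cloudwatch:alarm", "value": alarm_arn}
        ],
    }


if __name__ == "__main__":
    # Hypothetical ARNs, for illustration only.
    template = build_template(
        "arn:aws:iam::111122223333:role/fis-experiment-role",
        "arn:aws:cloudwatch:us-east-1:111122223333:alarm:Error5xxRate",
    )
    print(json.dumps(template, indent=2))
```

Keeping the template as data makes it easy to review in version control before anyone creates it in an account.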

💡

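Stop conditions are critical safety mechanisms. Always define a CloudWatch alarm (e.g., on error rate, latency, or a custom metric) as a stop condition. If the alarm enters the ALARM state during the experiment, FIS stops the experiment and ends any running actions. Stopping does not undo irreversible faults: a terminated instance stays terminated, so recovery still depends on mechanisms such as Auto Scaling. Running experiments without stop conditions is dangerous in production.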
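
One way to enforce this rule is a small pre-flight check run before a template is created. The sketch below is a hypothetical helper, not part of any AWS SDK; it assumes the template is a dictionary in the FIS JSON shape, where a stop condition with source "none" means no alarm is attached.

```python
def missing_stop_condition(template):
    """Return True when the template has no CloudWatch alarm stop condition."""
    conditions = template.get("stopConditions", [])
    return not any(
        c.get("source") == "aws:cloudwatch:alarm" and c.get("value")
        for c in conditions
    )


if __name__ == "__main__":
    unsafe = {"stopConditions": [{"source": "none"}]}
    if missing_stop_condition(unsafe):
        print("Refusing to create template: no CloudWatch alarm stop condition")
```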

Available Fault Injection Actions

FIS provides pre-built actions across multiple AWS services:

Category	Action	What It Does
EC2	aws:ec2:terminate-instances	Terminates selected EC2 instances
EC2	aws:ec2:stop-instances	Stops (not terminates) instances
EC2	aws:ec2:reboot-instances	Reboots instances
EC2	aws:ec2:send-spot-instance-interruptions	Simulates spot interruption notice + termination
ECS	aws:ecs:drain-container-instances	Drains tasks from container instances
ECS	aws:ecs:stop-task	Stops ECS tasks
EKS	aws:eks:terminate-nodegroup-instances	Terminates nodes in a node group
RDS	aws:rds:failover-db-cluster	Triggers RDS failover to replica
RDS	aws:rds:reboot-db-instances	Reboots DB instances
Network	aws:network:disrupt-connectivity	Blocks network traffic (ACL injection)
Network	aws:network:route-table-disrupt-cross-region-connectivity	Blocks traffic from target subnets that is destined for a specified Region
SSM	aws:ssm:send-command	Run SSM document (CPU/memory stress, kill processes)
CloudWatch	aws:cloudwatch:assert-alarm-state	Assert an alarm is in expected state

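For OS-level faults (CPU stress, memory pressure, disk fill, process kill), FIS uses SSM Run Command to execute stress-ng or custom scripts on EC2 instances. This requires the SSM Agent to be running on the target instances, an instance profile that lets the instances be managed by Systems Manager, and SSM permissions on the FIS service role.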

Example: CPU stress through the FIS SSM action, shown as the actions portion of an experiment template:

json

{
  "actions": {
    "cpu-stress": {
      "actionId": "aws:ssm:send-command",
      "parameters": {
        "documentArn": "arn:aws:ssm:us-east-1::document/AWSFIS-Run-CPU-Stress",
        "documentParameters": "{\"CPU\":\"0\",\"DurationSeconds\":\"120\",\"InstallDependencies\":\"True\"}",
        "duration": "PT3M"
      },
      "targets": { "Instances": "targetInstances" }
    }
  }
}
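
The documentParameters value is a JSON document encoded as a string inside the template, so hand-written escaping is easy to get wrong. The sketch below builds the same action in Python and lets json.dumps do the escaping. It also checks that the action duration is at least as long as DurationSeconds, on the assumption that you want the stress run to finish inside the action window. The function names are hypothetical, and the duration parser only handles simple hour/minute/second forms such as PT3M.

```python
import json
import re


def parse_iso_duration_seconds(value):
    """Parse a simple ISO 8601 duration such as PT3M or PT1H30M into seconds."""
    match = re.fullmatch(r"PT(?:(\d+)H)?(?:(\d+)M)?(?:(\d+)S)?", value)
    if not match or not any(match.groups()):
        raise ValueError(f"unsupported duration: {value!r}")
    hours, minutes, seconds = (int(g) if g else 0 for g in match.groups())
    return hours * 3600 + minutes * 60 + seconds


def cpu_stress_action(stress_seconds, action_duration, target_name):
    """Build an aws:ssm:send-command action that runs AWSFIS-Run-CPU-Stress."""
    if parse_iso_duration_seconds(action_duration) < stress_seconds:
        raise ValueError("action duration is shorter than the stress run")
    document_parameters = {
        "CPU": "0",  # 0 means stress all CPUs
        "DurationSeconds": str(stress_seconds),
        "InstallDependencies": "True",
    }
    return {
        "actionId": "aws:ssm:send-command",
        "parameters": {
            "documentArn": "arn:aws:ssm:us-east-1::document/AWSFIS-Run-CPU-Stress",
            # json.dumps produces the escaped string form FIS expects here.
            "documentParameters": json.dumps(document_parameters),
            "duration": action_duration,
        },
        "targets": {"Instances": target_name},
    }


if __name__ == "__main__":
    action = cpu_stress_action(120, "PT3M", "targetInstances")
    print(json.dumps({"actions": {"cpu-stress": action}}, indent=2))
```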

Designing Safe Chaos Experiments

Effective chaos engineering follows a structured methodology. Randomly injecting faults without a hypothesis and monitoring plan can cause real outages without generating useful data.

Step	Activity	FIS Mechanism
1. Define hypothesis	State expected behavior during fault	Document as experiment description
2. Define steady state	Identify metrics that define normal operation	CloudWatch alarms and dashboards
3. Set stop conditions	Define thresholds that halt experiment	Stop conditions (CloudWatch alarms)
4. Start small	Run in dev/staging before production	Target filters by tag (Environment=dev)
5. Increase blast radius	Gradually expand scope	Increase PERCENT from 10% to 25% to 50%
6. Observe and record	Monitor system behavior during fault	CloudWatch, X-Ray, application logs
7. Fix weaknesses	Improve system based on findings	Engineering work between experiments

💡

Use resource tags to scope experiments. Tag test-eligible resources with FISTarget=true and configure FIS targets to only select resources with that tag. This prevents experiments from accidentally touching resources that should not be disrupted.
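
Tag scoping and the gradual increase in blast radius from step 5 can be combined. The following sketch generates one tag-scoped EC2 target definition per stage; the function name and the default stage percentages are illustrative assumptions, and each stage would be run as its own experiment only after the previous one passed.

```python
def staged_targets(percentages=(10, 25, 50)):
    """Return one FISTarget-scoped EC2 target definition per blast-radius stage."""
    stages = []
    for pct in percentages:
        if not 0 < pct <= 100:
            raise ValueError(f"percentage out of range: {pct}")
        stages.append(
            {
                "resourceType": "aws:ec2:instance",
                # Only resources explicitly opted in with this tag are eligible.
                "resourceTags": {"FISTarget": "true"},
                "selectionMode": f"PERCENT({pct})",
            }
        )
    return stages


if __name__ == "__main__":
    for stage in staged_targets():
        print(stage["selectionMode"])
```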

IAM Permissions and the FIS Service Role

FIS requires an IAM role that grants it permission to perform fault actions on your resources. The principle of least privilege is especially important here - the role should only have permissions for the specific faults you plan to run.

Example FIS service role permissions for EC2 and RDS experiments:

json

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "ec2:TerminateInstances",
        "ec2:StopInstances",
        "ec2:RebootInstances"
      ],
      "Resource": "arn:aws:ec2:*:*:instance/*",
      "Condition": {
        "StringEquals": {"ec2:ResourceTag/FISTarget": "true"}
      }
    },
    {
      "Effect": "Allow",
      "Action": [
        "rds:FailoverDBCluster",
        "rds:RebootDBInstance"
      ],
      "Resource": "*"
    },
    {
      "Effect": "Allow",
      "Action": [
        "ssm:SendCommand",
        "ssm:ListCommands",
        "ssm:CancelCommand",
        "ssm:GetCommandInvocation"
      ],
      "Resource": "*"
    }
  ]
}

The trust policy for the FIS service role:

json

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {"Service": "fis.amazonaws.com"},
      "Action": "sts:AssumeRole"
    }
  ]
}

⚠️

Do not use overly broad FIS service roles in production environments. A role with ec2:TerminateInstances on "*" without resource tag conditions could terminate any instance in your account if the experiment targets are misconfigured. Always scope permissions with resource tag conditions.
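
A policy review of this kind can be partly automated. The sketch below is a hypothetical lint, not an AWS tool: it flags Allow statements that grant a disruptive action without any resource-tag condition. The list of disruptive actions is an assumption drawn from the example policy in this section, and the check matches exact action names only, so wildcards such as ec2:* are not detected.

```python
# Actions treated as disruptive; taken from the example policy in this section.
DISRUPTIVE_ACTIONS = {
    "ec2:TerminateInstances",
    "ec2:StopInstances",
    "ec2:RebootInstances",
    "rds:FailoverDBCluster",
    "rds:RebootDBInstance",
}


def unscoped_statements(policy):
    """Return the disruptive actions of each Allow statement lacking a tag condition."""
    findings = []
    for statement in policy.get("Statement", []):
        if statement.get("Effect") != "Allow":
            continue
        actions = statement.get("Action", [])
        if isinstance(actions, str):
            actions = [actions]
        risky = sorted(DISRUPTIVE_ACTIONS.intersection(actions))
        if not risky:
            continue
        # Collect condition keys such as "ec2:ResourceTag/FISTarget".
        condition_keys = [
            key
            for operator_block in statement.get("Condition", {}).values()
            for key in operator_block
        ]
        if not any("ResourceTag/" in key for key in condition_keys):
            findings.append(risky)
    return findings


if __name__ == "__main__":
    policy = {
        "Statement": [
            {"Effect": "Allow", "Action": "ec2:TerminateInstances", "Resource": "*"}
        ]
    }
    print(unscoped_statements(policy))
```

Run against the example policy above, this check would pass the EC2 statement and flag the RDS statement, which allows failover and reboot on any resource.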

Common Chaos Engineering Scenarios

The following experiments are commonly run with FIS:

Scenario	Actions Used	What You Learn
ASG instance failure	Terminate 50% of ASG instances	Does the ASG replace the lost instances? Does the ALB reroute?
AZ failure simulation	Terminate all instances in one AZ	Is the app truly multi-AZ resilient?
RDS primary failover	aws:rds:failover-db-cluster	How long does failover take? Does app reconnect?
Spot interruption	aws:ec2:send-spot-instance-interruptions	Does app handle 2-minute warning correctly?
CPU pressure	SSM AWSFIS-Run-CPU-Stress	Does auto-scaling trigger? Does app degrade gracefully?
Network latency	aws:ssm:send-command + AWSFIS-Run-Network-Latency	Does circuit breaker trip? Do timeouts work?
ECS task failure	aws:ecs:stop-task	Does ECS restart tasks? Does service stay healthy?
Dependency throttling	aws:ssm:send-command + tc/iptables	Does app handle downstream throttling?
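
For the AZ failure scenario, the target needs to be narrowed to one Availability Zone. A minimal sketch, assuming the target filter paths Placement.AvailabilityZone and State.Name from the EC2 instance description and the FISTarget tag convention used earlier; the function name and the example zone are placeholders.

```python
def az_instances_target(availability_zone, selection_mode="ALL"):
    """Build an FIS target that selects running, opted-in instances in one AZ."""
    return {
        "resourceType": "aws:ec2:instance",
        "resourceTags": {"FISTarget": "true"},
        "filters": [
            # Restrict the fault to a single Availability Zone.
            {"path": "Placement.AvailabilityZone", "values": [availability_zone]},
            # Skip instances that are already stopped or terminating.
            {"path": "State.Name", "values": ["running"]},
        ],
        "selectionMode": selection_mode,
    }


if __name__ == "__main__":
    print(az_instances_target("us-east-1a"))
```

Pairing this target with aws:ec2:terminate-instances or aws:ec2:stop-instances approximates the loss of compute in one zone; it does not reproduce other effects of a real AZ outage, such as failed managed services in that zone.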

💡

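FIS supports multi-account experiments. An orchestrator account owns the experiment template, and each target account is added through a target account configuration that names an IAM role FIS can assume in that account. This is useful in multi-account environments where you want to test cross-account dependencies or run the same experiment against several accounts in a controlled way.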

🎯

Interview Focus Points

1. What is chaos engineering and why is it important for production reliability?
2. What are stop conditions in FIS and why are they critical to configure?
3. How do you scope a FIS experiment to only affect specific resources?
4. What is the methodology for designing a safe chaos experiment from hypothesis to fix?
5. How would you simulate an Availability Zone failure using FIS?
6. What IAM permissions does FIS need and how do you scope them safely?
7. How does FIS integrate with SSM to inject OS-level faults?
8. How would you test that your application handles RDS failover correctly?
9. What is the difference between stopping an experiment and rolling back an action in FIS?
10. How do you prevent chaos experiments from accidentally affecting production resources they should not touch?