
AWS Developer Tools & CI/CD

Fault Injection Simulator

Chaos engineering service to run controlled fault injection experiments on AWS

AWS Fault Injection Simulator (FIS), since renamed AWS Fault Injection Service, is a managed chaos engineering service for running controlled fault injection experiments on AWS infrastructure - injecting CPU stress, terminating instances, throttling API calls, disrupting networks, and more - to validate that your systems are resilient before real failures occur. Its main safety mechanism is the stop condition, which halts a running experiment when a CloudWatch alarm fires so that the experiment does not cascade into an actual outage.

Core Concepts: Experiments, Actions, Targets, and Stop Conditions

FIS experiments are defined as templates that describe what fault to inject, where to inject it, and when to stop automatically.

| Concept | Description | Example |
|---|---|---|
| Experiment template | Reusable definition of an experiment | Terminate 50% of EC2 instances in an ASG |
| Action | The fault to inject (AWS-provided or custom) | aws:ec2:terminate-instances |
| Target | Which resources the action applies to | EC2 instances tagged with Environment=prod |
| Selection mode | How many of the matching resources to select | COUNT(2), PERCENT(50), ALL |
| Stop condition | CloudWatch alarm that halts the experiment if breached | Alarm: Error5xxRate > 5% |
| Duration | How long to maintain the fault | PT2M (2 minutes) in ISO 8601 format |
| Start after | Dependency between actions (run this action after another completes) | Start network disruption after CPU stress completes |
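Durations such as PT2M are easy to mistype, so it can help to check them before they go into a template. The following is a minimal sketch that converts time-only ISO 8601 durations to seconds; the function name `duration_to_seconds` is hypothetical, and the pattern only covers the PTnHnMnS form shown in the table, not the full ISO 8601 grammar.

```python
import re

# Time-only ISO 8601 durations, e.g. PT2M or PT1H30M (hours, minutes, seconds).
_DURATION = re.compile(r"^PT(?:(\d+)H)?(?:(\d+)M)?(?:(\d+)S)?$")


def duration_to_seconds(value: str) -> int:
    """Convert a duration such as 'PT2M' to seconds; raise ValueError if malformed."""
    match = _DURATION.match(value)
    # Reject strings that do not match, and a bare 'PT' with no components.
    if not match or not any(match.groups()):
        raise ValueError(f"not a PTnHnMnS duration: {value!r}")
    hours, minutes, seconds = (int(g) if g else 0 for g in match.groups())
    return hours * 3600 + minutes * 60 + seconds


print(duration_to_seconds("PT2M"))     # 120
print(duration_to_seconds("PT1H30M"))  # 5400
```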
💡

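Stop conditions are critical safety mechanisms. Always define a CloudWatch alarm (e.g., on error rate, latency, or a custom metric) as a stop condition. If the alarm goes into the ALARM state during the experiment, FIS stops the experiment and ends any running actions. Reversible faults such as CPU stress or a network disruption are lifted at that point, but irreversible ones are not: a terminated instance stays terminated. Running experiments without stop conditions is dangerous in production.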
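To show how the concepts above fit together, here is a minimal sketch that assembles an experiment template body as a Python dictionary and refuses to proceed unless it carries a CloudWatch alarm stop condition. The helper names, the account ID, the alarm name, and the role name are all hypothetical placeholders; the field names follow the experiment template structure used elsewhere on this page.

```python
import json

ALARM_SOURCE = "aws:cloudwatch:alarm"


def build_template(alarm_arn: str, role_arn: str) -> dict:
    """Assemble an experiment template body: one target, one action, one stop condition."""
    return {
        "description": "Hypothesis: the service stays healthy when 2 tagged instances stop",
        "targets": {
            "targetInstances": {
                "resourceType": "aws:ec2:instance",
                "resourceTags": {"Environment": "dev"},
                "selectionMode": "COUNT(2)",
            }
        },
        "actions": {
            "stop-instances": {
                "actionId": "aws:ec2:stop-instances",
                # Restart the stopped instances after two minutes.
                "parameters": {"startInstancesAfterDuration": "PT2M"},
                "targets": {"Instances": "targetInstances"},
            }
        },
        "stopConditions": [{"source": ALARM_SOURCE, "value": alarm_arn}],
        "roleArn": role_arn,
    }


def has_alarm_stop_condition(template: dict) -> bool:
    """Return True if at least one stop condition points at a CloudWatch alarm."""
    return any(
        cond.get("source") == ALARM_SOURCE and cond.get("value")
        for cond in template.get("stopConditions", [])
    )


# Placeholder ARNs for illustration only.
template = build_template(
    alarm_arn="arn:aws:cloudwatch:us-east-1:111122223333:alarm:Error5xxRate",
    role_arn="arn:aws:iam::111122223333:role/FisExperimentRole",
)
assert has_alarm_stop_condition(template)
print(json.dumps(template, indent=2))
```

A check like `has_alarm_stop_condition` could run in a review or CI step before a template is created, so that a template without an alarm never reaches production.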

Available Fault Injection Actions

FIS provides pre-built actions across multiple AWS services:

| Category | Action | What It Does |
|---|---|---|
| EC2 | aws:ec2:terminate-instances | Terminates selected EC2 instances |
| EC2 | aws:ec2:stop-instances | Stops (does not terminate) instances |
| EC2 | aws:ec2:reboot-instances | Reboots instances |
| EC2 | aws:ec2:send-spot-instance-interruptions | Sends a Spot interruption notice, then interrupts the instance |
| ECS | aws:ecs:drain-container-instances | Drains tasks from container instances |
| ECS | aws:ecs:stop-task | Stops ECS tasks |
| EKS | aws:eks:terminate-nodegroup-instances | Terminates nodes in a node group |
| RDS | aws:rds:failover-db-cluster | Triggers failover of a DB cluster to a replica |
| RDS | aws:rds:reboot-db-instances | Reboots DB instances |
| Network | aws:network:disrupt-connectivity | Blocks network traffic to target subnets (network ACL injection) |
| Network | aws:network:route-table-disrupt-cross-region-connectivity | Blocks traffic from target subnets that is destined for another Region |
| SSM | aws:ssm:send-command | Runs an SSM document (CPU/memory stress, kill processes) |
| CloudWatch | aws:cloudwatch:assert-alarm-state | Asserts that an alarm is in the expected state |

For OS-level faults (CPU stress, memory pressure, disk fill, process kill), FIS uses SSM Run Command to execute stress-ng or custom scripts on EC2 instances. This requires the SSM agent and appropriate IAM permissions.

Example: CPU stress through the FIS SSM action, shown as a fragment of an experiment template:

json
{
  "actions": {
    "cpu-stress": {
      "actionId": "aws:ssm:send-command",
      "parameters": {
        "documentArn": "arn:aws:ssm:us-east-1::document/AWSFIS-Run-CPU-Stress",
        "documentParameters": "{\"CPU\":\"0\",\"DurationSeconds\":\"120\",\"InstallDependencies\":\"True\"}",
        "duration": "PT3M"
      },
      "targets": { "Instances": "targetInstances" }
    }
  }
}
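The `documentParameters` value in the fragment above is itself a JSON string nested inside JSON, which makes hand-written escaping error-prone. As a hedged sketch, the action can be generated in Python so that `json.dumps` handles the escaping. The function name `cpu_stress_action` is hypothetical, and padding the action duration by one minute beyond the stress duration is an assumption chosen to reproduce the PT3M value above, not a documented rule.

```python
import json


def cpu_stress_action(duration_seconds: int, target_name: str) -> dict:
    """Build an aws:ssm:send-command action that runs AWSFIS-Run-CPU-Stress."""
    document_parameters = {
        "CPU": "0",
        "DurationSeconds": str(duration_seconds),
        "InstallDependencies": "True",
    }
    # Round the stress time up to whole minutes, then add one minute of headroom
    # (assumption) so the action outlasts the SSM command it starts.
    minutes = -(-duration_seconds // 60) + 1
    return {
        "actionId": "aws:ssm:send-command",
        "parameters": {
            "documentArn": "arn:aws:ssm:us-east-1::document/AWSFIS-Run-CPU-Stress",
            # json.dumps produces the escaped string that FIS expects here.
            "documentParameters": json.dumps(document_parameters),
            "duration": f"PT{minutes}M",
        },
        "targets": {"Instances": target_name},
    }


action = cpu_stress_action(120, "targetInstances")
print(json.dumps({"actions": {"cpu-stress": action}}, indent=2))
```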

Designing Safe Chaos Experiments

Effective chaos engineering follows a structured methodology. Randomly injecting faults without a hypothesis and monitoring plan can cause real outages without generating useful data.

| Step | Activity | FIS Mechanism |
|---|---|---|
| 1. Define hypothesis | State expected behavior during the fault | Document as experiment description |
| 2. Define steady state | Identify metrics that define normal operation | CloudWatch alarms and dashboards |
| 3. Set stop conditions | Define thresholds that halt the experiment | Stop conditions (CloudWatch alarms) |
| 4. Start small | Run in dev/staging before production | Target filters by tag (Environment=dev) |
| 5. Increase blast radius | Gradually expand scope | Increase PERCENT from 10% to 25% to 50% |
| 6. Observe and record | Monitor system behavior during the fault | CloudWatch, X-Ray, application logs |
| 7. Fix weaknesses | Improve the system based on findings | Engineering work between experiments |
💡

Use resource tags to scope experiments. Tag test-eligible resources with FISTarget=true and configure FIS targets to only select resources with that tag. This prevents experiments from accidentally touching resources that should not be disrupted.
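Steps 4 and 5 and the tagging tip can be combined into one small helper. The sketch below builds a target definition that is always scoped by the FISTarget=true tag and an environment tag, and takes the blast radius as a percentage. The function name `scoped_target` is hypothetical, and it assumes a resource must carry every listed tag to be selected.

```python
def scoped_target(environment: str, percent: int) -> dict:
    """Build an EC2 target limited to opted-in instances in one environment."""
    if not 1 <= percent <= 100:
        raise ValueError("percent must be between 1 and 100")
    return {
        "resourceType": "aws:ec2:instance",
        # Only instances explicitly opted in with FISTarget=true are eligible.
        "resourceTags": {"FISTarget": "true", "Environment": environment},
        "selectionMode": f"PERCENT({percent})",
    }


# Expand the blast radius in stages, one experiment run per stage.
STAGES = [10, 25, 50]
for pct in STAGES:
    print(scoped_target("dev", pct)["selectionMode"])
```

Each stage would be run only after the previous one completed without tripping a stop condition.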

IAM Permissions and the FIS Service Role

FIS requires an IAM role that grants it permission to perform fault actions on your resources. The principle of least privilege is especially important here - the role should only have permissions for the specific faults you plan to run.

Example FIS service role permissions for EC2, RDS, and SSM experiments:

json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "ec2:TerminateInstances",
        "ec2:StopInstances",
        "ec2:RebootInstances"
      ],
      "Resource": "arn:aws:ec2:*:*:instance/*",
      "Condition": {
        "StringEquals": {"ec2:ResourceTag/FISTarget": "true"}
      }
    },
    {
      "Effect": "Allow",
      "Action": [
        "rds:FailoverDBCluster",
        "rds:RebootDBInstance"
      ],
      "Resource": "*"
    },
    {
      "Effect": "Allow",
      "Action": [
        "ssm:SendCommand",
        "ssm:ListCommands",
        "ssm:CancelCommand"
      ],
      "Resource": "*"
    }
  ]
}

The trust policy for the FIS service role:

json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {"Service": "fis.amazonaws.com"},
      "Action": "sts:AssumeRole"
    }
  ]
}
⚠️

Do not use overly broad FIS service roles in production environments. A role with ec2:TerminateInstances on "*" without resource tag conditions could terminate any instance in your account if the experiment targets are misconfigured. Always scope permissions with resource tag conditions.
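The warning above can be turned into an automated check. The following is a rough sketch of a policy lint that flags Allow statements granting disruptive EC2 actions without any resource tag condition. The function name and the `DESTRUCTIVE` set are hypothetical, and the check is deliberately simple: it compares exact action names only, so wildcard actions such as `ec2:*` would need extra handling.

```python
DESTRUCTIVE = {"ec2:TerminateInstances", "ec2:StopInstances", "ec2:RebootInstances"}


def unscoped_statements(policy: dict) -> list:
    """Return the disruptive actions of each Allow statement that lacks a tag condition."""
    findings = []
    statements = policy.get("Statement", [])
    if isinstance(statements, dict):  # IAM allows a single statement object
        statements = [statements]
    for stmt in statements:
        if stmt.get("Effect") != "Allow":
            continue
        actions = stmt.get("Action", [])
        if isinstance(actions, str):
            actions = [actions]
        risky = DESTRUCTIVE.intersection(actions)
        # Collect condition keys across all operators, e.g. "ec2:ResourceTag/FISTarget".
        condition_keys = {
            key for operator in stmt.get("Condition", {}).values() for key in operator
        }
        has_tag_condition = any("ResourceTag/" in key for key in condition_keys)
        if risky and not has_tag_condition:
            findings.append(sorted(risky))
    return findings


broad = {
    "Version": "2012-10-17",
    "Statement": [
        {"Effect": "Allow", "Action": "ec2:TerminateInstances", "Resource": "*"}
    ],
}
print(unscoped_statements(broad))
```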

Common Chaos Engineering Scenarios

The following experiments are commonly run with FIS:

| Scenario | Actions Used | What You Learn |
|---|---|---|
| ASG instance failure | Terminate 50% of ASG instances | Does the ASG launch replacements? Does the ALB reroute? |
| AZ failure simulation | Terminate all instances in one AZ | Is the app truly multi-AZ resilient? |
| RDS primary failover | aws:rds:failover-db-cluster | How long does failover take? Does the app reconnect? |
| Spot interruption | aws:ec2:send-spot-instance-interruptions | Does the app handle the 2-minute warning correctly? |
| CPU pressure | aws:ssm:send-command with AWSFIS-Run-CPU-Stress | Does auto-scaling trigger? Does the app degrade gracefully? |
| Network latency | aws:ssm:send-command with AWSFIS-Run-Network-Latency | Does the circuit breaker trip? Do timeouts work? |
| ECS task failure | aws:ecs:stop-task | Does ECS restart tasks? Does the service stay healthy? |
| Dependency throttling | aws:ssm:send-command + tc/iptables | Does the app handle downstream throttling? |
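For the AZ failure scenario in the table above, the target can be narrowed to one Availability Zone with resource filters. This is a minimal sketch with a hypothetical helper name, `az_target`; it assumes the filter paths follow the attribute names of the EC2 instance description and that the opted-in instances carry the FISTarget=true tag.

```python
def az_target(availability_zone: str) -> dict:
    """Build a target matching every running, opted-in instance in one AZ."""
    return {
        "resourceType": "aws:ec2:instance",
        "resourceTags": {"FISTarget": "true"},
        "filters": [
            # Restrict selection to the AZ under test.
            {"path": "Placement.AvailabilityZone", "values": [availability_zone]},
            # Skip instances that are already stopped or shutting down.
            {"path": "State.Name", "values": ["running"]},
        ],
        # ALL: every matching instance is affected, which is what simulates losing the AZ.
        "selectionMode": "ALL",
    }


target = az_target("us-east-1a")
print(target["filters"][0])
```

Paired with aws:ec2:terminate-instances or aws:ec2:stop-instances, this target approximates the compute side of an AZ outage; it does not by itself disrupt networking or managed services in that AZ.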
💡

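FIS supports multi-account experiments. An orchestrator account owns the experiment template, and each target account is added through a target account configuration that names an IAM role FIS can assume in that account. This is useful for multi-account environments where you want to test cross-account dependencies or run the same experiment across several accounts in a controlled way.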

🎯

Interview Focus Points

  1. What is chaos engineering and why is it important for production reliability?
  2. What are stop conditions in FIS and why are they critical to configure?
  3. How do you scope a FIS experiment to only affect specific resources?
  4. What is the methodology for designing a safe chaos experiment from hypothesis to fix?
  5. How would you simulate an Availability Zone failure using FIS?
  6. What IAM permissions does FIS need and how do you scope them safely?
  7. How does FIS integrate with SSM to inject OS-level faults?
  8. How would you test that your application handles RDS failover correctly?
  9. What is the difference between stopping an experiment and rolling back an action in FIS?
  10. How do you prevent chaos experiments from accidentally affecting production resources they should not touch?