Top 10 AWS Interview Mistakes
Published 21 June 2026 by Ace Cloud Interviews
AWS questions catch out candidates who know the service names but have not operated them in production. The gaps are consistent: IAM, networking, observability, and availability come up in almost every mid-to-senior AWS interview. These are the 10 mistakes that cost engineers AWS roles, and what to say instead.
Saying "I'd create an IAM user" for application access
What candidates say
“For my EC2 instance to access S3, I'd create an IAM user and store the access key on the instance.”
Why interviewers mark this down
Access keys on EC2 instances are a security anti-pattern, and interviewers testing AWS security will flag this immediately. If the instance is compromised, the key is compromised with it. If the key is accidentally committed to Git, it stays valid until someone rotates or deactivates it. EC2 instances can instead assume an IAM role through the instance metadata service, receiving temporary credentials that AWS rotates automatically, typically every few hours. Storing long-lived access keys on instances goes against the AWS Well-Architected Security Pillar, which recommends temporary credentials over long-lived keys.
What to say instead
Say: "EC2 instances should assume an IAM role via the instance profile. The application fetches temporary credentials from the instance metadata service, with IMDSv2 enforced - the AWS SDKs do this automatically. The role has the minimum required permissions, scoped to specific resources, with no wildcard actions. No keys to rotate, no keys to leak. The same principle applies to Lambda (execution role), ECS tasks (task role), and EKS pods via IRSA or EKS Pod Identity."
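A minimal sketch of the two policy documents behind such a role, assuming a hypothetical bucket named example-app-data and a hypothetical prefix reports/: a trust policy that lets the EC2 service assume the role, and a permissions policy limited to reading that one prefix. The has_wildcard_action helper is a hypothetical review check written for this example, not an AWS tool.

```python
import json

# Hypothetical bucket and prefix, for illustration only.
BUCKET = "example-app-data"
PREFIX = "reports/"

# Trust policy: allows the EC2 service to assume the role for the instance profile.
trust_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {"Service": "ec2.amazonaws.com"},
            "Action": "sts:AssumeRole",
        }
    ],
}

# Permissions policy: read objects under one prefix, list only that prefix.
permissions_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "ReadReportObjects",
            "Effect": "Allow",
            "Action": ["s3:GetObject"],
            "Resource": [f"arn:aws:s3:::{BUCKET}/{PREFIX}*"],
        },
        {
            "Sid": "ListReportPrefix",
            "Effect": "Allow",
            "Action": ["s3:ListBucket"],
            "Resource": [f"arn:aws:s3:::{BUCKET}"],
            "Condition": {"StringLike": {"s3:prefix": [f"{PREFIX}*"]}},
        },
    ],
}


def has_wildcard_action(policy: dict) -> bool:
    """Hypothetical check: True if any Allow statement grants '*' or a whole-service 'svc:*'."""
    for stmt in policy["Statement"]:
        if stmt.get("Effect") != "Allow":
            continue
        actions = stmt.get("Action", [])
        if isinstance(actions, str):
            actions = [actions]
        if any(a == "*" or a.endswith(":*") for a in actions):
            return True
    return False


print(json.dumps(permissions_policy, indent=2))
print("wildcard actions:", has_wildcard_action(permissions_policy))
```

The application code itself carries no keys: an SDK client created without explicit credentials, such as boto3.client("s3"), picks up the role's temporary credentials from the instance metadata service through the default credential chain.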
Saying you'd use the default VPC without knowing why it is wrong
What candidates say
“I'd launch the EC2 instance into the default VPC to get started quickly.”
Why interviewers mark this down
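The default VPC has a public subnet in every Availability Zone: each routes to an internet gateway, and instances launched into it receive public IP addresses by default. Its default network ACL allows all inbound and outbound traffic, and its default security group allows all outbound traffic plus inbound traffic from any other member of the same group. For anything beyond a quick experiment, this is not acceptable. Interviewers asking about VPC design want to hear that you design the right network topology from the start, not that you accept whatever AWS gives you.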
What to say instead
Say: "For any production workload I always create a custom VPC with a three-tier layout spread across at least two Availability Zones: public subnets for load balancers only, private app subnets for compute, and private database subnets for RDS. The default VPC's public-by-default behaviour is a security risk for production resources. I provision the VPC with Terraform or CloudFormation so it is repeatable across environments."
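The subnet arithmetic behind a three-tier layout can be sketched with Python's standard ipaddress module. This assumes an illustrative 10.0.0.0/16 VPC, three AZs, and /20 subnets; it is the same kind of calculation Terraform's cidrsubnet function performs, not a prescribed AWS plan. The DEFAULT_ROUTE mapping is just a label per tier showing that only the public tier routes directly to an internet gateway.

```python
import ipaddress

# Assumed layout for illustration: a /16 VPC, three AZs, three tiers, /20 subnets.
VPC_CIDR = ipaddress.ip_network("10.0.0.0/16")
AZS = ["a", "b", "c"]
TIERS = ["public", "app", "db"]

# Default route per tier: only the public tier reaches the internet gateway directly.
DEFAULT_ROUTE = {"public": "internet-gateway", "app": "nat-gateway", "db": None}


def plan_subnets(vpc, azs, tiers, new_prefix=20):
    """Assign one equal-sized block per (tier, AZ) pair, in order, from the VPC CIDR."""
    blocks = vpc.subnets(new_prefix=new_prefix)  # generator of non-overlapping blocks
    return {(tier, az): next(blocks) for tier in tiers for az in azs}


plan = plan_subnets(VPC_CIDR, AZS, TIERS)
for (tier, az), cidr in plan.items():
    print(f"{tier:<7} az-{az}  {cidr}  default route: {DEFAULT_ROUTE[tier]}")
```

Leaving the remaining seven /20 blocks unallocated keeps room to add a fourth AZ or another tier later without renumbering.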
Saying S3 is eventually consistent
What candidates say
“S3 is eventually consistent, so reads immediately after a write might return stale data.”
Why interviewers mark this down
This was true before December 2020, when overwrites, deletes, and LIST results could return stale data. In December 2020 AWS moved S3 to strong read-after-write consistency for all objects in all regions, at no extra cost: a GET issued after a successful PUT returns the new data. Candidates who repeat the old model signal that their AWS knowledge is out of date, and interviewers who know AWS well will catch this.
What to say instead
Say: "S3 has provided strong read-after-write consistency for GET, PUT, and DELETE operations, and strongly consistent LIST results, since December 2020. The eventual consistency caveat no longer applies to object operations. Where propagation delays still apply is bucket-level configuration - changes such as bucket policies, ACLs, or enabling versioning can take a short time to take effect."
Saying "I'd use CloudWatch" without specifying which part
What candidates say
“I'd monitor that with CloudWatch.”
Why interviewers mark this down
CloudWatch is not one tool but several capabilities with different purposes, among them Metrics (time-series data, custom namespaces), Logs (log ingestion, log groups, metric filters), Alarms (threshold-based alerts that trigger SNS or Auto Scaling actions), and Container Insights (for ECS/EKS). Saying "CloudWatch" without naming the part tells an interviewer you know the brand but not the practice. They will follow up, and a vague answer to the follow-up is what costs you.
What to say instead
Say: "For application metrics I publish custom metrics to CloudWatch Metrics via the PutMetricData API. For logs I stream to CloudWatch Logs using the awslogs log driver on ECS or the CloudWatch agent on EC2, with a retention policy set on each log group. For alerting I create CloudWatch Alarms with appropriate evaluation periods that notify an SNS topic, which routes to PagerDuty or Slack. I also use metric filters on structured log lines to extract numeric values and alarm on them."
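As a sketch of what naming the specific part looks like in code, the functions below build keyword arguments for the put_metric_data and put_metric_alarm calls on a boto3 CloudWatch client. The namespace, metric name, 800 ms threshold, and SNS topic ARN are all hypothetical. The alarm uses a p99 statistic and fires when 3 of the last 5 one-minute datapoints breach, so a single spike does not page anyone.

```python
# Hypothetical namespace and SNS topic ARN, for illustration only.
NAMESPACE = "ExampleApp/Checkout"
TOPIC_ARN = "arn:aws:sns:eu-west-1:123456789012:example-oncall"


def metric_data_kwargs(latency_ms: float, service: str) -> dict:
    """Keyword arguments for a CloudWatch client's put_metric_data call."""
    return {
        "Namespace": NAMESPACE,
        "MetricData": [
            {
                "MetricName": "CheckoutLatency",
                "Dimensions": [{"Name": "Service", "Value": service}],
                "Value": latency_ms,
                "Unit": "Milliseconds",
            }
        ],
    }


def latency_alarm_kwargs(service: str) -> dict:
    """Keyword arguments for put_metric_alarm: p99 above 800 ms in 3 of the last 5 minutes."""
    return {
        "AlarmName": f"{service}-latency-p99",
        "Namespace": NAMESPACE,
        "MetricName": "CheckoutLatency",
        "Dimensions": [{"Name": "Service", "Value": service}],
        "ExtendedStatistic": "p99",          # percentile statistics use ExtendedStatistic
        "Period": 60,                        # one-minute datapoints
        "EvaluationPeriods": 5,              # look at the last 5 datapoints...
        "DatapointsToAlarm": 3,              # ...and alarm if 3 of them breach
        "Threshold": 800.0,
        "ComparisonOperator": "GreaterThanThreshold",
        "TreatMissingData": "notBreaching",  # no traffic should not page anyone
        "AlarmActions": [TOPIC_ARN],         # SNS topic that fans out to on-call tooling
    }
```

With boto3 installed and credentials configured, these would be passed as client.put_metric_data(**metric_data_kwargs(412.0, "checkout-api")) and client.put_metric_alarm(**latency_alarm_kwargs("checkout-api")).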
Confusing security groups and network ACLs
What candidates say
“Security groups and NACLs both control traffic - I'd use security groups as they're simpler.”
Why interviewers mark this down
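These two controls operate at different scopes and behave fundamentally differently. Security groups are stateful: if you allow inbound traffic, the response is automatically allowed outbound. NACLs are stateless: each packet is evaluated on its own, so you must explicitly allow both directions, including the ephemeral port range (for example 1024-65535) that return traffic uses. Security groups apply per network interface (ENI level); NACLs apply per subnet. Security groups support only allow rules and evaluate all of them; NACLs support explicit deny rules and evaluate numbered rules in order, stopping at the first match. Not knowing the statefulness difference means you cannot diagnose connectivity issues: a NACL that allows inbound 443 but has no outbound rule for ephemeral ports silently drops every response.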
What to say instead
Say: "Security groups are stateful and the primary control at the resource level - allow rules only, attached to each network interface. NACLs are stateless and operate at the subnet boundary - you need rules for both inbound and outbound traffic, including ephemeral ports for replies, and they support explicit deny. In practice I use security groups for all fine-grained access control and NACLs as a defence-in-depth layer for subnet-level blocking, for example denying known malicious IP ranges. The statefulness difference is the critical one to explain."
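The statefulness difference is easiest to see in a toy model. The sketch below is a deliberately simplified simulation, not how AWS implements either control: it ignores IP addresses and protocols and looks only at ports. A client on ephemeral port 50000 calls a server on 443. The security group lets the reply out with no outbound rule at all because it tracks the flow; the NACL drops the reply until an outbound rule covers the ephemeral range.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Packet:
    src_port: int
    dst_port: int
    direction: str  # "in" or "out", as seen from the protected resource


class SecurityGroup:
    """Stateful model: allow rules only, and replies on a tracked flow always pass."""

    def __init__(self, inbound_ports, outbound_ports=()):
        self.inbound_ports = set(inbound_ports)
        self.outbound_ports = set(outbound_ports)
        self.flows = set()  # (client_port, server_port) pairs admitted inbound

    def allow(self, pkt: Packet) -> bool:
        if pkt.direction == "in":
            if pkt.dst_port in self.inbound_ports:
                self.flows.add((pkt.src_port, pkt.dst_port))
                return True
            return False
        # Outbound: a reply to an admitted inbound flow passes without any rule.
        if (pkt.dst_port, pkt.src_port) in self.flows:
            return True
        return pkt.dst_port in self.outbound_ports


class NetworkAcl:
    """Stateless model: each packet is checked alone, rules in ascending number, first match decides."""

    def __init__(self, rules):
        # Each rule: (rule_number, direction, first_port, last_port, "allow" or "deny")
        self.rules = sorted(rules)

    def allow(self, pkt: Packet) -> bool:
        for _number, direction, first, last, action in self.rules:
            if direction == pkt.direction and first <= pkt.dst_port <= last:
                return action == "allow"
        return False  # implicit final rule: deny anything that did not match


request = Packet(src_port=50000, dst_port=443, direction="in")
reply = Packet(src_port=443, dst_port=50000, direction="out")

sg = SecurityGroup(inbound_ports=[443])  # note: no outbound rules at all
nacl_inbound_only = NetworkAcl([(100, "in", 443, 443, "allow")])
nacl_both_ways = NetworkAcl([
    (100, "in", 443, 443, "allow"),
    (100, "out", 1024, 65535, "allow"),  # ephemeral ports used by replies
])

print(sg.allow(request), sg.allow(reply))                                # True True
print(nacl_inbound_only.allow(request), nacl_inbound_only.allow(reply))  # True False
print(nacl_both_ways.allow(request), nacl_both_ways.allow(reply))        # True True
```

The second line of output is the classic production symptom: requests arrive, the server processes them, and the client times out because every reply is dropped at the subnet boundary.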
Confusing RDS Multi-AZ with Read Replicas
What candidates say
“Multi-AZ improves read performance across availability zones.”
Why interviewers mark this down
Multi-AZ and Read Replicas solve completely different problems. Multi-AZ is about availability: in a Multi-AZ DB instance deployment, AWS maintains a synchronous standby in a different AZ and fails over to it automatically if the primary goes down, and that standby is not readable. (The newer Multi-AZ DB cluster deployment for MySQL and PostgreSQL does add two readable standbys, which is worth mentioning if you know it.) Read Replicas are for read scalability: asynchronous replicas you direct read traffic to, but they do not fail over automatically; promoting one is a manual step. Confusing these two is one of the most common database architecture mistakes interviewers catch.
What to say instead
Say: "Multi-AZ is high availability - synchronous replication to a standby in another AZ, with automatic failover that typically completes in 60-120 seconds, and in a Multi-AZ instance deployment the standby serves no reads. Read Replicas are read scalability - asynchronous replication to one or more replicas you route read traffic to, but they do not protect against primary failure on their own; promoting one is a manual step. For production I enable Multi-AZ for HA and add Read Replicas only if read load justifies the additional cost."
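A minimal sketch of what routing read traffic to replicas means in application code, assuming hypothetical hostnames and a naive rule that treats any statement starting with SELECT as a read. Because replication to Read Replicas is asynchronous, a read that must see a just-committed write is sent to the primary. The Multi-AZ standby never appears in this routing: on failover, RDS repoints the primary endpoint's DNS name at the promoted standby, so the application keeps using the same hostname.

```python
import itertools
from typing import Sequence


class ReadWriteRouter:
    """Writes and freshness-critical reads go to the primary; other reads rotate over replicas."""

    def __init__(self, primary: str, replicas: Sequence[str]):
        self.primary = primary
        self._replicas = itertools.cycle(replicas) if replicas else None

    def endpoint_for(self, sql: str, needs_fresh_read: bool = False) -> str:
        is_read = sql.lstrip().upper().startswith("SELECT")  # naive classification
        if not is_read or needs_fresh_read or self._replicas is None:
            # Replicas lag the primary (asynchronous replication), so anything that
            # writes, or must see a just-committed write, uses the primary endpoint.
            return self.primary
        return next(self._replicas)


# Hypothetical hostnames, for illustration only.
router = ReadWriteRouter(
    "orders-primary.example.internal",
    ["orders-replica-1.example.internal", "orders-replica-2.example.internal"],
)
print(router.endpoint_for("INSERT INTO orders VALUES (1)"))
print(router.endpoint_for("SELECT * FROM orders"))
print(router.endpoint_for("SELECT * FROM orders"))
print(router.endpoint_for("SELECT * FROM orders WHERE id = 1", needs_fresh_read=True))
```

Real applications usually make this decision per request or per transaction rather than by inspecting SQL text, but the constraint is the same: replica reads trade freshness for capacity.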
Describing step scaling when target tracking is the right answer
What candidates say
“I'd set up a CloudWatch alarm on CPU at 70% to trigger Auto Scaling.”
Why interviewers mark this down
Configuring alarms and step policies by hand is the old pattern. Target tracking scaling is the modern default: you specify a target metric value (e.g. 50% CPU) and Auto Scaling continuously adjusts the group's desired capacity to hold the metric near that target, creating and managing the underlying CloudWatch alarms itself, so you never tune thresholds and step sizes. It scales out quickly when the metric rises above the target, scales in gradually, and needs far less ongoing maintenance. Describing step scaling in 2026 without mentioning target tracking signals knowledge that has not been updated.
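The difference from step scaling is that you state a target rather than thresholds and step sizes. The sketch below shows the request shape for a boto3 Auto Scaling client's put_scaling_policy call (the group and policy names are hypothetical) and a simplified model of the proportional idea. AWS does not publish its exact target tracking algorithm, and the real service scales in more gradually than this function does, so treat proportional_desired as an illustration only.

```python
import math

# Request shape for a boto3 Auto Scaling client's put_scaling_policy call.
# The group and policy names are hypothetical.
target_tracking_policy = {
    "AutoScalingGroupName": "example-web-asg",
    "PolicyName": "keep-cpu-near-50",
    "PolicyType": "TargetTrackingScaling",
    "TargetTrackingConfiguration": {
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "ASGAverageCPUUtilization"
        },
        "TargetValue": 50.0,
    },
}


def proportional_desired(current, metric, target, min_size, max_size):
    """Simplified model only: resize in proportion to metric/target, round up, clamp.

    This is not AWS's algorithm; it illustrates why no step sizes are needed.
    """
    desired = math.ceil(current * metric / target)
    return max(min_size, min(max_size, desired))


print(proportional_desired(4, 80.0, 50.0, 2, 20))   # 7: 4 instances at 80% -> 6.4, rounded up
print(proportional_desired(4, 25.0, 50.0, 2, 20))   # 2: load halved, so capacity halves
print(proportional_desired(10, 95.0, 50.0, 2, 12))  # 12: 19 wanted, capped at max_size
```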
What to say instead
Say: "For general compute I use target tracking - set a target CPU of 50% and Auto Scaling handles the math. For predictable traffic patterns I add scheduled scaling on top (scale out before a known daily peak, scale in overnight). For queue workers I use target tracking on a custom backlog-per-instance metric: the SQS ApproximateNumberOfMessagesVisible metric divided by the number of running workers. Step scaling is valid where I need precise control, but target tracking is the right default and what I would choose unless there is a specific reason not to."
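The backlog-per-instance approach, as described in the EC2 Auto Scaling documentation on SQS-based scaling, tracks queue depth relative to worker count, because 1,500 queued messages mean very different things to 2 workers and to 20. A minimal sketch, assuming an illustrative 50-second latency goal and 0.5 seconds of processing per message; the function names are hypothetical.

```python
import math


def backlog_per_instance(messages_visible, in_service):
    """Custom metric to publish: visible messages per running worker."""
    return messages_visible / max(in_service, 1)


def acceptable_backlog(max_latency_s, seconds_per_message):
    """Target value: messages one worker can clear within the latency goal."""
    return max_latency_s / seconds_per_message


def workers_for(messages_visible, max_latency_s, seconds_per_message):
    """Worker count at which backlog per instance falls back to the target."""
    return math.ceil(messages_visible / acceptable_backlog(max_latency_s, seconds_per_message))


# Illustrative numbers: 50 s latency goal, 0.5 s of work per message.
print(backlog_per_instance(1500, 10))  # 150.0
print(acceptable_backlog(50, 0.5))     # 100.0
print(workers_for(1500, 50, 0.5))      # 15
```

With 10 workers the backlog per instance is 150 against a target of 100, so a target tracking policy on this metric would move the group toward 15 workers.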
Treating Lambda as just "serverless functions" without operational depth
What candidates say
“Lambda is a serverless function service - you just upload your code and it runs.”
Why interviewers mark this down
This surface-level answer fails at mid-level and above. Interviewers want to hear about cold starts and provisioned concurrency; the concurrency model (each concurrent invocation needs its own execution environment, reserved vs unreserved concurrency, account-level limits); what changes when a function runs inside a VPC (it loses default internet access, so it needs a NAT Gateway for internet egress or VPC endpoints to reach AWS services privately); the 15-minute maximum execution time; and the connection-handling problem with databases at scale.
What to say instead
Say: "Cold starts matter for latency-sensitive paths - provisioned concurrency keeps pre-initialised execution environments ready, at a cost. Inside a VPC, a function has no internet access by default: it needs a NAT Gateway for outbound internet traffic, or VPC endpoints to reach AWS services privately. VPC attachment used to add ENI setup time to cold starts, but since AWS moved to shared Hyperplane ENIs in 2019 that penalty is largely gone. For databases, each concurrent execution environment opens its own connection, so spiky Lambda traffic can exhaust RDS connection limits - RDS Proxy solves this by pooling and sharing connections. I scope Lambda to short-lived, stateless, event-driven tasks where cold start latency is acceptable."
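A minimal sketch of the init-versus-handler split that makes cold starts and connection pressure concrete. The _open_connection function is a hypothetical stand-in for connecting to a database or an RDS Proxy endpoint; here it only increments a counter so the example runs anywhere. Code at module scope runs once per execution environment, during the cold start, and warm invocations in that environment reuse it.

```python
# Counts how many times the expensive setup ran in this execution environment.
INIT_COUNT = 0


def _open_connection():
    """Hypothetical stand-in for an expensive connect (e.g. to an RDS Proxy endpoint)."""
    global INIT_COUNT
    INIT_COUNT += 1
    return {"connection_id": INIT_COUNT}


# Module scope: runs once per execution environment, during the cold start's init phase.
# Every concurrent environment runs this separately, so 500 concurrent environments
# mean up to 500 open connections, which is the pressure RDS Proxy absorbs.
CONNECTION = _open_connection()


def handler(event, context):
    """Lambda entry point: reuses CONNECTION instead of reconnecting on every invocation."""
    # A real handler would run its query on CONNECTION here.
    return {"statusCode": 200, "init_count": INIT_COUNT, "order_id": event.get("order_id")}


# Simulate two warm invocations in the same execution environment.
first = handler({"order_id": "A-1"}, None)
second = handler({"order_id": "A-2"}, None)
print(first["init_count"], second["init_count"])  # 1 1
```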
Reciting the shared responsibility model without applying it
What candidates say
“AWS is responsible for security of the cloud, customers are responsible for security in the cloud.”
Why interviewers mark this down
This is the AWS tagline, not an answer. Interviewers want you to apply the model to a specific scenario. What is AWS responsible for on an EC2 instance? What shifts to AWS when you move to RDS? To Lambda? Reciting the phrase without applying it to the service in question signals that you memorised marketing copy rather than understood the operational implications.
What to say instead
Say: "For EC2: AWS owns the facilities, physical hardware, and hypervisor; I own everything from the guest OS up - patching, security group config, IAM, application security, data encryption. For RDS: AWS takes on OS and database engine patching, applied in a maintenance window I choose; I still own database users and permissions, VPC config, enabling encryption at rest, and backup retention. For Lambda: AWS owns the OS and the managed runtime; I own the function code, execution role permissions, and data handling. The model shifts more responsibility to AWS as services become more managed."
Not knowing Route 53 routing policies beyond basic DNS
What candidates say
“Route 53 is AWS's DNS service - you create A records and CNAMEs pointing to your resources.”
Why interviewers mark this down
Basic DNS knowledge is not what interviewers probe with Route 53. They want to know you can design for availability and global performance using routing policies, among them Failover routing (primary and secondary records with health checks), Latency-based routing (route each user to the region with the lowest measured latency), Weighted routing (send 10% of traffic to a new deployment for canary testing), and Geolocation routing (route by user location, for example to keep EU users on EU infrastructure). Not knowing these means you cannot design multi-region architectures.
What to say instead
Say: "Route 53 routing policies go well beyond basic records. For active-passive failover I use Failover routing with health checks - if the primary fails its health check, Route 53 automatically serves the secondary. For canary deployments I use Weighted routing to send a small percentage to a new version. For multi-region low-latency architectures I use Latency-based routing so users are served from the lowest-latency healthy region. For data residency I use Geolocation routing to send EU users to EU infrastructure, backed by application-level controls, because geolocation is inferred from the resolver or client subnet and is not an enforcement boundary on its own."
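A toy model of two of these policies, not Route 53's implementation: weighted routing answers each query with a record chosen with probability equal to its weight divided by the sum of weights, and failover routing answers with the primary while its health check passes. The record names and the 90/10 split are hypothetical. In practice resolvers cache answers for the record's TTL, so the real traffic split is only approximately proportional to the weights.

```python
import random


def weighted_pick(records, rng):
    """Weighted routing model: each answer is chosen with probability weight / total."""
    names = list(records)
    return rng.choices(names, weights=[records[n] for n in names], k=1)[0]


def failover_pick(primary, secondary, primary_healthy):
    """Failover routing model: serve the primary while its health check passes."""
    return primary if primary_healthy else secondary


rng = random.Random(7)  # fixed seed so the run is repeatable
weights = {"stable": 90, "canary": 10}  # hypothetical records for a canary release
answers = [weighted_pick(weights, rng) for _ in range(10_000)]
canary_share = answers.count("canary") / len(answers)
print(f"canary share: {canary_share:.3f}")  # close to 10 / (90 + 10) = 0.10

print(failover_pick("primary-region", "secondary-region", primary_healthy=False))
```

Shifting a canary to 50% or rolling it back is then a matter of changing two weights, with no change to the records clients resolve.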
The bottom line
These gaps come up in almost every AWS interview at mid-level and above. The pattern is the same as with Kubernetes: candidates who know the service names and surface behaviour get filtered out when interviewers probe the operational details. Understand why security groups are stateful but NACLs are not, why Multi-AZ and Read Replicas are not interchangeable, and why access keys on instances are always wrong - these are the questions that separate engineers who have run AWS in production from those who have only read about it.