AWS Database
RDS
Managed relational databases - MySQL, PostgreSQL, Oracle, SQL Server, MariaDB
Amazon RDS (Relational Database Service) is a managed service that handles provisioning, patching, backups, and failover for MySQL, PostgreSQL, MariaDB, Oracle, SQL Server, and Amazon Aurora, with IBM Db2 added more recently. It removes the undifferentiated heavy lifting of running a relational database so engineers can focus on schema design and query optimization rather than OS maintenance. RDS is a common default for OLTP workloads that need SQL semantics and ACID guarantees.
How RDS Works: Instance, Storage, and Replication
An RDS deployment consists of a DB instance (compute), storage (EBS-backed gp2/gp3/io1/io2), and optionally a Multi-AZ standby or read replicas. The primary instance writes to EBS, which is synchronously replicated to the standby in Multi-AZ mode. Read replicas use asynchronous replication and serve read traffic.
| Component | Description | Key Behaviour |
|---|---|---|
| DB Instance | Compute running the database engine | Sized by instance class (db.t3, db.r6g, etc.) |
| EBS Storage | Persistent block storage attached to the instance | gp3 is default; io1/io2 for high IOPS |
| Multi-AZ Standby | Synchronous replica in a different AZ | Automatic failover in ~60-120 seconds |
| Read Replica | Asynchronous read-only copy | Up to 15 per primary for MySQL, MariaDB, and PostgreSQL (5 for Oracle and SQL Server); can be cross-region |
| Parameter Group | Engine configuration (e.g. max_connections) | Changes may require a reboot |
| Option Group | Optional engine features (e.g. Oracle APEX) | Engine-specific add-ons |
Multi-AZ is for high availability (HA), not for read scaling. Read replicas are for read scaling but do not provide automatic failover.
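The distinction above maps onto different API calls. The following is a minimal sketch, assuming an existing instance named `mydb` and a hypothetical replica name such as `mydb-replica-1`. The helper function names are illustrative, not part of the AWS CLI. Wrapping the commands in functions means nothing changes until one is called.

```shell
# Sketch only: helper names and instance identifiers are hypothetical.

# High availability: convert an existing instance to Multi-AZ.
# RDS creates a synchronous standby in another AZ; it serves no reads.
enable_multi_az() {
  local instance_id="$1"
  aws rds modify-db-instance \
    --db-instance-identifier "$instance_id" \
    --multi-az \
    --apply-immediately
}

# Read scaling: add an asynchronous read replica with its own endpoint.
add_read_replica() {
  local source_id="$1" replica_id="$2"
  aws rds create-db-instance-read-replica \
    --db-instance-identifier "$replica_id" \
    --source-db-instance-identifier "$source_id"
}

# Promotion is the manual step behind the "no automatic failover" caveat:
# it detaches the replica and turns it into a standalone writable instance.
promote_replica() {
  local replica_id="$1"
  aws rds promote-read-replica --db-instance-identifier "$replica_id"
}
```

For example, `add_read_replica mydb mydb-replica-1` would request one replica. Applications then need to send read traffic to the replica's endpoint explicitly.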
Storage Types and IOPS Sizing
Choosing the right storage type is one of the most common RDS sizing mistakes. Under-provisioning IOPS leads to queue depth buildup and latency spikes that are hard to diagnose.
| Storage Type | Max IOPS | Max Throughput | Use Case |
|---|---|---|---|
| gp2 | 16,000 (3 IOPS/GB baseline; small volumes burst to 3,000) | 250 MB/s | General purpose, legacy default |
| gp3 | 64,000 | 4,000 MB/s | General purpose, cost-optimized default |
| io1 | 64,000 | 1,000 MB/s | I/O-intensive OLTP (legacy) |
| io2 Block Express | 256,000 | 4,000 MB/s | Mission-critical, sub-millisecond latency |
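Migrate existing gp2 instances to gp3 - you get a 3,000 IOPS and 125 MB/s baseline included in the storage price regardless of volume size, whereas gp2 baseline IOPS scale at 3 IOPS per GB (minimum 100). A gp2 volume needs about 1,000 GB to reach 3,000 baseline IOPS, so for volumes under 1 TB gp3 is almost always both faster and cheaper.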
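The gp2 versus gp3 comparison can be checked with a little arithmetic. This sketch assumes gp2's documented baseline of 3 IOPS per GB with a floor of 100 and a ceiling of 16,000, and uses the 3,000 IOPS gp3 baseline quoted above. The function name `gp2_baseline_iops` is hypothetical.

```shell
# gp2 baseline IOPS: 3 IOPS per GB, floor of 100, ceiling of 16,000.
gp2_baseline_iops() {
  local size_gb="$1"
  local iops=$(( size_gb * 3 ))
  if [ "$iops" -lt 100 ]; then iops=100; fi
  if [ "$iops" -gt 16000 ]; then iops=16000; fi
  echo "$iops"
}

# gp3 baseline as quoted in this section (assumption: RDS may grant a
# higher baseline on larger volumes for some engines).
GP3_BASELINE_IOPS=3000

for size_gb in 20 100 500 1000 2000; do
  echo "${size_gb} GB: gp2=$(gp2_baseline_iops "$size_gb") IOPS, gp3=${GP3_BASELINE_IOPS} IOPS"
done
```

Below 1,000 GB the gp2 figure stays under 3,000, which is the range where migrating is expected to pay off. The migration itself is a storage type change, for example `aws rds modify-db-instance --db-instance-identifier mydb --storage-type gp3 --apply-immediately`.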
Storage autoscaling only expands volume - it never shrinks. Plan your initial size with headroom because you cannot scale down without creating a new instance from a snapshot.
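Storage autoscaling is controlled by a ceiling on the instance. The sketch below assumes an instance named `mydb`. The helper names are hypothetical, while `--max-allocated-storage` and the queried fields come from the AWS CLI.

```shell
# Sketch only: helper names are hypothetical.

# Set the autoscaling ceiling in GiB. RDS can grow the volume up to this
# value but never shrinks it, so choose the ceiling deliberately.
set_storage_ceiling() {
  local instance_id="$1" max_gib="$2"
  aws rds modify-db-instance \
    --db-instance-identifier "$instance_id" \
    --max-allocated-storage "$max_gib" \
    --apply-immediately
}

# Show the current size, the autoscaling ceiling, and the storage type.
show_storage() {
  local instance_id="$1"
  aws rds describe-db-instances \
    --db-instance-identifier "$instance_id" \
    --query 'DBInstances[0].[AllocatedStorage,MaxAllocatedStorage,StorageType]' \
    --output table
}
```

A call such as `set_storage_ceiling mydb 1000` would allow growth up to 1,000 GiB. Pairing this with a CloudWatch alarm on `FreeStorageSpace` is a common complement, though that is outside this sketch.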
HA and Disaster Recovery Patterns
RDS provides several layers of protection. Understanding the RTO and RPO of each is essential for architecture decisions and disaster recovery planning interviews.
| Pattern | RTO | RPO | Notes |
|---|---|---|---|
| Multi-AZ (synchronous) | 60-120 sec | Near zero | Automatic failover, same region |
| Read Replica promoted | Minutes (manual) | Seconds of lag | Manual intervention required |
| Cross-region read replica | Minutes (manual) | Seconds to minutes | Good DR target for another region |
| Automated backups | Hours | Up to 5 min (PITR) | Point-in-time recovery within retention window |
| Manual snapshots | Hours | Snapshot age | Persists after instance deletion |
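Enable automated backups to get Point-in-Time Recovery (PITR); any retention period from 1 to 35 days enables it, and 7 days or more is a reasonable floor for production. PITR restores from the most recent backup and replays transaction logs, giving you recovery to any second within the retention window up to the latest restorable time, which typically trails the present by about 5 minutes. The restore always creates a new DB instance rather than overwriting the existing one.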
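A minimal sketch of the PITR workflow, assuming an instance named `mydb` and a hypothetical target name `mydb-restored`. The helper names are illustrative. The restore creates a new instance, so the application must be pointed at the new endpoint afterwards.

```shell
# Sketch only: helper names are hypothetical.

# Turn on automated backups by setting a retention period in days.
set_backup_retention() {
  local instance_id="$1" days="$2"
  aws rds modify-db-instance \
    --db-instance-identifier "$instance_id" \
    --backup-retention-period "$days" \
    --apply-immediately
}

# Show the most recent point that PITR can currently restore to.
latest_restorable_time() {
  local instance_id="$1"
  aws rds describe-db-instances \
    --db-instance-identifier "$instance_id" \
    --query 'DBInstances[0].LatestRestorableTime' \
    --output text
}

# Restore into a NEW instance at the latest restorable time.
restore_to_latest() {
  local source_id="$1" target_id="$2"
  aws rds restore-db-instance-to-point-in-time \
    --source-db-instance-identifier "$source_id" \
    --target-db-instance-identifier "$target_id" \
    --use-latest-restorable-time
}
```

Checking `latest_restorable_time mydb` before an incident review gives a concrete figure for the achievable RPO instead of an assumed one.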
RDS Proxy: Connection Pooling for Serverless Workloads
RDS Proxy sits between your application and RDS, pooling and sharing database connections. It is critical when using Lambda - each concurrent Lambda execution environment opens its own connection, so without a proxy a burst to 1,000 concurrent invocations creates up to 1,000 database connections, which can exhaust max_connections on small instances.
| Feature | Without Proxy | With RDS Proxy |
|---|---|---|
| Connection count | One per app thread/Lambda | Pooled - far fewer to DB |
| Failover time | 60-120 seconds | Reduced - the proxy keeps client connections open and routes to the new primary without waiting for DNS changes |
| IAM auth | Possible but complex | Native IAM auth support |
| Cost | No extra cost | About $0.015 per vCPU-hour of the DB instance (varies by region) |
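To see why a Lambda burst exhausts a small instance, it helps to estimate the connection limit. This sketch assumes the RDS for MySQL default of `max_connections = DBInstanceClassMemory / 12582880`. It uses nominal RAM as a stand-in for `DBInstanceClassMemory`, so the real limit is somewhat lower. The function names are hypothetical.

```shell
# Approximate the MySQL default max_connections from instance memory (bytes).
# Assumption: default formula DBInstanceClassMemory / 12582880.
approx_mysql_max_connections() {
  local memory_bytes="$1"
  echo $(( memory_bytes / 12582880 ))
}

# Exit status 0 when the client count would exceed the connection limit.
exceeds_connection_limit() {
  local concurrent_clients="$1" max_connections="$2"
  [ "$concurrent_clients" -gt "$max_connections" ]
}

one_gib=$(( 1024 * 1024 * 1024 ))
limit=$(approx_mysql_max_connections "$one_gib")

if exceeds_connection_limit 1000 "$limit"; then
  echo "1000 concurrent Lambdas exceed ~${limit} connections: pool through RDS Proxy"
fi
```

With 1 GiB of memory the estimate is 85 connections, more than an order of magnitude below the 1,000-connection burst described above.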
# Create an RDS Proxy
aws rds create-db-proxy \
--db-proxy-name my-proxy \
--engine-family MYSQL \
--auth '[{"AuthScheme":"SECRETS","SecretArn":"arn:aws:secretsmanager:...","IAMAuth":"REQUIRED"}]' \
--role-arn arn:aws:iam::123456789012:role/rds-proxy-role \
--vpc-subnet-ids subnet-abc subnet-def
Pricing Model and Cost Optimization
RDS pricing breaks down into the six line items below. Optimizing each one independently can cut costs significantly without sacrificing performance.
| Component | Pricing Basis | Optimization Tip |
|---|---|---|
| Instance hours | Per hour by instance class | Reserve 1-3 years for production (up to 69% savings) |
| Storage | Per GB-month (gp3 cheaper than gp2) | Migrate to gp3; enable storage autoscaling |
| I/O capacity | Per provisioned IOPS-month on io1/io2; gp3 charges only above its included baseline; gp2 I/O is included | Monitor read/write IOPS; io1/io2 only if needed |
| Backup storage | Free up to DB size; per GB beyond | Reduce retention window on dev/staging |
| Data transfer | Per GB for cross-AZ/cross-region | Keep app and DB in same AZ to avoid cross-AZ fees |
| Multi-AZ | ~2x instance + storage cost | Use only in production; use single-AZ in dev |
Cross-AZ data transfer between your application and RDS is charged even within the same VPC, so an app in a different AZ from the database pays for every byte in both directions. When latency and cost matter, place app instances in the same AZ as the primary instance, and remember that a Multi-AZ failover moves the primary to the standby's AZ.
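A rough way to size the cross-AZ charge, as a sketch under stated assumptions: the rate of 1 cent per GB in each direction (2 cents per GB moved) is an assumed regional rate, so check current pricing for your region. The function names are hypothetical, while the queried fields come from the `describe-db-instances` output.

```shell
# Assumed rate: 1 cent per GB, billed on both the sending and receiving side.
CROSS_AZ_CENTS_PER_GB_EACH_WAY=1

# Monthly cross-AZ cost in cents for the total GB moved between app and DB.
cross_az_monthly_cents() {
  local total_gb="$1"
  echo $(( total_gb * 2 * CROSS_AZ_CENTS_PER_GB_EACH_WAY ))
}

# Print the AZ of the primary and, for Multi-AZ, of the standby.
db_availability_zones() {
  local instance_id="$1"
  aws rds describe-db-instances \
    --db-instance-identifier "$instance_id" \
    --query 'DBInstances[0].[AvailabilityZone,SecondaryAvailabilityZone]' \
    --output text
}

cents=$(cross_az_monthly_cents 5000)
printf '5000 GB/month across AZs: $%d.%02d\n' $(( cents / 100 )) $(( cents % 100 ))
```

Under these assumptions, 5,000 GB a month of app-to-database traffic across AZs comes to about $100, which is worth comparing against the availability benefit of spreading app instances across zones.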
CLI Commands and Operational Runbook
# Describe all RDS instances
aws rds describe-db-instances --query 'DBInstances[*].[DBInstanceIdentifier,DBInstanceStatus,MultiAZ]' --output table
# Create a manual snapshot before a risky migration
aws rds create-db-snapshot \
--db-instance-identifier mydb \
--db-snapshot-identifier mydb-pre-migration-$(date +%Y%m%d)
# Initiate manual Multi-AZ failover (for testing)
aws rds reboot-db-instance \
--db-instance-identifier mydb \
--force-failover
# Restore to a point in time
aws rds restore-db-instance-to-point-in-time \
--source-db-instance-identifier mydb \
--target-db-instance-identifier mydb-restored \
--restore-time 2024-01-15T12:00:00Z
# Modify instance class (causes an outage; with Multi-AZ it is limited to roughly a failover)
aws rds modify-db-instance \
--db-instance-identifier mydb \
--db-instance-class db.r6g.large \
--apply-immediately
Interview Focus Points
1. What is the difference between Multi-AZ and a read replica? When would you use each?
2. Walk me through what happens during an RDS Multi-AZ failover. What is the RTO and RPO?
3. A Lambda function is throwing "too many connections" errors against RDS. How do you fix it?
4. When would you choose io2 Block Express storage over gp3? What metrics would guide that decision?
5. How does Point-in-Time Recovery work in RDS? What are its limitations?
6. How would you migrate an on-premises MySQL database to RDS with minimal downtime?
7. What happens to RDS when the EBS volume runs out of space? How do you prevent it?
8. Explain RDS Proxy - how does it work and what problems does it solve?
9. What is the difference between a parameter group and an option group in RDS?
10. A production RDS instance is running slow. What metrics and tools do you use to diagnose it?