AWS Storage
S3
Infinitely scalable object storage with 99.999999999% (11 nines) durability
Amazon S3 (Simple Storage Service) is object storage with virtually unlimited capacity, designed for 11 nines (99.999999999%) of durability. Most storage classes achieve this by storing data redundantly across at least three Availability Zones; One Zone-IA is the exception and keeps data in a single zone. S3 serves as the backbone for data lakes, static website hosting, backup archives, and application asset storage. Every cloud architect must understand S3 deeply because it underpins dozens of AWS services and is almost always present in production architectures.
S3 Storage Classes and When to Use Each
S3 offers multiple storage classes optimized for different access patterns and cost profiles. Choosing the wrong class is one of the most common sources of unexpected AWS bills.
| Storage Class | Availability | Min Duration | Retrieval | Use Case |
|---|---|---|---|---|
| Standard | 99.99% | None | Immediate | Frequently accessed data |
| Intelligent-Tiering | 99.9% | None | Immediate | Unknown or changing access patterns |
| Standard-IA | 99.9% | 30 days | Immediate | Infrequent access, rapid when needed |
| One Zone-IA | 99.5% | 30 days | Immediate | Infrequent, non-critical, reproducible |
| Glacier Instant | 99.9% | 90 days | Milliseconds | Archive with immediate access |
| Glacier Flexible | 99.99% | 90 days | Minutes (expedited) to 12 hours (bulk) | Backups, disaster recovery |
| Glacier Deep Archive | 99.99% | 180 days | 12-48 hours | Long-term compliance archives |
Intelligent-Tiering automatically moves objects between tiers based on access patterns. There is a monitoring fee per 1,000 objects but no retrieval fees - ideal when you cannot predict access patterns.
Standard-IA and One Zone-IA charge a retrieval fee per GB. If you access IA data frequently, you can end up paying more than Standard pricing. Always model expected access before choosing IA classes.
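The break-even point can be modeled with simple arithmetic. The sketch below is illustrative only: the Standard price comes from the pricing table in this guide, while the Standard-IA storage and retrieval prices are assumed us-east-1 list prices that should be checked against the current S3 pricing page. It ignores request charges and minimum-duration charges.

```python
# Assumed us-east-1 list prices in USD; verify against the current S3 pricing page.
STANDARD_PER_GB = 0.023      # Standard storage per GB-month (from this guide's pricing table)
IA_PER_GB = 0.0125           # Standard-IA storage per GB-month (assumption)
IA_RETRIEVAL_PER_GB = 0.01   # Standard-IA retrieval fee per GB (assumption)


def standard_cost(stored_gb: float) -> float:
    """Monthly storage cost in S3 Standard, which has no retrieval fee."""
    return stored_gb * STANDARD_PER_GB


def standard_ia_cost(stored_gb: float, retrieved_gb: float) -> float:
    """Monthly storage plus retrieval cost in Standard-IA."""
    return stored_gb * IA_PER_GB + retrieved_gb * IA_RETRIEVAL_PER_GB


def break_even_retrieval_ratio() -> float:
    """Fraction of the stored data read per month at which IA stops being cheaper."""
    return (STANDARD_PER_GB - IA_PER_GB) / IA_RETRIEVAL_PER_GB


if __name__ == "__main__":
    stored = 1000.0
    for retrieved in (100.0, 1000.0, 2000.0):
        print(
            f"retrieved={retrieved:.0f} GB  "
            f"standard=${standard_cost(stored):.2f}  "
            f"standard_ia=${standard_ia_cost(stored, retrieved):.2f}"
        )
```

Under these assumed prices, Standard-IA only loses to Standard once you read back roughly the whole dataset each month, but request fees and the 30-day minimum charge can move that point earlier.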
Bucket Configuration and Security Model
S3 uses a layered security model combining bucket policies, IAM policies, ACLs, and Block Public Access settings. Understanding how these interact is critical for preventing data leaks.
| Control Layer | Scope | Best Practice |
|---|---|---|
| Block Public Access | Account or bucket level | Enable on all buckets unless intentionally public |
| Bucket Policy | Bucket and object level | Use for cross-account access and enforce HTTPS |
| IAM Policy | Principal (user/role) level | Use for granting AWS principals access |
| ACLs | Object level | Disable ACLs - use bucket policies instead (AWS now recommends this) |
| S3 Access Points | Application level | Use for multiple apps sharing one bucket with different permissions |
```bash
# Enforce HTTPS-only access via bucket policy
aws s3api put-bucket-policy --bucket my-bucket --policy '{
  "Version": "2012-10-17",
  "Statement": [{
    "Sid": "DenyHTTP",
    "Effect": "Deny",
    "Principal": "*",
    "Action": "s3:*",
    "Resource": ["arn:aws:s3:::my-bucket", "arn:aws:s3:::my-bucket/*"],
    "Condition": {"Bool": {"aws:SecureTransport": "false"}}
  }]
}'
```

As of April 2023, S3 Object Ownership defaults to "Bucket owner enforced" for new buckets, which disables ACLs entirely. This is the recommended setting.
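When the same HTTPS-only policy has to be applied to many buckets, it can help to generate the document in code rather than hand-editing JSON. The following is a minimal sketch; the helper name `https_only_policy` is hypothetical and the bucket name is a placeholder.

```python
import json


def https_only_policy(bucket: str) -> dict:
    """Build a bucket policy that denies any request made without TLS."""
    bucket_arn = f"arn:aws:s3:::{bucket}"
    return {
        "Version": "2012-10-17",
        "Statement": [{
            "Sid": "DenyHTTP",
            "Effect": "Deny",
            "Principal": "*",
            "Action": "s3:*",
            # Cover both bucket-level and object-level operations.
            "Resource": [bucket_arn, f"{bucket_arn}/*"],
            "Condition": {"Bool": {"aws:SecureTransport": "false"}},
        }],
    }


if __name__ == "__main__":
    policy_json = json.dumps(https_only_policy("my-bucket"), indent=2)
    print(policy_json)
    # With boto3 installed and credentials configured, this could be applied with:
    #   boto3.client("s3").put_bucket_policy(Bucket="my-bucket", Policy=policy_json)
```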
Versioning, Replication, and Lifecycle Rules
Versioning preserves every version of an object, enabling recovery from accidental deletes and overwrites. Replication copies objects to another bucket, optionally in another region or account.
| Feature | CRR (Cross-Region) | SRR (Same-Region) |
|---|---|---|
| Primary use | Compliance, latency reduction, DR | Log aggregation, dev/prod sync |
| Versioning required | Yes - on source and destination | Yes - on source and destination |
| Existing objects | Not replicated automatically | Not replicated automatically |
| Delete markers | Optional replication | Optional replication |
| Cost | Data transfer + replication requests | Replication requests only |
Lifecycle rules automate transitioning objects between storage classes and expiring old versions:
```bash
# Example: transition to IA after 30 days, Glacier after 90, expire after 365
aws s3api put-bucket-lifecycle-configuration \
  --bucket my-bucket \
  --lifecycle-configuration file://lifecycle.json
```

lifecycle.json snippet:

```json
{
  "Rules": [{
    "ID": "archive-old-objects",
    "Status": "Enabled",
    "Filter": {"Prefix": "logs/"},
    "Transitions": [
      {"Days": 30, "StorageClass": "STANDARD_IA"},
      {"Days": 90, "StorageClass": "GLACIER"}
    ],
    "Expiration": {"Days": 365}
  }]
}
```

Replication only applies to new objects after replication is enabled. Use S3 Batch Operations to replicate existing objects. Also note - S3 does not replicate objects that already exist in the destination bucket.
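To sanity-check the lifecycle rule in this section (Standard-IA at 30 days, Glacier at 90, expiry at 365), a small simulation can show which class an object should be in at a given age. This is only a sketch of the rule's intent: the function name is hypothetical, and real lifecycle actions run asynchronously, so actual transitions may lag the configured day.

```python
from typing import Optional

# Mirrors the example rule: (minimum age in days, storage class), ascending by age.
TRANSITIONS = [(30, "STANDARD_IA"), (90, "GLACIER")]
EXPIRATION_DAYS = 365


def storage_class_at(age_days: int) -> Optional[str]:
    """Return the expected storage class at a given object age, or None once expired."""
    if age_days >= EXPIRATION_DAYS:
        return None
    current = "STANDARD"
    for min_age, storage_class in TRANSITIONS:
        if age_days >= min_age:
            current = storage_class
    return current


if __name__ == "__main__":
    for age in (0, 29, 30, 89, 90, 364, 365):
        print(age, storage_class_at(age))
```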
Performance Optimization and Common Patterns
S3 automatically partitions based on key prefixes. Understanding how S3 partitions data helps you design key naming conventions that scale without throttling.
| Pattern | Description | Use Case |
|---|---|---|
| Prefix randomization | Add hash prefix to avoid hot partitions (old guidance - now largely unnecessary) | Very high-throughput legacy workloads |
| Multipart upload | Upload objects >100MB in parallel parts | Large files, resumable uploads |
| Transfer Acceleration | Route uploads via CloudFront edge network | Global upload performance |
| S3 Select | Query CSV/JSON/Parquet with SQL - retrieve subset | Reduce data transfer for analytics |
| Requester Pays | Downloader pays transfer costs | Public datasets, cost sharing |
| Presigned URLs | Temporary authenticated URLs | Direct browser upload/download without proxy |
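For the multipart upload pattern in the table, the client decides how to split the object into parts. The sketch below plans byte ranges under S3's documented limits (parts of at least 5 MiB except the last, at most 10,000 parts per upload). The function name `plan_parts` and the 64 MiB default are illustrative assumptions, not AWS recommendations; in practice an SDK's managed transfer utilities usually handle this for you.

```python
from typing import List, Tuple

MIB = 1024 * 1024
MIN_PART_SIZE = 5 * MIB   # S3 minimum size for every part except the last
MAX_PARTS = 10_000        # S3 maximum number of parts in one multipart upload


def plan_parts(object_size: int, part_size: int = 64 * MIB) -> List[Tuple[int, int, int]]:
    """Return (part_number, start_offset, length) for each part of an upload."""
    if object_size <= 0:
        raise ValueError("object_size must be positive")
    if part_size < MIN_PART_SIZE:
        raise ValueError("part_size must be at least 5 MiB")
    # Grow the part size when needed so the upload fits within MAX_PARTS.
    min_size_for_limit = -(-object_size // MAX_PARTS)  # ceiling division
    part_size = max(part_size, min_size_for_limit)

    parts = []
    offset = 0
    number = 1  # S3 part numbers start at 1
    while offset < object_size:
        length = min(part_size, object_size - offset)
        parts.append((number, offset, length))
        offset += length
        number += 1
    return parts


if __name__ == "__main__":
    for part in plan_parts(200 * MIB):
        print(part)
```

Each planned range can then be uploaded in parallel and retried independently, which is what makes large uploads resumable.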
S3 supports at least 3,500 PUT/COPY/POST/DELETE and 5,500 GET/HEAD requests per second per prefix. With multiple prefixes, throughput scales linearly. For most workloads this is not a bottleneck.
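The per-prefix figures above translate into a quick capacity estimate. This sketch assumes load is spread evenly across prefixes and that S3 has already scaled to the request rate; scaling happens gradually, so a sudden spike can still see throttling. The function names are hypothetical.

```python
import math

WRITES_PER_PREFIX = 3_500   # PUT/COPY/POST/DELETE requests per second per prefix
READS_PER_PREFIX = 5_500    # GET/HEAD requests per second per prefix


def aggregate_request_rate(prefix_count: int) -> dict:
    """Baseline request rates across evenly loaded prefixes."""
    return {
        "writes_per_sec": prefix_count * WRITES_PER_PREFIX,
        "reads_per_sec": prefix_count * READS_PER_PREFIX,
    }


def prefixes_needed(target_reads_per_sec: int, target_writes_per_sec: int) -> int:
    """Smallest number of prefixes whose combined baseline covers both targets."""
    for_reads = math.ceil(target_reads_per_sec / READS_PER_PREFIX)
    for_writes = math.ceil(target_writes_per_sec / WRITES_PER_PREFIX)
    return max(1, for_reads, for_writes)


if __name__ == "__main__":
    print(aggregate_request_rate(10))
    print(prefixes_needed(20_000, 10_000))
```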
Presigned URLs are a critical pattern for serverless architectures - they allow clients to upload directly to S3 without routing large files through your application servers:
```bash
# Generate a presigned URL for downloading an object (expires in 1 hour).
# Note: `aws s3 presign` only produces GET URLs.
aws s3 presign s3://my-bucket/uploads/video.mp4 \
  --expires-in 3600
```

```python
# For PUT operations (upload), use an AWS SDK such as boto3:
import boto3

s3 = boto3.client('s3')
url = s3.generate_presigned_url(
    'put_object',
    Params={'Bucket': 'my-bucket', 'Key': 'uploads/video.mp4'},
    ExpiresIn=3600,
)
```

S3 Pricing Model and Cost Optimization
S3 costs have four main dimensions: storage, requests, data transfer, and optional features. Data transfer out to the internet is often the largest surprise cost.
| Cost Component | Standard Pricing (us-east-1) | Optimization |
|---|---|---|
| Storage | $0.023/GB/month first 50TB | Use lifecycle rules to tier down to IA/Glacier |
| PUT/COPY/POST/LIST | $0.005 per 1,000 requests | Batch small writes, avoid excessive list operations |
| GET/SELECT | $0.0004 per 1,000 requests | Use CloudFront to cache and reduce origin GETs |
| Data transfer out | $0.09/GB (after the first 100GB per month, which is free) | Use CloudFront - no transfer fee S3 to CloudFront |
| Replication | $0.015/GB transferred | Replicate only what is needed |
| S3 Inventory | $0.0025 per million objects listed | Replace frequent LIST operations with Inventory |
Data transfer between S3 and EC2/Lambda in the same region is free. S3 to CloudFront is free. The expensive transfer is S3 to the internet or to another region.
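A back-of-the-envelope estimate helps show which cost dimension dominates a workload. The sketch below uses the us-east-1 Standard prices from the table above; the function name and the workload figures are hypothetical, it ignores storage price tiers beyond the first 50TB, and the free egress allowance is left as a parameter to set for your account.

```python
# Prices in USD, taken from the us-east-1 pricing table in this section.
STORAGE_PER_GB = 0.023   # per GB-month, first 50TB tier
PUT_PER_1000 = 0.005     # PUT/COPY/POST/LIST requests
GET_PER_1000 = 0.0004    # GET/SELECT requests
EGRESS_PER_GB = 0.09     # data transfer out to the internet


def estimate_monthly_cost(
    stored_gb: float,
    put_requests: int,
    get_requests: int,
    egress_gb: float,
    free_egress_gb: float = 0.0,
) -> dict:
    """Rough monthly S3 Standard bill broken down by cost dimension."""
    billable_egress = max(0.0, egress_gb - free_egress_gb)
    costs = {
        "storage": stored_gb * STORAGE_PER_GB,
        "put": put_requests / 1000 * PUT_PER_1000,
        "get": get_requests / 1000 * GET_PER_1000,
        "egress": billable_egress * EGRESS_PER_GB,
    }
    costs["total"] = sum(costs.values())
    return costs


if __name__ == "__main__":
    # Hypothetical workload: 5 TB stored, 2M writes, 50M reads, 1 TB served to the internet.
    bill = estimate_monthly_cost(5000, 2_000_000, 50_000_000, 1000)
    for name, amount in bill.items():
        print(f"{name}: ${amount:.2f}")
```

In this hypothetical workload, internet egress costs nearly as much as storing the data, which is why serving reads through CloudFront is usually the first optimization to evaluate.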
Interview Focus Points
1. How would you design an S3-based data lake for a company ingesting 10TB of logs per day?
2. What is the difference between S3 bucket policies and IAM policies - when do you use each?
3. A developer accidentally deleted important files from S3. How do you recover them and prevent this in the future?
4. Explain how you would use presigned URLs to allow customers to upload files directly to S3 from a browser.
5. What causes S3 throttling and how would you redesign a key naming scheme to avoid it?
6. Walk me through the S3 storage classes - how would you choose between Standard-IA and Glacier Instant Retrieval?
7. How does S3 Cross-Region Replication work and what are its limitations?
8. Your S3 data transfer costs are unexpectedly high. What are the top causes and how do you diagnose them?
9. How would you enforce encryption at rest and in transit for all objects in an S3 bucket?
10. What is S3 Object Lock and when would a compliance team require it?