AWS Database
Redshift
Petabyte-scale cloud data warehouse optimized for analytics workloads
Amazon Redshift is a fully managed petabyte-scale cloud data warehouse optimized for OLAP (Online Analytical Processing) workloads - aggregations, joins, and scans across hundreds of billions of rows. It uses a massively parallel processing (MPP) columnar storage architecture that is fundamentally different from row-oriented OLTP databases. Redshift is the central piece of most AWS analytics architectures, integrating with S3, Aurora, Glue, QuickSight, and SageMaker.
Redshift Architecture: Leader Node, Compute Nodes, and Slices
A Redshift cluster has one leader node and one or more compute nodes. The leader node receives queries, builds query plans, and coordinates execution. Compute nodes execute the plan in parallel across slices (virtual CPU cores within each node). Data is distributed across slices according to the distribution style.
| Component | Role | Key Detail |
|---|---|---|
| Leader node | Query parsing, planning, coordination | Not charged separately; no data stored here |
| Compute node | Parallel data storage and execution | Comes in Dense Compute (dc2) or RA3 node types |
| Slice | Unit of parallel processing within a node | Each node has 2-32 slices depending on type |
| Node storage | dc2: local SSD; RA3: S3-backed managed storage | RA3 separates compute from storage |
| Cluster endpoint | JDBC/ODBC connection point | Always connects to the leader node |
RA3 nodes are the modern default. They use Redshift Managed Storage (RMS) backed by S3, so you can scale compute independently from storage. This makes RA3 clusters much easier to right-size than legacy dc2 clusters where compute and storage are coupled.
Distribution Keys and Sort Keys: The Core Performance Levers
Distribution style determines how rows are spread across slices. Sort key determines the physical order of data on disk. These two design choices have more impact on Redshift query performance than almost anything else.
| Distribution Style | Behaviour | Best For |
|---|---|---|
| EVEN | Round-robin across all slices | No clear join key; tables not frequently joined |
| KEY | Rows with same key value go to same slice | Large fact tables joined on a common key (reduces data redistribution) |
| ALL | Full copy of table on every node | Small dimension tables joined frequently (eliminates network shuffle) |
| AUTO | AWS chooses based on table size; changes over time | Default for new tables; good starting point |
| Sort Key Type | Behaviour | Best For |
|---|---|---|
| Compound sort key | Sorted on columns in order defined; prefix queries most efficient | Queries that always filter on first column(s) |
| Interleaved sort key | Equal weight to all columns; no prefix advantage | Multiple queries filtering on different columns (deprecated - use compound) |
Interleaved sort keys are deprecated and AWS recommends migrating to compound sort keys or AQUA. A common sort key pattern is (date_column, customer_id) for time-series fact tables - most queries filter by date range first.
Redshift Serverless vs Provisioned Clusters
Redshift Serverless eliminates cluster management entirely. It automatically scales capacity and charges per RPU-second (Redshift Processing Unit). Provisioned clusters give you more control over node types and pricing.
| Attribute | Provisioned | Serverless |
|---|---|---|
| Capacity management | Choose node type and count | Automatic scaling - no nodes to manage |
| Pricing | Per node-hour (+ storage for RA3) | Per RPU-second (min 8 RPUs) |
| Idle cost | Full cost even when idle | Scales to zero after inactivity (configurable) |
| Concurrency scaling | Separate concurrency scaling clusters | Automatic |
| RA3 managed storage | Yes | Uses managed storage automatically |
| Best for | Steady high-concurrency production DW | Dev/test, irregular workloads, new projects |
Redshift Serverless has a base RPU cost even when running queries - it does not drop to truly zero cost during a query. The minimum is 8 RPUs (approximately the cost of a small provisioned cluster). For 24/7 steady-state analytical workloads, provisioned with Reserved Instances is usually cheaper.
Redshift Spectrum and Zero-ETL Integrations
Redshift can query data directly in S3 without loading it (Spectrum), and can receive near-real-time data from Aurora and DynamoDB via Zero-ETL.
| Feature | What It Does | Use Case |
|---|---|---|
| COPY command | Bulk load from S3, DynamoDB, Kinesis, EMR into Redshift tables | Initial data load; daily batch ingestion |
| Redshift Spectrum | Query S3 data (Parquet, CSV, ORC) via external tables without loading | Data lake queries; historical cold data in S3 |
| Zero-ETL (Aurora) | Near-real-time replication of Aurora data into Redshift automatically | Operational analytics without DMS pipelines |
| Zero-ETL (DynamoDB) | Replicate DynamoDB tables into Redshift | Flatten NoSQL data for SQL analytics |
| UNLOAD command | Export query results to S3 in parallel | Data export; feeding downstream pipelines |
-- COPY data from S3 (most efficient ingestion method)
COPY sales
FROM 's3://my-bucket/sales/2024/'
IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftS3Role'
FORMAT AS PARQUET;
-- Query S3 data directly with Spectrum
SELECT s.customer_id, SUM(s.amount)
FROM spectrum_schema.cold_sales s
JOIN public.customers c ON s.customer_id = c.id
WHERE s.sale_date >= '2023-01-01'
GROUP BY 1;
-- UNLOAD results to S3
UNLOAD ('SELECT * FROM sales WHERE sale_date = CURRENT_DATE')
TO 's3://my-bucket/daily-export/'
IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftS3Role'
FORMAT PARQUET PARALLEL ON;Performance Tuning: WLM, Vacuum, Analyze
Redshift performance degrades over time without maintenance. The three main operational tasks are WLM tuning (managing query queues), VACUUM (reclaiming space from deletes/updates), and ANALYZE (updating table statistics).
| Operation | Purpose | When to Run |
|---|---|---|
| VACUUM | Reclaims space from deleted rows; re-sorts unsorted rows | After large delete/update operations; scheduled weekly |
| ANALYZE | Updates table statistics used by the query planner | After significant data loads; automatically runs via auto-analyze |
| WLM (Workload Management) | Routes queries to queues with memory/concurrency limits | Configure to separate ETL from BI queries |
| Concurrency Scaling | Adds burst capacity for concurrent queries | Enable for unpredictable BI query bursts |
A VACUUM FULL on a large table can run for hours and consumes cluster resources. Schedule VACUUM during off-peak hours or use VACUUM SORT ONLY / VACUUM DELETE ONLY to do partial maintenance. Auto-vacuum handles most cases automatically on modern Redshift clusters.
Interview Focus Points
- 1Explain Redshift's MPP architecture. How does it differ from a standard RDBMS like PostgreSQL?
- 2What is a distribution key and how do you choose one? What happens if you choose poorly?
- 3What is a sort key in Redshift and how does it improve query performance?
- 4When would you use Redshift Spectrum instead of loading data into Redshift?
- 5Compare Redshift Serverless vs provisioned clusters. When is each more appropriate?
- 6What is the COPY command and why is it preferred over INSERT for bulk data loading?
- 7Explain WLM in Redshift. How would you configure it to separate ETL from BI query workloads?
- 8Why does Redshift performance degrade over time and how do you maintain it?
- 9How does Zero-ETL from Aurora to Redshift work? What problem does it solve?
- 10A Redshift query that used to run in 30 seconds now takes 10 minutes. How do you diagnose it?