Redshift

Petabyte-scale cloud data warehouse optimized for analytics workloads

Amazon Redshift is a fully managed petabyte-scale cloud data warehouse optimized for OLAP (Online Analytical Processing) workloads - aggregations, joins, and scans across hundreds of billions of rows. It uses a massively parallel processing (MPP) columnar storage architecture that is fundamentally different from row-oriented OLTP databases. Redshift is the central piece of most AWS analytics architectures, integrating with S3, Aurora, Glue, QuickSight, and SageMaker.

Redshift Architecture: Leader Node, Compute Nodes, and Slices

A Redshift cluster has one leader node and one or more compute nodes. The leader node receives queries, builds query plans, and coordinates execution. Compute nodes execute the plan in parallel across slices (virtual CPU cores within each node). Data is distributed across slices according to the distribution style.

Component	Role	Key Detail
Leader node	Query parsing, planning, coordination	Not charged separately; no data stored here
Compute node	Parallel data storage and execution	Comes in Dense Compute (dc2) or RA3 node types
Slice	Unit of parallel processing within a node	Each node has 2-32 slices depending on type
Node storage	dc2: local SSD; RA3: S3-backed managed storage	RA3 separates compute from storage
Cluster endpoint	JDBC/ODBC connection point	Always connects to the leader node

💡

RA3 nodes are the modern default. They use Redshift Managed Storage (RMS) backed by S3, so you can scale compute independently from storage. This makes RA3 clusters much easier to right-size than legacy dc2 clusters where compute and storage are coupled.

Distribution Keys and Sort Keys: The Core Performance Levers

Distribution style determines how rows are spread across slices. Sort key determines the physical order of data on disk. These two design choices have more impact on Redshift query performance than almost anything else.

Distribution Style	Behaviour	Best For
EVEN	Round-robin across all slices	No clear join key; tables not frequently joined
KEY	Rows with same key value go to same slice	Large fact tables joined on a common key (reduces data redistribution)
ALL	Full copy of table on every node	Small dimension tables joined frequently (eliminates network shuffle)
AUTO	AWS chooses based on table size; changes over time	Default for new tables; good starting point

Sort Key Type	Behaviour	Best For
Compound sort key	Sorted on columns in order defined; prefix queries most efficient	Queries that always filter on first column(s)
Interleaved sort key	Equal weight to all columns; no prefix advantage	Multiple queries filtering on different columns (deprecated - use compound)

⚠️

Interleaved sort keys are deprecated and AWS recommends migrating to compound sort keys or AQUA. A common sort key pattern is (date_column, customer_id) for time-series fact tables - most queries filter by date range first.

Redshift Serverless vs Provisioned Clusters

Redshift Serverless eliminates cluster management entirely. It automatically scales capacity and charges per RPU-second (Redshift Processing Unit). Provisioned clusters give you more control over node types and pricing.

Attribute	Provisioned	Serverless
Capacity management	Choose node type and count	Automatic scaling - no nodes to manage
Pricing	Per node-hour (+ storage for RA3)	Per RPU-second (min 8 RPUs)
Idle cost	Full cost even when idle	Scales to zero after inactivity (configurable)
Concurrency scaling	Separate concurrency scaling clusters	Automatic
RA3 managed storage	Yes	Uses managed storage automatically
Best for	Steady high-concurrency production DW	Dev/test, irregular workloads, new projects

💡

Redshift Serverless has a base RPU cost even when running queries - it does not drop to truly zero cost during a query. The minimum is 8 RPUs (approximately the cost of a small provisioned cluster). For 24/7 steady-state analytical workloads, provisioned with Reserved Instances is usually cheaper.

Redshift Spectrum and Zero-ETL Integrations

Redshift can query data directly in S3 without loading it (Spectrum), and can receive near-real-time data from Aurora and DynamoDB via Zero-ETL.

Feature	What It Does	Use Case
COPY command	Bulk load from S3, DynamoDB, Kinesis, EMR into Redshift tables	Initial data load; daily batch ingestion
Redshift Spectrum	Query S3 data (Parquet, CSV, ORC) via external tables without loading	Data lake queries; historical cold data in S3
Zero-ETL (Aurora)	Near-real-time replication of Aurora data into Redshift automatically	Operational analytics without DMS pipelines
Zero-ETL (DynamoDB)	Replicate DynamoDB tables into Redshift	Flatten NoSQL data for SQL analytics
UNLOAD command	Export query results to S3 in parallel	Data export; feeding downstream pipelines

bash

-- COPY data from S3 (most efficient ingestion method)
COPY sales
FROM 's3://my-bucket/sales/2024/'
IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftS3Role'
FORMAT AS PARQUET;

-- Query S3 data directly with Spectrum
SELECT s.customer_id, SUM(s.amount)
FROM spectrum_schema.cold_sales s
JOIN public.customers c ON s.customer_id = c.id
WHERE s.sale_date >= '2023-01-01'
GROUP BY 1;

-- UNLOAD results to S3
UNLOAD ('SELECT * FROM sales WHERE sale_date = CURRENT_DATE')
TO 's3://my-bucket/daily-export/'
IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftS3Role'
FORMAT PARQUET PARALLEL ON;

Performance Tuning: WLM, Vacuum, Analyze

Redshift performance degrades over time without maintenance. The three main operational tasks are WLM tuning (managing query queues), VACUUM (reclaiming space from deletes/updates), and ANALYZE (updating table statistics).

Operation	Purpose	When to Run
VACUUM	Reclaims space from deleted rows; re-sorts unsorted rows	After large delete/update operations; scheduled weekly
ANALYZE	Updates table statistics used by the query planner	After significant data loads; automatically runs via auto-analyze
WLM (Workload Management)	Routes queries to queues with memory/concurrency limits	Configure to separate ETL from BI queries
Concurrency Scaling	Adds burst capacity for concurrent queries	Enable for unpredictable BI query bursts

⚠️

A VACUUM FULL on a large table can run for hours and consumes cluster resources. Schedule VACUUM during off-peak hours or use VACUUM SORT ONLY / VACUUM DELETE ONLY to do partial maintenance. Auto-vacuum handles most cases automatically on modern Redshift clusters.

🎯

Interview Focus Points

1Explain Redshift's MPP architecture. How does it differ from a standard RDBMS like PostgreSQL?
2What is a distribution key and how do you choose one? What happens if you choose poorly?
3What is a sort key in Redshift and how does it improve query performance?
4When would you use Redshift Spectrum instead of loading data into Redshift?
5Compare Redshift Serverless vs provisioned clusters. When is each more appropriate?
6What is the COPY command and why is it preferred over INSERT for bulk data loading?
7Explain WLM in Redshift. How would you configure it to separate ETL from BI query workloads?
8Why does Redshift performance degrade over time and how do you maintain it?
9How does Zero-ETL from Aurora to Redshift work? What problem does it solve?
10A Redshift query that used to run in 30 seconds now takes 10 minutes. How do you diagnose it?