Ace Cloud Interviews
🗃️

AWS Database

Redshift

Petabyte-scale cloud data warehouse optimized for analytics workloads

Amazon Redshift is a fully managed petabyte-scale cloud data warehouse optimized for OLAP (Online Analytical Processing) workloads - aggregations, joins, and scans across hundreds of billions of rows. It uses a massively parallel processing (MPP) columnar storage architecture that is fundamentally different from row-oriented OLTP databases. Redshift is the central piece of most AWS analytics architectures, integrating with S3, Aurora, Glue, QuickSight, and SageMaker.

Redshift Architecture: Leader Node, Compute Nodes, and Slices

A Redshift cluster has one leader node and one or more compute nodes. The leader node receives queries, builds query plans, and coordinates execution. Compute nodes execute the plan in parallel across slices (virtual CPU cores within each node). Data is distributed across slices according to the distribution style.

ComponentRoleKey Detail
Leader nodeQuery parsing, planning, coordinationNot charged separately; no data stored here
Compute nodeParallel data storage and executionComes in Dense Compute (dc2) or RA3 node types
SliceUnit of parallel processing within a nodeEach node has 2-32 slices depending on type
Node storagedc2: local SSD; RA3: S3-backed managed storageRA3 separates compute from storage
Cluster endpointJDBC/ODBC connection pointAlways connects to the leader node
💡

RA3 nodes are the modern default. They use Redshift Managed Storage (RMS) backed by S3, so you can scale compute independently from storage. This makes RA3 clusters much easier to right-size than legacy dc2 clusters where compute and storage are coupled.

Distribution Keys and Sort Keys: The Core Performance Levers

Distribution style determines how rows are spread across slices. Sort key determines the physical order of data on disk. These two design choices have more impact on Redshift query performance than almost anything else.

Distribution StyleBehaviourBest For
EVENRound-robin across all slicesNo clear join key; tables not frequently joined
KEYRows with same key value go to same sliceLarge fact tables joined on a common key (reduces data redistribution)
ALLFull copy of table on every nodeSmall dimension tables joined frequently (eliminates network shuffle)
AUTOAWS chooses based on table size; changes over timeDefault for new tables; good starting point
Sort Key TypeBehaviourBest For
Compound sort keySorted on columns in order defined; prefix queries most efficientQueries that always filter on first column(s)
Interleaved sort keyEqual weight to all columns; no prefix advantageMultiple queries filtering on different columns (deprecated - use compound)
⚠️

Interleaved sort keys are deprecated and AWS recommends migrating to compound sort keys or AQUA. A common sort key pattern is (date_column, customer_id) for time-series fact tables - most queries filter by date range first.

Redshift Serverless vs Provisioned Clusters

Redshift Serverless eliminates cluster management entirely. It automatically scales capacity and charges per RPU-second (Redshift Processing Unit). Provisioned clusters give you more control over node types and pricing.

AttributeProvisionedServerless
Capacity managementChoose node type and countAutomatic scaling - no nodes to manage
PricingPer node-hour (+ storage for RA3)Per RPU-second (min 8 RPUs)
Idle costFull cost even when idleScales to zero after inactivity (configurable)
Concurrency scalingSeparate concurrency scaling clustersAutomatic
RA3 managed storageYesUses managed storage automatically
Best forSteady high-concurrency production DWDev/test, irregular workloads, new projects
💡

Redshift Serverless has a base RPU cost even when running queries - it does not drop to truly zero cost during a query. The minimum is 8 RPUs (approximately the cost of a small provisioned cluster). For 24/7 steady-state analytical workloads, provisioned with Reserved Instances is usually cheaper.

Redshift Spectrum and Zero-ETL Integrations

Redshift can query data directly in S3 without loading it (Spectrum), and can receive near-real-time data from Aurora and DynamoDB via Zero-ETL.

FeatureWhat It DoesUse Case
COPY commandBulk load from S3, DynamoDB, Kinesis, EMR into Redshift tablesInitial data load; daily batch ingestion
Redshift SpectrumQuery S3 data (Parquet, CSV, ORC) via external tables without loadingData lake queries; historical cold data in S3
Zero-ETL (Aurora)Near-real-time replication of Aurora data into Redshift automaticallyOperational analytics without DMS pipelines
Zero-ETL (DynamoDB)Replicate DynamoDB tables into RedshiftFlatten NoSQL data for SQL analytics
UNLOAD commandExport query results to S3 in parallelData export; feeding downstream pipelines
bash
-- COPY data from S3 (most efficient ingestion method)
COPY sales
FROM 's3://my-bucket/sales/2024/'
IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftS3Role'
FORMAT AS PARQUET;

-- Query S3 data directly with Spectrum
SELECT s.customer_id, SUM(s.amount)
FROM spectrum_schema.cold_sales s
JOIN public.customers c ON s.customer_id = c.id
WHERE s.sale_date >= '2023-01-01'
GROUP BY 1;

-- UNLOAD results to S3
UNLOAD ('SELECT * FROM sales WHERE sale_date = CURRENT_DATE')
TO 's3://my-bucket/daily-export/'
IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftS3Role'
FORMAT PARQUET PARALLEL ON;

Performance Tuning: WLM, Vacuum, Analyze

Redshift performance degrades over time without maintenance. The three main operational tasks are WLM tuning (managing query queues), VACUUM (reclaiming space from deletes/updates), and ANALYZE (updating table statistics).

OperationPurposeWhen to Run
VACUUMReclaims space from deleted rows; re-sorts unsorted rowsAfter large delete/update operations; scheduled weekly
ANALYZEUpdates table statistics used by the query plannerAfter significant data loads; automatically runs via auto-analyze
WLM (Workload Management)Routes queries to queues with memory/concurrency limitsConfigure to separate ETL from BI queries
Concurrency ScalingAdds burst capacity for concurrent queriesEnable for unpredictable BI query bursts
⚠️

A VACUUM FULL on a large table can run for hours and consumes cluster resources. Schedule VACUUM during off-peak hours or use VACUUM SORT ONLY / VACUUM DELETE ONLY to do partial maintenance. Auto-vacuum handles most cases automatically on modern Redshift clusters.

🎯

Interview Focus Points

  • 1Explain Redshift's MPP architecture. How does it differ from a standard RDBMS like PostgreSQL?
  • 2What is a distribution key and how do you choose one? What happens if you choose poorly?
  • 3What is a sort key in Redshift and how does it improve query performance?
  • 4When would you use Redshift Spectrum instead of loading data into Redshift?
  • 5Compare Redshift Serverless vs provisioned clusters. When is each more appropriate?
  • 6What is the COPY command and why is it preferred over INSERT for bulk data loading?
  • 7Explain WLM in Redshift. How would you configure it to separate ETL from BI query workloads?
  • 8Why does Redshift performance degrade over time and how do you maintain it?
  • 9How does Zero-ETL from Aurora to Redshift work? What problem does it solve?
  • 10A Redshift query that used to run in 30 seconds now takes 10 minutes. How do you diagnose it?