EMR

Big data processing with Hadoop, Spark, Hive, and Presto on managed clusters

Amazon EMR (Elastic MapReduce) is a managed cluster platform for running big data frameworks like Apache Spark, Hadoop, Hive, and Presto at petabyte scale. It provisions, configures, and auto-scales EC2 clusters so you can focus on processing logic rather than infrastructure. EMR is widely used in data engineering pipelines for ETL, machine learning feature engineering, and large-scale log analysis.

How EMR Clusters Execute Jobs

An EMR cluster consists of a primary node (coordinates jobs), core nodes (run tasks and store HDFS data), and optional task nodes (run tasks only, no HDFS). When you submit a job, the primary node distributes work across core and task nodes.

Node Type	Role	HDFS Storage	Can Spot?
Primary	Coordinates YARN, HDFS NameNode, job history	No	Not recommended
Core	Runs tasks + stores HDFS blocks	Yes	Risky - data loss on termination
Task	Runs tasks only, no HDFS	No	Yes - safe to use Spot

💡

Use On-Demand for primary and core nodes. Spot instances are safe for task nodes because they hold no HDFS data - losing them only slows the job, it does not corrupt data.

EMR supports two storage modes: HDFS (ephemeral, fast, local to cluster) and EMRFS (S3-backed, persistent across cluster restarts). Most modern EMR architectures use EMRFS to decouple storage from compute.

Cluster vs Serverless vs Studio

AWS offers three EMR deployment modes. Choosing the right one depends on job duration, cost sensitivity, and interactivity needs.

Mode	Use Case	Cold Start	Cost Model
EMR on EC2	Long-running clusters, custom AMIs, full control	Minutes	Pay per EC2 second
EMR Serverless	Ephemeral batch jobs, no cluster management	1-2 min	Pay per vCPU/GB-hour of job runtime
EMR on EKS	Run Spark on existing EKS clusters	Seconds (pod)	Pay for EKS worker nodes
EMR Studio	Interactive notebooks (JupyterLab)	N/A	Pay for underlying cluster

💡

EMR Serverless is the best default choice for new batch workloads - no cluster to manage, and you only pay while a job runs. The main limitation is no persistent HDFS and limited custom application support.

Running Apache Spark on EMR

Spark is the most common EMR workload. You can submit jobs via spark-submit, EMR Steps, or the EMR API.

bash

# Submit a Spark job as an EMR Step
aws emr add-steps \
  --cluster-id j-XXXXXXXXXXXX \
  --steps Type=Spark,Name="ETL Job",\
ActionOnFailure=CONTINUE,\
Args=[--deploy-mode,cluster,\
--class,com.example.ETLJob,\
s3://my-bucket/jars/etl.jar,\
s3://my-bucket/input/,\
s3://my-bucket/output/]

Key Spark tuning parameters for EMR:

Parameter	Purpose	Typical Value
spark.executor.memory	Memory per executor	4-16g depending on instance type
spark.executor.cores	Cores per executor	2-4 (leave 1 for YARN overhead)
spark.dynamicAllocation.enabled	Scale executors with load	true for variable workloads
spark.sql.adaptive.enabled	Adaptive query execution	true (default in Spark 3.x)
spark.sql.shuffle.partitions	Output partitions after shuffle	200 default, tune to data size

⚠️

The default spark.sql.shuffle.partitions=200 causes small file problems with large datasets. Set it to roughly (input size GB) * 2 for better performance. Too many partitions slows S3 writes; too few causes OOM errors.

Pricing Model and Cost Optimization

EMR pricing has two components: the underlying EC2 cost plus an EMR surcharge per instance-hour. The surcharge varies by instance type (roughly 25-75% on top of EC2 On-Demand price).

Strategy	Savings	Tradeoff
Spot for task nodes	60-90%	Job runs slower if Spot is reclaimed
Reserved Instances for core nodes	30-60%	Requires 1-3 year commitment
EMR Serverless vs always-on cluster	Up to 80% for intermittent jobs	Cold start latency per job
Instance fleets with multiple types	20-40%	More complex configuration
Auto-scaling task nodes	10-30%	Scaling lag can affect SLAs

💡

For jobs that run a few hours per day, EMR Serverless almost always beats a persistent cluster on cost. For jobs running more than 8 hours per day, a persistent cluster with Reserved Instances wins.

Common EMR Architectures

EMR fits into data pipelines in several well-established patterns:

Pattern	Description	Tools
S3 Data Lake ETL	Read raw S3 data, transform with Spark, write Parquet back to S3	EMR + Spark + Glue Catalog
Lambda Architecture	Batch layer (EMR) + speed layer (Kinesis) merge for queries	EMR + Kinesis + DynamoDB
Feature Store Pipeline	Transform raw events into ML features nightly	EMR Spark + SageMaker Feature Store
Log Aggregation	Process CloudWatch/VPC Flow logs at scale	EMR + Hive or Spark SQL
Transient Clusters	Spin up per job, terminate when done, data in S3	EMR Steps + S3 + Glue Catalog

🎯

Interview Focus Points

1What is the difference between core nodes and task nodes, and why is Spot safe for task nodes but risky for core nodes?
2When would you choose EMR Serverless over a persistent EC2 cluster?
3How do you tune Spark shuffle partitions and why does the default of 200 cause problems at scale?
4Explain EMRFS vs HDFS - when do you use each and what are the trade-offs?
5How does EMR auto-scaling work, and what metrics trigger scale-out?
6What is the EMR transient cluster pattern and why is it preferred over long-running clusters for batch workloads?
7How do you handle Spot interruptions in an EMR Spark job without losing all progress?
8How does EMR integrate with the Glue Data Catalog for schema management?