AWS Analytics & Big Data
EMR
Big data processing with Hadoop, Spark, Hive, and Presto on managed clusters
Amazon EMR (Elastic MapReduce) is a managed cluster platform for running big data frameworks like Apache Spark, Hadoop, Hive, and Presto at petabyte scale. It provisions, configures, and auto-scales EC2 clusters so you can focus on processing logic rather than infrastructure. EMR is widely used in data engineering pipelines for ETL, machine learning feature engineering, and large-scale log analysis.
How EMR Clusters Execute Jobs
An EMR cluster consists of a primary node (coordinates jobs), core nodes (run tasks and store HDFS data), and optional task nodes (run tasks only, no HDFS). When you submit a job, the primary node distributes work across core and task nodes.
| Node Type | Role | HDFS Storage | Spot Suitable? |
|---|---|---|---|
| Primary | Coordinates YARN, HDFS NameNode, job history | No | Not recommended |
| Core | Runs tasks + stores HDFS blocks | Yes | Risky - data loss on termination |
| Task | Runs tasks only, no HDFS | No | Yes - safe to use Spot |
Use On-Demand for primary and core nodes. Spot Instances are safe for task nodes because they hold no HDFS data: losing one only slows the job and does not cause data loss.
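The node-type guidance above can be sketched as a small Python helper that builds an instance-group list in the shape boto3's EMR `run_job_flow` call accepts under `Instances["InstanceGroups"]`. The function name, group names, and instance type are illustrative assumptions, and nothing here calls AWS.

```python
def build_instance_groups(instance_type, core_count, task_count):
    """Return instance groups: On-Demand for primary/core, Spot for task nodes."""
    groups = [
        {
            "Name": "Primary",
            "InstanceRole": "MASTER",  # the API still uses MASTER for the primary node
            "Market": "ON_DEMAND",
            "InstanceType": instance_type,
            "InstanceCount": 1,
        },
        {
            "Name": "Core",
            "InstanceRole": "CORE",
            "Market": "ON_DEMAND",  # core nodes hold HDFS blocks
            "InstanceType": instance_type,
            "InstanceCount": core_count,
        },
    ]
    if task_count > 0:
        groups.append(
            {
                "Name": "Task",
                "InstanceRole": "TASK",
                "Market": "SPOT",  # no HDFS data, so an interruption only slows the job
                "InstanceType": instance_type,
                "InstanceCount": task_count,
            }
        )
    return groups


# Hypothetical sizing: 2 core nodes and 4 Spot task nodes.
example_groups = build_instance_groups("m5.xlarge", core_count=2, task_count=4)
```

Leaving the task group out entirely when `task_count` is 0 avoids sending a group with zero instances.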
EMR supports two storage modes: HDFS (ephemeral, fast, local to the cluster) and EMRFS (S3-backed, so data persists after the cluster terminates). Most modern EMR architectures use EMRFS to decouple storage from compute.
Cluster vs Serverless vs Studio
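AWS offers three EMR deployment modes (EMR on EC2, EMR Serverless, and EMR on EKS), plus EMR Studio, a notebook environment that runs on top of them. Choosing the right one depends on job duration, cost sensitivity, and interactivity needs.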
| Mode | Use Case | Cold Start | Cost Model |
|---|---|---|---|
| EMR on EC2 | Long-running clusters, custom AMIs, full control | Minutes | Pay per EC2 second |
| EMR Serverless | Ephemeral batch jobs, no cluster management | 1-2 min | Pay per vCPU-hour and GB-hour used by workers |
| EMR on EKS | Run Spark on existing EKS clusters | Seconds (pod) | Pay for EKS worker nodes plus an EMR charge per vCPU and GB of memory used by job pods |
| EMR Studio | Interactive notebooks (JupyterLab) | N/A | Pay for underlying cluster |
EMR Serverless is the best default choice for new batch workloads: there is no cluster to manage, and you only pay while a job runs. The main limitations are the lack of persistent HDFS and limited custom application support.
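As a rough illustration of what "no cluster to manage" looks like in practice, the sketch below builds the request that boto3's `emr-serverless` client takes in `start_job_run`. The helper name, the application ID, and the role ARN are placeholders, and the function only assembles a dictionary rather than calling AWS.

```python
def build_serverless_job_request(application_id, execution_role_arn,
                                 jar_uri, main_class, app_args, spark_conf):
    """Assemble keyword arguments for emr-serverless start_job_run (Spark job)."""
    # spark-submit style options are passed as one space-separated string.
    params = [f"--class {main_class}"]
    params += [f"--conf {key}={value}" for key, value in sorted(spark_conf.items())]
    return {
        "applicationId": application_id,
        "executionRoleArn": execution_role_arn,
        "jobDriver": {
            "sparkSubmit": {
                "entryPoint": jar_uri,
                "entryPointArguments": list(app_args),
                "sparkSubmitParameters": " ".join(params),
            }
        },
    }


# Placeholder identifiers; substitute real values before use.
request = build_serverless_job_request(
    application_id="APPLICATION_ID_PLACEHOLDER",
    execution_role_arn="arn:aws:iam::123456789012:role/example-job-role",
    jar_uri="s3://my-bucket/jars/etl.jar",
    main_class="com.example.ETLJob",
    app_args=["s3://my-bucket/input/", "s3://my-bucket/output/"],
    spark_conf={"spark.executor.memory": "4g"},
)
```

With boto3 installed and credentials configured, this dictionary would be passed as `boto3.client("emr-serverless").start_job_run(**request)`.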
Running Apache Spark on EMR
Spark is the most common EMR workload. You can submit jobs via spark-submit, EMR Steps, or the EMR API.
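For the EMR API route, a step is a dictionary that wraps a `spark-submit` command run through `command-runner.jar`. A minimal Python sketch of building such a step, in the shape boto3's `add_job_flow_steps` accepts, might look like this. The helper name is an assumption, and the jar, class, and bucket paths are example values.

```python
def build_spark_step(name, jar_uri, main_class, app_args,
                     action_on_failure="CONTINUE"):
    """Build one EMR step definition that runs spark-submit in cluster mode."""
    return {
        "Name": name,
        "ActionOnFailure": action_on_failure,
        "HadoopJarStep": {
            # command-runner.jar executes the command given in Args on the cluster.
            "Jar": "command-runner.jar",
            "Args": [
                "spark-submit",
                "--deploy-mode", "cluster",
                "--class", main_class,
                jar_uri,
                *app_args,
            ],
        },
    }


step = build_spark_step(
    name="ETL Job",
    jar_uri="s3://my-bucket/jars/etl.jar",
    main_class="com.example.ETLJob",
    app_args=["s3://my-bucket/input/", "s3://my-bucket/output/"],
)
```

With boto3 available, the step would be submitted with `boto3.client("emr").add_job_flow_steps(JobFlowId=..., Steps=[step])`.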
# Submit a Spark job as an EMR Step
aws emr add-steps \
--cluster-id j-XXXXXXXXXXXX \
--steps Type=Spark,Name="ETL Job",\
ActionOnFailure=CONTINUE,\
Args=[--deploy-mode,cluster,\
--class,com.example.ETLJob,\
s3://my-bucket/jars/etl.jar,\
s3://my-bucket/input/,\
s3://my-bucket/output/]
Key Spark tuning parameters for EMR:
| Parameter | Purpose | Typical Value |
|---|---|---|
| spark.executor.memory | Memory per executor | 4-16g depending on instance type |
| spark.executor.cores | Cores per executor | 2-4 (leave 1 core per node for the OS and YARN daemons) |
| spark.dynamicAllocation.enabled | Scale executors with load | true for variable workloads |
| spark.sql.adaptive.enabled | Adaptive query execution | true (default since Spark 3.2) |
| spark.sql.shuffle.partitions | Number of partitions produced by shuffles (joins, aggregations) | 200 (default); tune to data size |
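The default spark.sql.shuffle.partitions=200 is rarely right for large datasets: each partition becomes too large, which leads to spills to disk and OOM errors. A rough starting point is (input size in GB) * 2, which works out to about 512 MB per partition; adjust from there based on the shuffle sizes shown in the Spark UI. Too many partitions produce many small output files and slow S3 writes; too few cause OOM errors.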
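The sizing rule of thumb can be written as a few lines of Python. This is only a sketch of the "(input size in GB) * 2" heuristic from the text. Rounding up to a multiple of the total executor cores is an added assumption (so each wave of tasks fills the cluster), and both function names are hypothetical.

```python
import math


def suggest_shuffle_partitions(input_size_gb, total_cores=None):
    """Starting value for spark.sql.shuffle.partitions: about 2 partitions per GB."""
    partitions = max(1, math.ceil(input_size_gb * 2))
    if total_cores:
        # Assumption: round up to a multiple of the core count.
        partitions = math.ceil(partitions / total_cores) * total_cores
    return partitions


def to_spark_submit_conf(settings):
    """Turn a dict of Spark settings into spark-submit --conf arguments."""
    args = []
    for key, value in sorted(settings.items()):
        args += ["--conf", f"{key}={value}"]
    return args


# Example: 500 GB of input on a cluster with 48 executor cores in total.
conf_args = to_spark_submit_conf({
    "spark.sql.shuffle.partitions": suggest_shuffle_partitions(500, total_cores=48),
    "spark.sql.adaptive.enabled": "true",
})
```

Treat the result as a first guess to refine against observed shuffle sizes, not a fixed answer.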
Pricing Model and Cost Optimization
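EMR pricing has two components: the underlying EC2 cost plus an EMR surcharge per instance-hour. The surcharge varies by instance type: it is roughly 25% of the EC2 On-Demand price for most common instance sizes and proportionally smaller for the largest ones. The surcharge is a fixed rate per instance type, so it is not discounted when the instance itself runs on Spot.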
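To make the two-component model concrete, here is a small Python sketch that adds the EC2 price and the EMR surcharge for each node group. All prices in the example are illustrative placeholders, not quoted rates, so check the current price list for your region and instance type.

```python
def cluster_hourly_cost(node_groups):
    """Sum hourly cost over node groups.

    Each group is a tuple: (instance_count, ec2_price_per_hour, emr_fee_per_hour).
    The EMR fee is charged per instance on top of whatever is paid for EC2,
    whether that is the On-Demand price or a Spot price.
    """
    return sum(count * (ec2_price + emr_fee)
               for count, ec2_price, emr_fee in node_groups)


# Placeholder prices in USD per hour, for illustration only.
ON_DEMAND_PRICE = 0.192
SPOT_PRICE = 0.060
EMR_FEE = 0.048

hourly = cluster_hourly_cost([
    (1, ON_DEMAND_PRICE, EMR_FEE),  # primary node
    (2, ON_DEMAND_PRICE, EMR_FEE),  # core nodes
    (4, SPOT_PRICE, EMR_FEE),       # task nodes on Spot
])
print(f"Estimated cluster cost: ${hourly:.3f} per hour")
```

Because the EMR fee is added per instance regardless of market, it makes up a larger share of the bill on Spot nodes than on On-Demand nodes.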
| Strategy | Savings | Tradeoff |
|---|---|---|
| Spot for task nodes | 60-90% | Job runs slower if Spot is reclaimed |
| Reserved Instances for core nodes | 30-60% | Requires a 1- or 3-year commitment |
| EMR Serverless vs always-on cluster | Up to 80% for intermittent jobs | Cold start latency per job |
| Instance fleets with multiple types | 20-40% | More complex configuration |
| Auto-scaling task nodes | 10-30% | Scaling lag can affect SLAs |
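For jobs that run only a few hours per day, EMR Serverless usually beats a persistent cluster on cost because nothing is billed while idle. As a rule of thumb, once a workload runs more than about 8 hours per day, a persistent cluster with Reserved Instances tends to win; the exact break-even depends on instance types, discounts, and utilization.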
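The break-even point can be estimated with simple arithmetic: an always-on cluster is billed for 24 hours a day at its (discounted) hourly rate, while a pay-per-use option is billed only for the hours a job runs, usually at a higher effective rate. The sketch below uses hypothetical rates chosen so the break-even lands at 8 hours; real numbers depend on your workload and pricing.

```python
def daily_cost_always_on(hourly_rate):
    """Daily cost of a cluster that runs around the clock."""
    return hourly_rate * 24


def daily_cost_pay_per_use(hourly_rate, job_hours_per_day):
    """Daily cost when billing stops as soon as the job finishes."""
    return hourly_rate * job_hours_per_day


def break_even_hours(always_on_rate, pay_per_use_rate):
    """Job hours per day at which both options cost the same."""
    return 24 * always_on_rate / pay_per_use_rate


# Hypothetical rates in USD per hour for equivalent capacity.
ALWAYS_ON_RATE = 1.00    # persistent cluster after Reserved Instance discounts
PAY_PER_USE_RATE = 3.00  # effective serverless rate while a job is running

threshold = break_even_hours(ALWAYS_ON_RATE, PAY_PER_USE_RATE)
print(f"Pay-per-use is cheaper below {threshold:.1f} job hours per day")
```

Below the threshold the pay-per-use option is cheaper; above it the always-on cluster is.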
Common EMR Architectures
EMR fits into data pipelines in several well-established patterns:
| Pattern | Description | Tools |
|---|---|---|
| S3 Data Lake ETL | Read raw S3 data, transform with Spark, write Parquet back to S3 | EMR + Spark + Glue Catalog |
| Lambda Architecture | Batch layer (EMR) and speed layer (Kinesis), with results merged at query time | EMR + Kinesis + DynamoDB |
| Feature Store Pipeline | Transform raw events into ML features nightly | EMR Spark + SageMaker Feature Store |
| Log Aggregation | Process CloudWatch Logs and VPC Flow Logs at scale | EMR + Hive or Spark SQL |
| Transient Clusters | Spin up per job, terminate when done, data in S3 | EMR Steps + S3 + Glue Catalog |
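The transient cluster pattern in the last row can be sketched as a request for boto3's EMR `run_job_flow` call: the cluster starts, runs its steps, and terminates on its own because `KeepJobFlowAliveWhenNoSteps` is false. The release label, instance types, node count, bucket paths, and helper name are assumptions for illustration. The role names are the EMR default roles, which may differ in your account.

```python
def build_transient_cluster_request(name, log_uri, spark_submit_args):
    """Assemble run_job_flow arguments for a cluster that exits after its step."""
    return {
        "Name": name,
        "ReleaseLabel": "emr-6.15.0",  # assumption: pick the release you have tested
        "LogUri": log_uri,
        "Applications": [{"Name": "Spark"}],
        "Instances": {
            "MasterInstanceType": "m5.xlarge",
            "SlaveInstanceType": "m5.xlarge",
            "InstanceCount": 3,
            # False means the cluster shuts down once all steps have finished.
            "KeepJobFlowAliveWhenNoSteps": False,
        },
        "Steps": [
            {
                "Name": "ETL Job",
                # Stop paying for the cluster if the step fails.
                "ActionOnFailure": "TERMINATE_CLUSTER",
                "HadoopJarStep": {
                    "Jar": "command-runner.jar",
                    "Args": ["spark-submit", *spark_submit_args],
                },
            }
        ],
        "JobFlowRole": "EMR_EC2_DefaultRole",
        "ServiceRole": "EMR_DefaultRole",
    }


cluster_request = build_transient_cluster_request(
    name="nightly-etl",
    log_uri="s3://my-bucket/logs/",
    spark_submit_args=[
        "--deploy-mode", "cluster",
        "--class", "com.example.ETLJob",
        "s3://my-bucket/jars/etl.jar",
        "s3://my-bucket/input/",
        "s3://my-bucket/output/",
    ],
)
```

Input and output both live in S3, so nothing is lost when the cluster terminates.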
Interview Focus Points
1. What is the difference between core nodes and task nodes, and why is Spot safe for task nodes but risky for core nodes?
2. When would you choose EMR Serverless over a persistent EC2 cluster?
3. How do you tune Spark shuffle partitions and why does the default of 200 cause problems at scale?
4. Explain EMRFS vs HDFS - when do you use each and what are the trade-offs?
5. How does EMR auto-scaling work, and what metrics trigger scale-out?
6. What is the EMR transient cluster pattern and why is it preferred over long-running clusters for batch workloads?
7. How do you handle Spot interruptions in an EMR Spark job without losing all progress?
8. How does EMR integrate with the Glue Data Catalog for schema management?