Ace Cloud Interviews

EMR

Big data processing with Hadoop, Spark, Hive, and Presto on managed clusters

Amazon EMR (Elastic MapReduce) is a managed cluster platform for running big data frameworks like Apache Spark, Hadoop, Hive, and Presto at petabyte scale. It provisions, configures, and auto-scales EC2 clusters so you can focus on processing logic rather than infrastructure. EMR is widely used in data engineering pipelines for ETL, machine learning feature engineering, and large-scale log analysis.

How EMR Clusters Execute Jobs

An EMR cluster consists of a primary node (coordinates jobs), core nodes (run tasks and store HDFS data), and optional task nodes (run tasks only, no HDFS). When you submit a job, the primary node distributes work across core and task nodes.

| Node Type | Role | HDFS Storage | Can Spot? |
| --- | --- | --- | --- |
| Primary | Coordinates YARN, HDFS NameNode, job history | No | Not recommended |
| Core | Runs tasks + stores HDFS blocks | Yes | Risky - data loss on termination |
| Task | Runs tasks only, no HDFS | No | Yes - safe to use Spot |
💡

Use On-Demand for primary and core nodes. Spot instances are safe for task nodes because they hold no HDFS data - losing them only slows the job, it does not corrupt data.
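As a rough illustration of this tip, the sketch below builds the instance-group part of a cluster request in the shape that boto3's `run_job_flow` accepts under `Instances["InstanceGroups"]`. The helper name `build_instance_groups`, the instance type, and the counts are assumptions chosen for illustration, not recommendations. Note that the API still uses the role name `MASTER` for the primary node.

```python
from __future__ import annotations


def build_instance_groups(core_count: int, task_count: int,
                          instance_type: str = "m5.xlarge") -> list[dict]:
    """Return EMR instance groups: On-Demand primary/core, Spot task nodes."""
    groups = [
        # Primary and core nodes hold cluster state and HDFS blocks,
        # so they stay On-Demand.
        {"Name": "Primary", "InstanceRole": "MASTER", "Market": "ON_DEMAND",
         "InstanceType": instance_type, "InstanceCount": 1},
        {"Name": "Core", "InstanceRole": "CORE", "Market": "ON_DEMAND",
         "InstanceType": instance_type, "InstanceCount": core_count},
    ]
    if task_count > 0:
        # Task nodes store no HDFS data, so a Spot reclaim only slows the job.
        groups.append(
            {"Name": "Task", "InstanceRole": "TASK", "Market": "SPOT",
             "InstanceType": instance_type, "InstanceCount": task_count})
    return groups


if __name__ == "__main__":
    for group in build_instance_groups(core_count=2, task_count=4):
        print(group["Name"], group["Market"], group["InstanceCount"])
```

The returned list could be passed as `Instances={"InstanceGroups": groups, ...}` in a `run_job_flow` call.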

EMR supports two storage modes: HDFS (ephemeral, fast, local to the cluster, and lost when the cluster terminates) and EMRFS (S3-backed, so data persists after the cluster is gone). Most modern EMR architectures use EMRFS to decouple storage from compute.

Cluster vs Serverless vs Studio

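AWS offers three EMR deployment modes: EMR on EC2, EMR Serverless, and EMR on EKS. EMR Studio is not a fourth mode but a notebook environment that attaches to that compute. Choosing the right option depends on job duration, cost sensitivity, and interactivity needs.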

| Mode | Use Case | Cold Start | Cost Model |
| --- | --- | --- | --- |
| EMR on EC2 | Long-running clusters, custom AMIs, full control | Minutes | Pay per second for EC2 plus the EMR fee |
| EMR Serverless | Ephemeral batch jobs, no cluster management | 1-2 min | Pay per vCPU/GB-hour of job runtime |
| EMR on EKS | Run Spark on existing EKS clusters | Seconds (pod) | Pay for EKS worker nodes |
| EMR Studio | Interactive notebooks (JupyterLab) | N/A | Pay for underlying cluster |
💡

EMR Serverless is the best default choice for new batch workloads: there is no cluster to manage, and you only pay while a job runs. The main limitations are the lack of persistent HDFS and limited support for custom applications.
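To make the Serverless option concrete, here is a minimal sketch of the request that boto3's `emr-serverless` client takes in `start_job_run`. It only builds the dictionary, so it runs without AWS credentials. The application ID, role ARN, and S3 paths are hypothetical placeholders, and `build_serverless_job_run` is an illustrative helper, not part of any SDK.

```python
from __future__ import annotations


def build_serverless_job_run(application_id: str, execution_role_arn: str,
                             entry_point: str, args: list[str],
                             main_class: str) -> dict:
    """Build keyword arguments for emr-serverless start_job_run (Spark job)."""
    return {
        "applicationId": application_id,
        "executionRoleArn": execution_role_arn,
        "jobDriver": {
            "sparkSubmit": {
                "entryPoint": entry_point,            # JAR or script in S3
                "entryPointArguments": list(args),    # passed to the job
                "sparkSubmitParameters": f"--class {main_class}",
            }
        },
    }


if __name__ == "__main__":
    request = build_serverless_job_run(
        application_id="00example",  # hypothetical application ID
        execution_role_arn="arn:aws:iam::123456789012:role/ExampleJobRole",
        entry_point="s3://my-bucket/jars/etl.jar",
        args=["s3://my-bucket/input/", "s3://my-bucket/output/"],
        main_class="com.example.ETLJob",
    )
    print(request)
    # With credentials configured, the job could be started like this:
    #   import boto3
    #   boto3.client("emr-serverless").start_job_run(**request)
```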

Running Apache Spark on EMR

Spark is the most common EMR workload. You can submit jobs via spark-submit, EMR Steps, or the EMR API.

```bash
# Submit a Spark job as an EMR Step
aws emr add-steps \
  --cluster-id j-XXXXXXXXXXXX \
  --steps Type=Spark,Name="ETL Job",\
ActionOnFailure=CONTINUE,\
Args=[--deploy-mode,cluster,\
--class,com.example.ETLJob,\
s3://my-bucket/jars/etl.jar,\
s3://my-bucket/input/,\
s3://my-bucket/output/]
```
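The same step can also be submitted through the EMR API. The sketch below builds a step definition in the shape boto3's `add_job_flow_steps` expects, using `command-runner.jar` to invoke `spark-submit`. The helper `build_spark_step` is an illustrative name, and the class and S3 paths are the placeholders from the CLI example; the actual API call is left as a comment so the block runs without credentials.

```python
from __future__ import annotations


def build_spark_step(name: str, main_class: str, jar: str,
                     app_args: list[str],
                     action_on_failure: str = "CONTINUE") -> dict:
    """Build one EMR step that runs spark-submit in cluster deploy mode."""
    return {
        "Name": name,
        "ActionOnFailure": action_on_failure,
        "HadoopJarStep": {
            # command-runner.jar executes the given command on the cluster.
            "Jar": "command-runner.jar",
            "Args": ["spark-submit", "--deploy-mode", "cluster",
                     "--class", main_class, jar, *app_args],
        },
    }


if __name__ == "__main__":
    step = build_spark_step(
        name="ETL Job",
        main_class="com.example.ETLJob",
        jar="s3://my-bucket/jars/etl.jar",
        app_args=["s3://my-bucket/input/", "s3://my-bucket/output/"],
    )
    print(step)
    # With credentials configured, submit it to a running cluster:
    #   import boto3
    #   boto3.client("emr").add_job_flow_steps(
    #       JobFlowId="j-XXXXXXXXXXXX", Steps=[step])
```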

Key Spark tuning parameters for EMR:

| Parameter | Purpose | Typical Value |
| --- | --- | --- |
| spark.executor.memory | Memory per executor | 4-16g depending on instance type |
| spark.executor.cores | Cores per executor | 2-4 (leave 1 core per node for the OS and YARN daemons) |
| spark.dynamicAllocation.enabled | Scale executors with load | true for variable workloads |
| spark.sql.adaptive.enabled | Adaptive query execution | true (default since Spark 3.2) |
| spark.sql.shuffle.partitions | Number of partitions used for shuffles (joins, aggregations) | 200 default, tune to data size |
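One way to turn the executor rows of this table into numbers is sketched below. It assumes you know the node's vCPU count and the memory YARN can hand out on that node (EMR sets this per instance type, and it is less than the instance's total RAM). It reserves one core per node and leaves room for Spark's default executor memory overhead of 10%. The function name and the example node size are illustrative assumptions, not EMR defaults.

```python
import math


def size_executors(node_vcpus: int, yarn_memory_gb: float,
                   cores_per_executor: int = 4,
                   overhead_factor: float = 0.10) -> dict:
    """Suggest per-node executor count and spark.executor.* settings."""
    usable_cores = node_vcpus - 1          # keep 1 core for OS/YARN daemons
    executors = usable_cores // cores_per_executor
    if executors < 1:
        raise ValueError("node too small for the requested cores per executor")
    # Each YARN container must fit executor memory plus memory overhead.
    container_gb = yarn_memory_gb / executors
    executor_memory_gb = math.floor(container_gb / (1 + overhead_factor))
    return {
        "executors_per_node": executors,
        "spark.executor.cores": cores_per_executor,
        "spark.executor.memory": f"{executor_memory_gb}g",
    }


if __name__ == "__main__":
    # Hypothetical node: 16 vCPUs with 48 GB available to YARN.
    print(size_executors(node_vcpus=16, yarn_memory_gb=48))
```

Treat the result as a starting point and verify it against the Spark UI, since real memory needs depend on the workload.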
⚠️

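The default spark.sql.shuffle.partitions=200 is rarely right at scale. With large datasets, each of the 200 partitions is too big, which causes spills and OOM errors; with small datasets, 200 partitions produce many tiny output files. As a starting point, set it to roughly (input size GB) * 2, which works out to about 512 MB per partition, then adjust. Too many partitions create small files and slow S3 writes; too few cause OOM errors.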
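The "(input size GB) * 2" rule of thumb is equivalent to targeting about 512 MB per shuffle partition. The sketch below makes that target explicit so it can be adjusted; `shuffle_partitions` and its default target are illustrative assumptions rather than Spark settings.

```python
import math


def shuffle_partitions(shuffle_size_gb: float,
                       target_partition_mb: int = 512,
                       min_partitions: int = 1) -> int:
    """Estimate spark.sql.shuffle.partitions from data size and target size."""
    partitions = math.ceil(shuffle_size_gb * 1024 / target_partition_mb)
    return max(partitions, min_partitions)


if __name__ == "__main__":
    for size_gb in (100, 1000):
        print(f"{size_gb} GB -> {shuffle_partitions(size_gb)} partitions")
    # A smaller target gives more, smaller partitions for the same data.
    print(shuffle_partitions(100, target_partition_mb=128))
```

The value can then be passed at submit time, for example `--conf spark.sql.shuffle.partitions=2000`.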

Pricing Model and Cost Optimization

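EMR pricing has two components: the underlying EC2 cost plus an EMR surcharge per instance-hour. The surcharge varies by instance type and stays the same whether the instance runs as On-Demand, Reserved, or Spot. For mainstream instance types it is commonly about a quarter of the EC2 On-Demand price, so relative to a discounted Spot price it is a much larger share of the bill.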
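A small worked example of the two-component model, assuming the EMR surcharge is a flat per-hour amount that is not reduced by a Spot discount. The prices below are made-up round numbers for illustration only; real values come from the EC2 and EMR pricing pages.

```python
def node_hourly_cost(ec2_on_demand: float, emr_fee: float,
                     spot_discount: float = 0.0) -> float:
    """Hourly cost of one EMR node: discounted EC2 price plus flat EMR fee."""
    if not 0.0 <= spot_discount < 1.0:
        raise ValueError("spot_discount must be in [0, 1)")
    return ec2_on_demand * (1.0 - spot_discount) + emr_fee


if __name__ == "__main__":
    # Hypothetical prices: $0.20/h EC2 On-Demand, $0.05/h EMR surcharge.
    on_demand = node_hourly_cost(0.20, 0.05)
    spot = node_hourly_cost(0.20, 0.05, spot_discount=0.70)
    print(f"On-Demand node: ${on_demand:.3f}/h")
    print(f"Spot node:      ${spot:.3f}/h")
    # The EMR fee becomes a larger share of the Spot node's cost.
    print(f"EMR fee share on Spot: {0.05 / spot:.0%}")
```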

| Strategy | Savings | Tradeoff |
| --- | --- | --- |
| Spot for task nodes | 60-90% | Job runs slower if Spot is reclaimed |
| Reserved Instances for core nodes | 30-60% | Requires 1-3 year commitment |
| EMR Serverless vs always-on cluster | Up to 80% for intermittent jobs | Cold start latency per job |
| Instance fleets with multiple types | 20-40% | More complex configuration |
| Auto-scaling task nodes | 10-30% | Scaling lag can affect SLAs |
💡

For jobs that run a few hours per day, EMR Serverless almost always beats a persistent cluster on cost. For jobs running more than 8 hours per day, a persistent cluster with Reserved Instances wins.
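The break-even point depends on your actual rates, so it is worth computing rather than memorizing. The sketch below compares an always-on cluster (billed 24 hours a day, optionally with a Reserved Instance discount) against Serverless billed only while jobs run. All rates in the example are hypothetical and were picked to land on an 8-hour break-even; plug in your own numbers.

```python
def break_even_hours_per_day(cluster_hourly: float, serverless_hourly: float,
                             ri_discount: float = 0.0) -> float:
    """Daily job hours above which an always-on cluster is cheaper."""
    # The persistent cluster is billed around the clock.
    daily_cluster_cost = 24 * cluster_hourly * (1.0 - ri_discount)
    # Serverless is billed only for job runtime, at a higher hourly rate.
    return min(24.0, daily_cluster_cost / serverless_hourly)


if __name__ == "__main__":
    # Hypothetical rates: $10/h cluster with a 40% RI discount,
    # $18/h for equivalent Serverless capacity while a job runs.
    hours = break_even_hours_per_day(10.0, 18.0, ri_discount=0.40)
    print(f"Serverless is cheaper below about {hours:.1f} job hours per day")
```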

Common EMR Architectures

EMR fits into data pipelines in several well-established patterns:

| Pattern | Description | Tools |
| --- | --- | --- |
| S3 Data Lake ETL | Read raw S3 data, transform with Spark, write Parquet back to S3 | EMR + Spark + Glue Catalog |
| Lambda Architecture | Batch layer (EMR) + speed layer (Kinesis) merge for queries | EMR + Kinesis + DynamoDB |
| Feature Store Pipeline | Transform raw events into ML features nightly | EMR Spark + SageMaker Feature Store |
| Log Aggregation | Process CloudWatch/VPC Flow logs at scale | EMR + Hive or Spark SQL |
| Transient Clusters | Spin up per job, terminate when done, data in S3 | EMR Steps + S3 + Glue Catalog |
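The transient-cluster pattern in the last row can be sketched as a single `run_job_flow` request: the cluster starts, runs its steps, and terminates because `KeepJobFlowAliveWhenNoSteps` is false. The `spark-hive-site` classification points Spark at the Glue Data Catalog. The helper name, release label, instance sizes, and default role names here are assumptions for illustration; the block only builds the request and does not call AWS.

```python
from __future__ import annotations


def build_transient_cluster_request(name: str, release_label: str,
                                    steps: list[dict]) -> dict:
    """Build run_job_flow kwargs for a cluster that ends after its steps."""
    glue_factory = ("com.amazonaws.glue.catalog.metastore."
                    "AWSGlueDataCatalogHiveClientFactory")
    return {
        "Name": name,
        "ReleaseLabel": release_label,
        "Applications": [{"Name": "Spark"}],
        "Configurations": [{
            # Use the Glue Data Catalog as the Spark SQL metastore.
            "Classification": "spark-hive-site",
            "Properties": {
                "hive.metastore.client.factory.class": glue_factory},
        }],
        "Instances": {
            "InstanceGroups": [
                {"InstanceRole": "MASTER", "Market": "ON_DEMAND",
                 "InstanceType": "m5.xlarge", "InstanceCount": 1},
                {"InstanceRole": "CORE", "Market": "ON_DEMAND",
                 "InstanceType": "m5.xlarge", "InstanceCount": 2},
            ],
            # Terminate automatically once all steps have finished.
            "KeepJobFlowAliveWhenNoSteps": False,
        },
        "Steps": steps,
        "JobFlowRole": "EMR_EC2_DefaultRole",  # assumed default role names
        "ServiceRole": "EMR_DefaultRole",
    }


if __name__ == "__main__":
    request = build_transient_cluster_request(
        "nightly-etl", "emr-6.15.0", steps=[])
    print(request["Instances"]["KeepJobFlowAliveWhenNoSteps"])
    # With credentials: boto3.client("emr").run_job_flow(**request)
```

Because input and output live in S3 and table definitions live in the Glue Catalog, nothing of value is lost when the cluster terminates.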
🎯

Interview Focus Points

  1. What is the difference between core nodes and task nodes, and why is Spot safe for task nodes but risky for core nodes?
  2. When would you choose EMR Serverless over a persistent EC2 cluster?
  3. How do you tune Spark shuffle partitions and why does the default of 200 cause problems at scale?
  4. Explain EMRFS vs HDFS - when do you use each and what are the trade-offs?
  5. How does EMR auto-scaling work, and what metrics trigger scale-out?
  6. What is the EMR transient cluster pattern and why is it preferred over long-running clusters for batch workloads?
  7. How do you handle Spot interruptions in an EMR Spark job without losing all progress?
  8. How does EMR integrate with the Glue Data Catalog for schema management?