Batch

Fully managed batch computing at any scale using EC2 or Fargate

AWS Batch is a fully managed service for running batch computing jobs at any scale. You submit jobs and Batch automatically provisions the optimal compute resources (EC2, Spot, or Fargate), schedules the work, and cleans up after completion. It eliminates the need to install and manage batch computing software.

Core Components

Component	Description
Job	A unit of work - a containerized application with a command to run, memory/CPU requirements, and environment variables.
Job Definition	A blueprint for a job - container image, resource requirements, retry strategy, timeout, mount points. Versioned.
Job Queue	Jobs are submitted to a queue. Each queue has a priority and is associated with one or more Compute Environments.
Compute Environment	The EC2 or Fargate infrastructure where jobs run. Managed (AWS provisions/scales) or Unmanaged (you manage the instances).
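Scheduling Policy	An optional fair-share policy attached to a job queue. It controls how compute is divided among the share identifiers (for example users or groups) submitting to that queue. Without one, the queue runs jobs first in, first out.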
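
To show how these components relate, the sketch below builds the request payloads for a job definition and a job submission as plain Python dictionaries. The key names follow the boto3 Batch client's register_job_definition and submit_job parameters. The image URI, command, queue name, and job names are hypothetical placeholders, not values from this page.

```python
def build_job_definition(name, image, vcpus, memory_mib):
    """Return kwargs shaped for boto3's batch.register_job_definition."""
    return {
        "jobDefinitionName": name,
        "type": "container",
        "containerProperties": {
            "image": image,
            "command": ["python", "etl.py"],  # hypothetical entry point
            # vCPU and memory (MiB) are passed as strings.
            "resourceRequirements": [
                {"type": "VCPU", "value": str(vcpus)},
                {"type": "MEMORY", "value": str(memory_mib)},
            ],
            "environment": [{"name": "STAGE", "value": "dev"}],
        },
        "retryStrategy": {"attempts": 3},
        "timeout": {"attemptDurationSeconds": 3600},
    }


def build_submit_request(job_name, queue, job_definition):
    """Return kwargs shaped for boto3's batch.submit_job."""
    return {
        "jobName": job_name,
        "jobQueue": queue,
        "jobDefinition": job_definition,
    }


# Hypothetical names, for illustration only.
job_def = build_job_definition(
    "nightly-etl",
    "123456789012.dkr.ecr.us-east-1.amazonaws.com/etl:latest",
    vcpus=2,
    memory_mib=4096,
)
submit = build_submit_request("nightly-etl-run", "default-queue", "nightly-etl")
```

With boto3 installed and credentials configured, these dictionaries could be passed as boto3.client("batch").register_job_definition(**job_def) and boto3.client("batch").submit_job(**submit).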

Compute Environments

Managed compute environments let you specify the instance types, vCPU range, and whether to use On-Demand or Spot. AWS handles fleet management.

EC2 Managed: specify minimum/desired/maximum vCPUs. AWS launches and terminates instances as jobs arrive and complete.
Spot Managed: same as EC2 but uses Spot Instances for up to 90% cost savings. Best for fault-tolerant batch workloads.
Fargate: jobs run in serverless containers - no EC2 instances to manage. Better for shorter, smaller jobs; slower to start for large bursts.
Unmanaged: you launch and manage the EC2 instances yourself, then register them, through the ECS container agent, with the ECS cluster that Batch creates for the compute environment.
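
The managed options above differ mainly in the computeResources block of the request. A minimal sketch, assuming the boto3 Batch client's create_compute_environment parameter names. The subnet, security group, and instance role values are hypothetical placeholders.

```python
def build_spot_compute_environment(name, subnets, security_groups,
                                   instance_role, max_vcpus=256):
    """Kwargs for a managed Spot compute environment on EC2."""
    return {
        "computeEnvironmentName": name,
        "type": "MANAGED",
        "state": "ENABLED",
        "computeResources": {
            "type": "SPOT",
            "allocationStrategy": "SPOT_CAPACITY_OPTIMIZED",
            "minvCpus": 0,            # scale to zero when the queue is empty
            "maxvCpus": max_vcpus,
            "instanceTypes": ["optimal"],
            "subnets": list(subnets),
            "securityGroupIds": list(security_groups),
            "instanceRole": instance_role,
        },
    }


def build_fargate_compute_environment(name, subnets, security_groups,
                                      max_vcpus=64):
    """Kwargs for a Fargate compute environment (no instance settings)."""
    return {
        "computeEnvironmentName": name,
        "type": "MANAGED",
        "state": "ENABLED",
        "computeResources": {
            "type": "FARGATE",
            "maxvCpus": max_vcpus,
            "subnets": list(subnets),
            "securityGroupIds": list(security_groups),
        },
    }


# Hypothetical identifiers, for illustration only.
spot_env = build_spot_compute_environment(
    "spot-ce", ["subnet-aaaa1111"], ["sg-bbbb2222"], "ecsInstanceRole"
)
fargate_env = build_fargate_compute_environment(
    "fargate-ce", ["subnet-aaaa1111"], ["sg-bbbb2222"]
)
```

Either dictionary could be passed to boto3.client("batch").create_compute_environment(**spot_env). The Fargate variant omits instance types, minimum vCPUs, and the instance role because there is no EC2 fleet to configure.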

💡

Spot is highly recommended for fault-tolerant Batch workloads. A job interrupted by a Spot reclamation is retried only if a retry strategy is configured, so set automatic retries (1 to 10 attempts) in the job definition.
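
One way to retry only infrastructure failures such as Spot reclamation is a retry strategy with evaluateOnExit rules. The sketch below uses the commonly documented pattern of retrying when the status reason starts with "Host EC2" and exiting otherwise. The decide_action helper is a hypothetical local approximation of the matching for illustration, not Batch's own implementation, and the sample reason strings are made up.

```python
from fnmatch import fnmatchcase

# Shaped for the retryStrategy field of a job definition.
SPOT_RETRY_STRATEGY = {
    "attempts": 5,
    "evaluateOnExit": [
        # Instance reclaimed or terminated underneath the job: try again.
        {"onStatusReason": "Host EC2*", "action": "RETRY"},
        # Anything else (for example an application error): stop retrying.
        {"onReason": "*", "action": "EXIT"},
    ],
}


def decide_action(strategy, status_reason="", reason=""):
    """Approximate rule evaluation locally: the first matching rule wins."""
    for rule in strategy["evaluateOnExit"]:
        checks = []
        if "onStatusReason" in rule:
            checks.append(fnmatchcase(status_reason, rule["onStatusReason"]))
        if "onReason" in rule:
            checks.append(fnmatchcase(reason, rule["onReason"]))
        if checks and all(checks):
            return rule["action"]
    return "RETRY"  # assumed fallback when no rule matches


spot_case = decide_action(
    SPOT_RETRY_STRATEGY,
    status_reason="Host EC2 (instance i-0abc) terminated.",
)
app_case = decide_action(
    SPOT_RETRY_STRATEGY,
    status_reason="Essential container in task exited",
    reason="OutOfMemoryError: container killed",
)
```

Separating the two cases avoids spending retry attempts on failures that would repeat on every attempt, such as a bug in the job's own code.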

Advanced Job Patterns

Array Jobs: run the same job definition N times in parallel, each with a unique index. Ideal for parameter sweeps, rendering frames, parallel data processing.
Job Dependencies: a job can depend on the success of another job (or an array job element) before starting. Enables DAG-like pipelines.
Multi-node Parallel Jobs: run a single job across multiple nodes for tightly coupled distributed workloads like MPI applications.
Job Timeouts: automatically terminate and mark as FAILED any job attempt that exceeds a maximum duration (minimum 60 seconds). An attempt terminated by a timeout is not retried.
EventBridge Integration: Batch emits events on job state changes (SUBMITTED, PENDING, RUNNABLE, STARTING, RUNNING, FAILED, SUCCEEDED) to EventBridge for monitoring and alerting.
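
The patterns above can be sketched in a few lines. Each child of an array job receives its index in the AWS_BATCH_JOB_ARRAY_INDEX environment variable, which the worker can use to pick its slice of the input. The submit request uses the boto3 submit_job field names arrayProperties and dependsOn. The function names, file names, and job IDs here are hypothetical.

```python
import os


def current_array_index(environ=os.environ):
    """Read the index Batch assigns to this child of an array job."""
    return int(environ.get("AWS_BATCH_JOB_ARRAY_INDEX", "0"))


def shard_for_index(items, array_size, index):
    """Items for one child: every array_size-th item, starting at index."""
    return items[index::array_size]


def build_array_submit(job_name, queue, job_definition, size,
                       depends_on_job_id=None):
    """Kwargs for batch.submit_job with an array size and optional dependency."""
    request = {
        "jobName": job_name,
        "jobQueue": queue,
        "jobDefinition": job_definition,
        "arrayProperties": {"size": size},
    }
    if depends_on_job_id is not None:
        # Start only after the referenced job has succeeded.
        request["dependsOn"] = [{"jobId": depends_on_job_id}]
    return request


# EventBridge rule pattern matching Batch jobs that reach FAILED.
FAILED_JOB_EVENT_PATTERN = {
    "source": ["aws.batch"],
    "detail-type": ["Batch Job State Change"],
    "detail": {"status": ["FAILED"]},
}

frames = [f"frame-{i:03d}.png" for i in range(10)]
my_frames = shard_for_index(frames, array_size=4, index=1)
render = build_array_submit("render", "default-queue", "render-def", 4,
                            depends_on_job_id="hypothetical-job-id")
```

Striding through the input by index keeps every child independent: no coordination is needed, and a retried child recomputes exactly the same slice.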

Common Use Cases

ML training pipelines: train models on large datasets, one job per experiment or fold
Video transcoding and image rendering: process each asset as a separate array job element
ETL data processing: nightly or hourly batch transformations on data in S3
Financial risk calculations: Monte Carlo simulations, portfolio analysis
Genomics and scientific computing: sequence alignment, variant calling pipelines

🎯

Interview Focus Points

1. Batch vs Lambda - when does Batch make more sense than Lambda for background processing?
2. How do Array Jobs work and what problem do they solve?
3. Managed vs Unmanaged Compute Environments - tradeoffs
4. How does Batch handle Spot interruptions?
5. Job queue priorities - how does Batch decide which queue to pull from?