AWS Analytics & Big Data
Glue
Serverless ETL to discover, prepare, and combine data for analytics and ML
AWS Glue is a fully serverless ETL service that discovers, catalogs, and transforms data across S3, databases, and data warehouses without managing any infrastructure. At its core are three components: the Glue Data Catalog (a central metadata store), Glue ETL jobs (Spark or Python Shell running serverless), and Glue Crawlers (schema discovery agents). Glue is the backbone of most AWS data lake architectures and integrates natively with Athena, EMR, Redshift, and Lake Formation.
Glue Architecture - Catalog, Crawlers, and Jobs
Glue has three core components (Data Catalog, Crawlers, and ETL Jobs), plus Workflows for orchestration and DataBrew for visual data preparation:
| Component | What It Does | Key Concepts |
|---|---|---|
| Data Catalog | Central metadata store - databases, tables, schemas, partitions | Compatible with Hive Metastore; used by Athena, EMR, Redshift Spectrum |
| Crawlers | Scan S3, JDBC, DynamoDB and infer schema automatically | Classifier chain, schema versioning, partition detection |
| ETL Jobs | Run Spark (Glue ETL) or Python (Glue Python Shell) transformations | DPUs for Spark, max capacity for Python |
| Workflows | Orchestrate crawlers + jobs with triggers and dependencies | On-schedule, on-demand, or event-triggered |
| DataBrew | Visual no-code data preparation tool | Separate product - for non-engineers |
The Glue Data Catalog is a shared resource - Athena and Redshift Spectrum read table metadata from it, and EMR can be configured to use it as its Hive metastore. A table defined once in Glue is queryable from all three services without redefining its schema. This is the core value of Glue in a data lake.
Glue ETL Jobs - DPUs, Worker Types, and Script Generation
Glue ETL jobs run Apache Spark on managed infrastructure. You do not provision or configure clusters - you choose a worker type and number of workers (or let Glue auto-scale).
| Worker Type | vCPU | Memory | Storage | Best For |
|---|---|---|---|---|
| Standard | 4 vCPU | 16 GB | 50 GB disk | Legacy - use G.1X instead |
| G.1X | 4 vCPU | 16 GB | 64 GB NVMe | Memory-intensive transforms |
| G.2X | 8 vCPU | 32 GB | 128 GB NVMe | ML transforms, heavy shuffles |
| G.4X | 16 vCPU | 64 GB | 256 GB NVMe | Very large datasets |
| G.8X | 32 vCPU | 128 GB | 512 GB NVMe | Maximum single-job throughput |
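Choosing a worker type and worker count can be sketched as a job definition. The helper below is a minimal illustration, not an official template: the function name, job name, role ARN, and script path are hypothetical, and the returned dictionary is shaped as keyword arguments for boto3's `glue.create_job`. The Glue version shown is an assumption; use whichever version your account targets.

```python
VALID_WORKER_TYPES = {"G.1X", "G.2X", "G.4X", "G.8X"}


def build_job_definition(name, role_arn, script_s3_path,
                         worker_type="G.1X", number_of_workers=10,
                         auto_scale=False):
    """Build keyword arguments for glue.create_job (illustrative values only)."""
    if worker_type not in VALID_WORKER_TYPES:
        raise ValueError(f"unsupported worker type: {worker_type}")

    default_args = {"--job-language": "python"}
    if auto_scale:
        # With auto scaling enabled, NumberOfWorkers acts as the upper bound.
        default_args["--enable-auto-scaling"] = "true"

    return {
        "Name": name,
        "Role": role_arn,
        "Command": {
            "Name": "glueetl",  # Spark ETL job type
            "ScriptLocation": script_s3_path,
            "PythonVersion": "3",
        },
        "GlueVersion": "4.0",  # assumption: pick the version you target
        "WorkerType": worker_type,
        "NumberOfWorkers": number_of_workers,
        "DefaultArguments": default_args,
    }


# Hypothetical names, for illustration only.
job_kwargs = build_job_definition(
    name="events-to-parquet",
    role_arn="arn:aws:iam::123456789012:role/GlueJobRole",
    script_s3_path="s3://my-lake/scripts/events_to_parquet.py",
    worker_type="G.2X",
    number_of_workers=20,
    auto_scale=True,
)

# To create the job (requires boto3 and AWS credentials):
#   import boto3
#   boto3.client("glue").create_job(**job_kwargs)
```

Keeping the definition as a plain dictionary makes it easy to review or unit-test before anything is created in an AWS account.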
Glue generates PySpark code from visual transforms in Glue Studio. You can also write custom PySpark or Scala scripts. Glue adds helper classes (GlueContext, DynamicFrame) on top of standard Spark.
```python
# Example Glue ETL script structure
import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job

args = getResolvedOptions(sys.argv, ['JOB_NAME'])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)

# Read from Glue Catalog
datasource = glueContext.create_dynamic_frame.from_catalog(
    database="my_database",
    table_name="raw_events"
)

# Apply mapping/transform
mapped = ApplyMapping.apply(
    frame=datasource,
    mappings=[
        ("user_id", "string", "userId", "string"),
        ("event_ts", "string", "eventTime", "timestamp")
    ]
)

# Write to S3 in Parquet
glueContext.write_dynamic_frame.from_options(
    frame=mapped,
    connection_type="s3",
    connection_options={"path": "s3://my-lake/processed/"},
    format="parquet"
)

job.commit()
```

DynamicFrames are Glue's wrapper around Spark DataFrames. They handle schema inconsistencies (missing columns, mixed types) more gracefully but are slower than native DataFrames. Convert to DataFrame with .toDF() for complex transforms and convert back with fromDF() when writing.
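The DynamicFrame-to-DataFrame round trip can be sketched as follows. This is one possible illustration, assuming the `userId` and `eventTime` columns produced by the mapping in the example script. The function name and the deduplication logic are hypothetical, and the code only runs inside a Glue (or otherwise awsglue-enabled) Spark environment.

```python
def keep_latest_event_per_user(dynamic_frame, glue_context):
    """Keep only the most recent event per user, using native Spark window functions."""
    # Imported lazily so this module can be loaded outside a Glue runtime.
    from awsglue.dynamicframe import DynamicFrame
    from pyspark.sql import Window
    from pyspark.sql import functions as F

    # DynamicFrame -> DataFrame for transforms that are easier in plain Spark.
    df = dynamic_frame.toDF()

    # Rank each user's events from newest to oldest and keep the first one.
    latest_first = Window.partitionBy("userId").orderBy(F.col("eventTime").desc())
    deduped = (
        df.withColumn("row_num", F.row_number().over(latest_first))
          .filter(F.col("row_num") == 1)
          .drop("row_num")
    )

    # DataFrame -> DynamicFrame so Glue writers can consume the result.
    return DynamicFrame.fromDF(deduped, glue_context, "latest_events")
```

In a job script this would sit between the mapping step and the write step, for example `latest = keep_latest_event_per_user(mapped, glueContext)`.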
Glue Crawlers - Schema Discovery and Partition Management
Crawlers scan data sources and populate the Glue Data Catalog with table definitions, column types, and partition metadata. They can run on demand, on a schedule you define, or from a workflow trigger.
| Crawler Behavior | Details |
|---|---|
| Schema inference | Samples files and infers column names/types using classifiers |
| Partition detection | Detects Hive-style partitions (e.g. s3://bucket/year=2025/month=01/) |
| Schema change behavior | Add new columns, update existing, or ignore - configurable |
| Supported sources | S3, JDBC (RDS, Redshift), DynamoDB, Delta Lake, Iceberg |
| Custom classifiers | Define Grok patterns for custom formats (e.g. custom log files) |
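The behaviors in the table map onto a crawler definition. The sketch below builds keyword arguments for boto3's `glue.create_crawler`; the helper name, crawler name, role ARN, database, path, and cron expression are all hypothetical, and the schema change policy shown is just one reasonable choice.

```python
def build_crawler_config(name, role_arn, database, s3_path, schedule=None):
    """Build keyword arguments for glue.create_crawler (illustrative values only)."""
    config = {
        "Name": name,
        "Role": role_arn,
        "DatabaseName": database,
        "Targets": {"S3Targets": [{"Path": s3_path}]},
        "SchemaChangePolicy": {
            # Apply detected schema changes to the catalog table.
            "UpdateBehavior": "UPDATE_IN_DATABASE",
            # Only log when source objects disappear; keep the table.
            "DeleteBehavior": "LOG",
        },
    }
    if schedule is not None:
        # Without a schedule the crawler runs only when started on demand.
        config["Schedule"] = schedule
    return config


# Hypothetical names, for illustration only.
crawler_kwargs = build_crawler_config(
    name="raw-events-crawler",
    role_arn="arn:aws:iam::123456789012:role/GlueCrawlerRole",
    database="my_database",
    s3_path="s3://my-lake/raw/events/",
    schedule="cron(0 2 * * ? *)",  # daily at 02:00 UTC
)

# To create the crawler (requires boto3 and AWS credentials):
#   import boto3
#   boto3.client("glue").create_crawler(**crawler_kwargs)
```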
Running crawlers on every new data file is expensive and slow. For structured pipelines where you control the schema, skip crawlers entirely and define Glue tables manually with partition projection (Athena) or explicit DDL. Use crawlers only for discovery of unknown or evolving schemas.
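Defining a table by hand instead of crawling might look like the sketch below. It builds a `TableInput` dictionary for boto3's `glue.create_table`, with Athena partition projection properties so new `year=/month=` partitions are resolved without a crawler run. The helper name, table name, columns, and year range are hypothetical, and the layout assumes Hive-style Parquet partitions under the given location.

```python
def build_projected_table(name, location, columns, year_range="2020,2030"):
    """Build a Glue TableInput for a Parquet table partitioned by year/month."""
    if not location.endswith("/"):
        location += "/"
    return {
        "Name": name,
        "TableType": "EXTERNAL_TABLE",
        "PartitionKeys": [
            {"Name": "year", "Type": "string"},
            {"Name": "month", "Type": "string"},
        ],
        "StorageDescriptor": {
            "Columns": [{"Name": col, "Type": typ} for col, typ in columns],
            "Location": location,
            "InputFormat": "org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat",
            "OutputFormat": "org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat",
            "SerdeInfo": {
                "SerializationLibrary": "org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe"
            },
        },
        "Parameters": {
            "classification": "parquet",
            # Athena computes partition locations from these properties
            # instead of reading partition metadata from the catalog.
            "projection.enabled": "true",
            "projection.year.type": "integer",
            "projection.year.range": year_range,
            "projection.month.type": "integer",
            "projection.month.range": "1,12",
            "projection.month.digits": "2",
            "storage.location.template": location + "year=${year}/month=${month}/",
        },
    }


# Hypothetical table, for illustration only.
table_input = build_projected_table(
    name="processed_events",
    location="s3://my-lake/processed",
    columns=[("userId", "string"), ("eventTime", "timestamp")],
)

# To register the table (requires boto3 and AWS credentials):
#   import boto3
#   boto3.client("glue").create_table(DatabaseName="my_database", TableInput=table_input)
```

Partition projection is an Athena feature; other engines reading the same catalog table still need partitions registered explicitly.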
Glue Pricing Model
Glue bills ETL jobs and crawlers by the Data Processing Unit (DPU) hour. The Data Catalog and DataBrew are priced separately.
| Component | Price | Notes |
|---|---|---|
| Glue ETL job (G.1X worker) | $0.44/DPU-hour | 1 DPU = 4 vCPU + 16 GB RAM; billed per second with a 1-minute minimum on Glue 2.0 and later (10-minute minimum on Glue 0.9/1.0) |
| Glue Python Shell job | $0.44/DPU-hour | 0.0625 or 1 DPU; good for lightweight scripts |
| Glue Crawler | $0.44/DPU-hour | Minimum 10 minutes per crawl run |
| Glue Data Catalog | $1/100,000 objects/month | First 1M objects free |
| Glue DataBrew | $1/node-hour | Separate product |
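Glue bills a minimum duration per run, so short runs are rounded up. The minimum is 10 minutes for crawlers and for Spark jobs on legacy Glue versions (0.9 and 1.0), and 1 minute for Spark jobs on Glue 2.0 and later. For transforms that finish in seconds, consider Lambda or a Python script on EC2 instead. Glue is most cost-effective when the run time comfortably exceeds the billing minimum.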
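The effect of a billing minimum can be sketched as a small calculation. The function below is a rough estimate, not a pricing tool: it uses the $0.44/DPU-hour rate from the table, assumes each G.1X/G.2X/G.4X/G.8X worker counts as 1/2/4/8 DPUs, and takes the minimum billed duration as a parameter because it depends on the job type and Glue version.

```python
PRICE_PER_DPU_HOUR = 0.44  # rate from the pricing table; varies by region
DPUS_PER_WORKER = {"G.1X": 1, "G.2X": 2, "G.4X": 4, "G.8X": 8}


def estimate_job_cost(worker_type, workers, runtime_seconds, minimum_seconds):
    """Estimate the cost of one job run, rounding short runs up to the minimum."""
    billed_seconds = max(runtime_seconds, minimum_seconds)
    dpu_hours = DPUS_PER_WORKER[worker_type] * workers * billed_seconds / 3600
    return dpu_hours * PRICE_PER_DPU_HOUR


# A 2-minute run on 10 G.1X workers is billed as 10 minutes under a 10-minute minimum.
short_run = estimate_job_cost("G.1X", 10, runtime_seconds=120, minimum_seconds=600)

# The same run under a 1-minute minimum is billed for its actual 2 minutes.
short_run_low_minimum = estimate_job_cost("G.1X", 10, runtime_seconds=120, minimum_seconds=60)

# A 30-minute run is unaffected by either minimum.
long_run = estimate_job_cost("G.1X", 10, runtime_seconds=1800, minimum_seconds=600)

print(f"short run, 10-min minimum: ${short_run:.2f}")
print(f"short run, 1-min minimum:  ${short_run_low_minimum:.2f}")
print(f"long run:                  ${long_run:.2f}")
```

Under a 10-minute minimum the short run costs five times what its actual runtime would, which is why very short jobs are better batched together or moved to a lighter service.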
Glue ETL vs EMR - Choosing the Right Tool
| Dimension | Glue ETL | EMR |
|---|---|---|
| Infrastructure | Fully serverless - no cluster config | You choose instance types and cluster size |
| Language support | PySpark, Scala, Python Shell | Spark, Hive, Presto, HBase, Flink, and more |
| Startup time | 2-4 minutes per job | 5-15 minutes for cluster |
| Cost for large jobs | Higher per DPU-hour | Lower with Spot instances |
| Debugging | Glue Studio, CloudWatch, SparkUI | Full Spark UI, YARN UI, SSH access |
| Custom libraries | Supported via S3 wheel upload | Full control via bootstrap actions |
| Catalog integration | Native - built on Glue Catalog | Requires Glue Catalog configuration |
Glue is better for teams that want managed ETL without Spark expertise. EMR is better for teams with complex Spark workloads, custom frameworks, or cost sensitivity at scale. Many organizations use both: Glue for simple ETL and EMR for heavy data engineering.
Interview Focus Points
1. What is the Glue Data Catalog and why is it important for a multi-service data lake architecture?
2. Explain the difference between a Glue DynamicFrame and a Spark DataFrame - when do you convert between them?
3. What are the trade-offs between using Glue ETL and EMR for Spark jobs?
4. When would you use a Glue Crawler vs manually defining a Glue table?
5. How does Glue integrate with Athena, EMR, and Redshift Spectrum?
6. Explain Glue job bookmarks - what problem do they solve and how do they work?
7. How do you handle schema evolution in Glue ETL when upstream data adds new columns?
8. What is the Glue billing minimum (10 minutes for crawlers and legacy job versions, 1 minute for Spark jobs on Glue 2.0 and later) and how does it affect Glue job design?