Glue

Serverless ETL to discover, prepare, and combine data for analytics and ML

AWS Glue is a fully serverless ETL service that discovers, catalogs, and transforms data across S3, databases, and data warehouses without managing any infrastructure. At its core are three components: the Glue Data Catalog (a central metadata store), Glue ETL jobs (Spark or Python Shell running serverless), and Glue Crawlers (schema discovery agents). Glue is the backbone of most AWS data lake architectures and integrates natively with Athena, EMR, Redshift, and Lake Formation.

Glue Architecture - Catalog, Crawlers, and Jobs

The three core components, plus two related features (Workflows and DataBrew), work together as follows:

Component	What It Does	Key Concepts
Data Catalog	Central metadata store - databases, tables, schemas, partitions	Compatible with Hive Metastore; used by Athena, EMR, Redshift Spectrum
Crawlers	Scan S3, JDBC, DynamoDB and infer schema automatically	Classifier chain, schema versioning, partition detection
ETL Jobs	Run Spark (Glue ETL) or Python (Glue Python Shell) transformations	DPUs for Spark, max capacity for Python
Workflows	Orchestrate crawlers + jobs with triggers and dependencies	On-schedule, on-demand, or event-triggered
DataBrew	Visual no-code data preparation tool	Separate product - for non-engineers

💡

The Glue Data Catalog is a shared resource - Athena, EMR, and Redshift Spectrum can all use it as their metastore (Athena by default; EMR and Redshift Spectrum once they are pointed at it). A table defined once in Glue is queryable from all three services without being redefined. This is the core value of Glue in a data lake.
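
To make the "define once, query anywhere" idea concrete, here is a minimal sketch of reading a table definition back out of the catalog with boto3's `get_table`. The helper names and the sample response are hypothetical; the response is hand-written in the shape `get_table` returns, so the example runs without AWS access.

```python
def summarize_table(get_table_response):
    """Flatten a Glue GetTable response into the fields a query engine needs."""
    table = get_table_response["Table"]
    storage = table["StorageDescriptor"]
    return {
        "location": storage["Location"],
        "columns": [(c["Name"], c["Type"]) for c in storage["Columns"]],
        "partition_keys": [k["Name"] for k in table.get("PartitionKeys", [])],
    }


def fetch_table_summary(database, table_name):
    """Look the table up in the Data Catalog (needs boto3 and AWS credentials)."""
    import boto3  # imported lazily so summarize_table works without AWS access

    glue = boto3.client("glue")
    response = glue.get_table(DatabaseName=database, Name=table_name)
    return summarize_table(response)


# Hand-written sample in the shape GetTable returns (illustrative values only).
sample_response = {
    "Table": {
        "Name": "raw_events",
        "StorageDescriptor": {
            "Location": "s3://my-lake/raw/events/",
            "Columns": [
                {"Name": "user_id", "Type": "string"},
                {"Name": "event_ts", "Type": "string"},
            ],
        },
        "PartitionKeys": [{"Name": "year", "Type": "string"}],
    }
}

summary = summarize_table(sample_response)
print(summary["location"], summary["columns"], summary["partition_keys"])
```

Athena, EMR, and Redshift Spectrum resolve the same location, columns, and partition keys from this one definition, which is why no per-service table setup is needed.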

Glue ETL Jobs - DPUs, Worker Types, and Script Generation

Glue ETL jobs run Apache Spark on managed infrastructure. You do not provision or configure clusters - you choose a worker type and number of workers (or let Glue auto-scale).

Worker Type	vCPU	Memory	Storage	Best For
Standard	4 vCPU	16 GB	50 GB disk	Legacy - use G.1X instead
G.1X	4 vCPU	16 GB	64 GB NVMe	Memory-intensive transforms
G.2X	8 vCPU	32 GB	128 GB NVMe	ML transforms, heavy shuffles
G.4X	16 vCPU	64 GB	256 GB NVMe	Very large datasets
G.8X	32 vCPU	128 GB	512 GB NVMe	Maximum single-job throughput
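
Choosing a worker type and worker count comes down to two fields on the job definition. The sketch below builds the keyword arguments for boto3's `create_job`; the job name, role ARN, script path, and helper names are hypothetical. The DPU mapping is derived from the vCPU column at 4 vCPU per DPU, and the `--enable-auto-scaling` argument assumes Glue 3.0 or later.

```python
# DPUs per worker, derived from the vCPU column (4 vCPU + 16 GB = 1 DPU).
DPU_PER_WORKER = {"G.1X": 1, "G.2X": 2, "G.4X": 4, "G.8X": 8}


def total_dpus(worker_type, number_of_workers):
    """DPUs the job consumes while all workers are running."""
    return DPU_PER_WORKER[worker_type] * number_of_workers


def build_job_definition(name, role_arn, script_location, worker_type, number_of_workers):
    """Return keyword arguments suitable for glue.create_job(**definition)."""
    if worker_type not in DPU_PER_WORKER:
        raise ValueError(f"unsupported worker type: {worker_type}")
    return {
        "Name": name,
        "Role": role_arn,
        "Command": {
            "Name": "glueetl",  # Spark ETL job (as opposed to "pythonshell")
            "ScriptLocation": script_location,
            "PythonVersion": "3",
        },
        "GlueVersion": "4.0",
        "WorkerType": worker_type,
        # With auto-scaling enabled this is the upper bound on workers.
        "NumberOfWorkers": number_of_workers,
        "DefaultArguments": {"--enable-auto-scaling": "true"},
    }


job = build_job_definition(
    name="nightly-events",  # hypothetical job name
    role_arn="arn:aws:iam::123456789012:role/GlueJobRole",  # hypothetical role
    script_location="s3://my-scripts/events.py",  # hypothetical script path
    worker_type="G.2X",
    number_of_workers=10,
)
print(job["WorkerType"], total_dpus("G.2X", 10))
```

Passing the dictionary to `boto3.client("glue").create_job(**job)` would register the job; the DPU total is what the per-DPU-hour price is multiplied by.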

Glue generates PySpark code from visual transforms in Glue Studio. You can also write custom PySpark or Scala scripts. Glue adds helper classes (GlueContext, DynamicFrame) on top of standard Spark.

python

# Example Glue ETL script structure
import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job

args = getResolvedOptions(sys.argv, ['JOB_NAME'])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)

# Read from Glue Catalog
datasource = glueContext.create_dynamic_frame.from_catalog(
    database="my_database",
    table_name="raw_events"
)

# Apply mapping/transform
mapped = ApplyMapping.apply(
    frame=datasource,
    mappings=[
        ("user_id", "string", "userId", "string"),
        ("event_ts", "string", "eventTime", "timestamp")
    ]
)

# Write to S3 in Parquet
glueContext.write_dynamic_frame.from_options(
    frame=mapped,
    connection_type="s3",
    connection_options={"path": "s3://my-lake/processed/"},
    format="parquet"
)
job.commit()

⚠️

DynamicFrames are Glue's schema-flexible alternative to Spark DataFrames. Each record is self-describing, so they handle schema inconsistencies (missing columns, mixed types) more gracefully, but they are often slower than native DataFrames and support fewer operations. Convert to a DataFrame with .toDF() for complex transforms, and convert back with DynamicFrame.fromDF() before writing through the GlueContext.
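
The round trip between the two types might look like the following sketch. It assumes the column names produced by the mapping step earlier (`userId`, `eventTime`); the function name and the frame name `cleaned_events` are hypothetical. It only runs inside a Glue job, where `awsglue` is available.

```python
def clean_events(glue_context, dynamic_frame):
    """Drop rows without a userId and de-duplicate, using DataFrame operations."""
    # Imported here because awsglue only exists inside the Glue runtime.
    from awsglue.dynamicframe import DynamicFrame

    df = dynamic_frame.toDF()  # DynamicFrame -> Spark DataFrame
    df = df.filter("userId IS NOT NULL")  # SQL-expression filter
    df = df.dropDuplicates(["userId", "eventTime"])  # keep one row per event
    # Spark DataFrame -> DynamicFrame, ready for glueContext.write_dynamic_frame
    return DynamicFrame.fromDF(df, glue_context, "cleaned_events")
```

In a job script this would be called as `cleaned = clean_events(glueContext, mapped)` and the result passed to the writer in place of `mapped`.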

Glue Crawlers - Schema Discovery and Partition Management

Crawlers scan data sources and populate the Glue Data Catalog with table definitions, column types, and partition metadata. They run on demand, on a cron schedule you define, or when started by a workflow trigger.
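
A crawler is defined by its target, the catalog database it writes to, and optionally a schedule. Here is a minimal sketch of the keyword arguments for boto3's `create_crawler`; the crawler name, role ARN, bucket path, and helper name are hypothetical, and the schema change policy shown is only one reasonable choice.

```python
def build_crawler_definition(name, role_arn, database, s3_path, schedule=None):
    """Return keyword arguments suitable for glue.create_crawler(**definition)."""
    definition = {
        "Name": name,
        "Role": role_arn,
        "DatabaseName": database,
        "Targets": {"S3Targets": [{"Path": s3_path}]},
        "SchemaChangePolicy": {
            "UpdateBehavior": "UPDATE_IN_DATABASE",  # apply detected schema changes
            "DeleteBehavior": "LOG",  # only log objects that disappeared
        },
    }
    if schedule is not None:
        # Glue cron syntax; without a schedule the crawler runs on demand.
        definition["Schedule"] = schedule
    return definition


nightly = build_crawler_definition(
    name="raw-events-crawler",  # hypothetical crawler name
    role_arn="arn:aws:iam::123456789012:role/GlueCrawlerRole",  # hypothetical role
    database="my_database",
    s3_path="s3://my-lake/raw/events/",  # hypothetical prefix
    schedule="cron(0 2 * * ? *)",  # every day at 02:00 UTC
)
on_demand = build_crawler_definition(
    "adhoc-crawler", nightly["Role"], "my_database", "s3://my-lake/raw/adhoc/"
)
print("Schedule" in nightly, "Schedule" in on_demand)
```

An on-demand crawler created this way would be started explicitly with `start_crawler(Name=...)`.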

Crawler Behavior	Details
Schema inference	Samples files and infers column names/types using classifiers
Partition detection	Detects Hive-style partitions (e.g. s3://bucket/year=2025/month=01/)
Schema change behavior	Add new columns, update existing, or ignore - configurable
Supported sources	S3, JDBC (RDS, Redshift), DynamoDB, Delta Lake, Iceberg
Custom classifiers	Define Grok patterns for custom formats (e.g. custom log files)
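
To show what "Hive-style" means in practice, the sketch below pulls `key=value` segments out of an S3 path. It illustrates the naming convention only; it is not the crawler's actual detection logic, and the function name is hypothetical.

```python
from urllib.parse import urlparse


def parse_hive_partitions(s3_uri):
    """Return the key=value path segments of an S3 URI, in path order."""
    path = urlparse(s3_uri).path
    partitions = {}
    for segment in path.split("/"):
        key, sep, value = segment.partition("=")
        if sep and key and value:  # skip segments that are not key=value
            partitions[key] = value
    return partitions


print(parse_hive_partitions("s3://bucket/year=2025/month=01/part-0000.parquet"))
# {'year': '2025', 'month': '01'}
```

A path without `key=value` segments, such as `s3://bucket/2025/01/`, yields no named partition keys here, which is why Hive-style layouts are easier to work with.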

💡

Running crawlers on every new data file is expensive and slow. For structured pipelines where you control the schema, skip crawlers entirely and define Glue tables manually with partition projection (Athena) or explicit DDL. Use crawlers only for discovery of unknown or evolving schemas.
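
Defining the table manually with partition projection could look like the sketch below, which builds a `TableInput` for boto3's `create_table`. It assumes Parquet data laid out as `year=YYYY/month=MM/` under one prefix; the table name, columns, year range, and helper name are hypothetical.

```python
def build_projected_table(name, location, columns, year_range):
    """Build a Glue TableInput for Parquet data partitioned by year and month."""
    location = location.rstrip("/") + "/"  # the template needs a trailing slash
    return {
        "Name": name,
        "TableType": "EXTERNAL_TABLE",
        "PartitionKeys": [
            {"Name": "year", "Type": "int"},
            {"Name": "month", "Type": "int"},
        ],
        "StorageDescriptor": {
            "Columns": [{"Name": n, "Type": t} for n, t in columns],
            "Location": location,
            "InputFormat": "org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat",
            "OutputFormat": "org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat",
            "SerdeInfo": {
                "SerializationLibrary": "org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe"
            },
        },
        "Parameters": {
            "classification": "parquet",
            # Athena computes partitions from these rules instead of the catalog.
            "projection.enabled": "true",
            "projection.year.type": "integer",
            "projection.year.range": f"{year_range[0]},{year_range[1]}",
            "projection.month.type": "integer",
            "projection.month.range": "1,12",
            "projection.month.digits": "2",  # month=01, not month=1
            "storage.location.template": f"{location}year=${{year}}/month=${{month}}/",
        },
    }


table_input = build_projected_table(
    name="processed_events",  # hypothetical table name
    location="s3://my-lake/processed",
    columns=[("userId", "string"), ("eventTime", "timestamp")],
    year_range=(2020, 2030),  # hypothetical range
)
print(table_input["Parameters"]["storage.location.template"])
```

Registering it would be a single call, `boto3.client("glue").create_table(DatabaseName="my_database", TableInput=table_input)`, after which new partitions need neither a crawler run nor `ALTER TABLE ADD PARTITION`.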

Glue Pricing Model

Glue bills ETL jobs and crawlers per Data Processing Unit (DPU) hour, metered per second and subject to a minimum duration per run. Prices vary by region.

Component	Price	Notes
Glue ETL job (G.1X worker)	$0.44/DPU-hour	1 DPU = 4 vCPU + 16 GB RAM; 1-min minimum on Glue 2.0+, 10-min minimum on Glue 0.9/1.0
Glue Python Shell job	$0.44/DPU-hour	0.0625 or 1 DPU; 1-min minimum; good for lightweight scripts
Glue Crawler	$0.44/DPU-hour	Minimum 10 minutes per crawl run
Glue Data Catalog	$1/100,000 objects/month	First 1M objects free
Glue DataBrew	$0.48/node-hour for jobs; $1 per 30-min interactive session	Separate product

💡

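Billing minimums mean very short runs are rounded up. Crawlers and jobs on legacy Glue versions (0.9/1.0) bill at least 10 minutes per run, while Spark jobs on Glue 2.0+ and Python Shell jobs bill at least 1 minute. For transforms that finish in seconds, consider Lambda or a Python script on EC2 instead. Glue shines for jobs that run for several minutes or longer.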
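
The effect of a billing minimum on a short run can be estimated with a few lines of arithmetic. This is a minimal sketch, assuming the $0.44/DPU-hour rate from the table; the minimum is a parameter because it depends on the component and Glue version, and the function name is hypothetical.

```python
def estimate_run_cost(dpus, runtime_seconds, minimum_minutes, price_per_dpu_hour=0.44):
    """Estimate the cost of one run, rounding short runs up to the billing minimum."""
    billed_seconds = max(runtime_seconds, minimum_minutes * 60)
    return dpus * billed_seconds / 3600 * price_per_dpu_hour


# A 90-second run on 10 DPUs (for example, 10 G.1X workers).
with_10_min_minimum = estimate_run_cost(dpus=10, runtime_seconds=90, minimum_minutes=10)
with_1_min_minimum = estimate_run_cost(dpus=10, runtime_seconds=90, minimum_minutes=1)

print(f"{with_10_min_minimum:.2f}")  # 0.73
print(f"{with_1_min_minimum:.2f}")   # 0.11
```

Under a 10-minute minimum the 90-second run is billed as if it ran for 600 seconds, so it costs more than six times what the actual runtime would suggest.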

Glue ETL vs EMR - Choosing the Right Tool

Dimension	Glue ETL	EMR
Infrastructure	Fully serverless - no cluster config	You choose instance types and cluster size
Language support	PySpark, Scala, Python Shell	Spark, Hive, Presto, HBase, Flink, and more
Startup time	2-4 minutes per job	5-15 minutes for cluster
Cost for large jobs	Higher per DPU-hour	Lower with Spot instances
Debugging	Glue Studio, CloudWatch, SparkUI	Full Spark UI, YARN UI, SSH access
Custom libraries	Supported via S3 wheel upload	Full control via bootstrap actions
Catalog integration	Native - built on Glue Catalog	Requires Glue Catalog configuration

💡

Glue is better for teams that want managed ETL without Spark expertise. EMR is better for teams with complex Spark workloads, custom frameworks, or cost sensitivity at scale. Many organizations use both: Glue for simple ETL and EMR for heavy data engineering.

🎯

Interview Focus Points

1. What is the Glue Data Catalog and why is it important for a multi-service data lake architecture?
2. Explain the difference between a Glue DynamicFrame and a Spark DataFrame - when do you convert between them?
3. What are the trade-offs between using Glue ETL and EMR for Spark jobs?
4. When would you use a Glue Crawler vs manually defining a Glue table?
5. How does Glue integrate with Athena, EMR, and Redshift Spectrum?
6. Explain Glue job bookmarks - what problem do they solve and how do they work?
7. How do you handle schema evolution in Glue ETL when upstream data adds new columns?
8. What billing minimums apply to Glue jobs and crawlers, and how do they affect Glue job design?