Glue

Serverless ETL to discover, prepare, and combine data for analytics and ML

AWS Glue is a fully serverless ETL service that discovers, catalogs, and transforms data across S3, databases, and data warehouses without managing any infrastructure. At its core are three components: the Glue Data Catalog (a central metadata store), Glue ETL jobs (Spark or Python Shell running serverless), and Glue Crawlers (schema discovery agents). Glue is the backbone of most AWS data lake architectures and integrates natively with Athena, EMR, Redshift, and Lake Formation.

Glue Architecture - Catalog, Crawlers, and Jobs

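Glue has three core components that work together (the Data Catalog, crawlers, and ETL jobs), plus supporting tools for orchestration and data preparation: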

| Component | What It Does | Key Concepts |
| --- | --- | --- |
| Data Catalog | Central metadata store - databases, tables, schemas, partitions | Compatible with Hive Metastore; used by Athena, EMR, Redshift Spectrum |
| Crawlers | Scan S3, JDBC, DynamoDB and infer schema automatically | Classifier chain, schema versioning, partition detection |
| ETL Jobs | Run Spark (Glue ETL) or Python (Glue Python Shell) transformations | DPUs for Spark, max capacity for Python |
| Workflows | Orchestrate crawlers + jobs with triggers and dependencies | On-schedule, on-demand, or event-triggered |
| DataBrew | Visual no-code data preparation tool | Separate product - for non-engineers |
💡

The Glue Data Catalog is a shared resource: Athena uses it as its default metastore, and EMR and Redshift Spectrum can be configured to use it as well. A table defined once in Glue is then queryable from all three services. This is the core value of Glue in a data lake.
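
To make the shared-metastore idea concrete, the sketch below reads one table definition from the catalog, which is roughly the metadata a query engine needs before it can scan the data. It assumes boto3's Glue client and its `get_table` call; the helper name `describe_table` and the database and table names are hypothetical. The client is passed in as an argument so the function can be exercised without AWS credentials.

```python
def describe_table(glue_client, database, table):
    """Return location, columns, and partition keys for one Data Catalog table.

    `glue_client` is expected to behave like boto3.client("glue").
    """
    response = glue_client.get_table(DatabaseName=database, Name=table)
    table_def = response["Table"]
    storage = table_def.get("StorageDescriptor", {})
    return {
        "location": storage.get("Location"),
        "columns": [(c["Name"], c.get("Type")) for c in storage.get("Columns", [])],
        "partition_keys": [k["Name"] for k in table_def.get("PartitionKeys", [])],
    }


# Usage (needs AWS credentials and an existing table; names are placeholders):
#   import boto3
#   info = describe_table(boto3.client("glue"), "my_database", "raw_events")
#   print(info["columns"])
```

Any engine that points at the same catalog would resolve the same location, columns, and partition keys, which is why a single definition can serve several services.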

Glue ETL Jobs - DPUs, Worker Types, and Script Generation

Glue ETL jobs run Apache Spark on managed infrastructure. You do not provision or configure clusters - you choose a worker type and number of workers (or let Glue auto-scale).

| Worker Type | vCPU | Memory | Storage | Best For |
| --- | --- | --- | --- | --- |
| Standard | 4 vCPU | 16 GB | 50 GB disk | Legacy - use G.1X instead |
| G.1X | 4 vCPU | 16 GB | 64 GB NVMe | Memory-intensive transforms |
| G.2X | 8 vCPU | 32 GB | 128 GB NVMe | ML transforms, heavy shuffles |
| G.4X | 16 vCPU | 64 GB | 256 GB NVMe | Very large datasets |
| G.8X | 32 vCPU | 128 GB | 512 GB NVMe | Maximum single-job throughput |
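
Choosing a worker type and worker count comes down to a handful of job settings. The sketch below builds the keyword arguments one might pass to boto3's `glue.create_job`. It only constructs a dictionary and does not call AWS. The helper name, the role ARN, the script path, and the `GlueVersion` value of "4.0" are assumptions for illustration; `--enable-auto-scaling` is the job argument that turns on auto scaling in recent Glue versions.

```python
ALLOWED_WORKER_TYPES = {"G.1X", "G.2X", "G.4X", "G.8X"}


def build_job_definition(name, role_arn, script_s3_path,
                         worker_type="G.1X", number_of_workers=10,
                         auto_scaling=False):
    """Build keyword arguments for glue.create_job (nothing is sent to AWS here)."""
    if worker_type not in ALLOWED_WORKER_TYPES:
        raise ValueError(f"unsupported worker type: {worker_type}")

    default_arguments = {}
    if auto_scaling:
        # With auto scaling on, NumberOfWorkers acts as the upper bound.
        default_arguments["--enable-auto-scaling"] = "true"

    return {
        "Name": name,
        "Role": role_arn,
        "Command": {
            "Name": "glueetl",  # Spark ETL job
            "ScriptLocation": script_s3_path,
            "PythonVersion": "3",
        },
        "GlueVersion": "4.0",  # assumed version for this sketch
        "WorkerType": worker_type,
        "NumberOfWorkers": number_of_workers,
        "DefaultArguments": default_arguments,
    }


# Usage (placeholders throughout):
#   import boto3
#   job = build_job_definition("nightly-etl", "arn:aws:iam::123456789012:role/GlueRole",
#                              "s3://my-lake/scripts/etl.py", "G.2X", 20, auto_scaling=True)
#   boto3.client("glue").create_job(**job)
```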

Glue generates PySpark code from visual transforms in Glue Studio. You can also write custom PySpark or Scala scripts. Glue adds helper classes (GlueContext, DynamicFrame) on top of standard Spark.

python
# Example Glue ETL script structure
import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job

args = getResolvedOptions(sys.argv, ['JOB_NAME'])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)

# Read from Glue Catalog
datasource = glueContext.create_dynamic_frame.from_catalog(
    database="my_database",
    table_name="raw_events"
)

# Apply mapping/transform
mapped = ApplyMapping.apply(
    frame=datasource,
    mappings=[
        ("user_id", "string", "userId", "string"),
        ("event_ts", "string", "eventTime", "timestamp")
    ]
)

# Write to S3 in Parquet
glueContext.write_dynamic_frame.from_options(
    frame=mapped,
    connection_type="s3",
    connection_options={"path": "s3://my-lake/processed/"},
    format="parquet"
)
job.commit()
⚠️

DynamicFrames are Glue's schema-flexible alternative to Spark DataFrames: each record is self-describing, so no schema is required up front. They handle schema inconsistencies (missing columns, mixed types) more gracefully but are often slower than native DataFrames. Convert to a DataFrame with .toDF() for complex transforms, and convert back with DynamicFrame.fromDF() when writing.
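
The "mixed types" problem is easier to see on a tiny dataset. The following is a plain-Python illustration, not Glue code: it finds fields that appear with more than one type across records and then casts one of them to a single type. In a Glue job the roughly analogous step is `DynamicFrame.resolveChoice`, for example with a spec such as `("user_id", "cast:string")`. The function names and sample records here are hypothetical.

```python
from collections import defaultdict


def find_choice_fields(records):
    """Return {field: sorted type names} for fields seen with more than one type."""
    seen = defaultdict(set)
    for record in records:
        for field, value in record.items():
            if value is not None:
                seen[field].add(type(value).__name__)
    return {field: sorted(types) for field, types in seen.items() if len(types) > 1}


def cast_field(records, field, caster):
    """Return new records with `field` cast by `caster`; missing values become None."""
    result = []
    for record in records:
        value = record.get(field)
        result.append({**record, field: caster(value) if value is not None else None})
    return result


records = [
    {"user_id": 101, "event_ts": "2025-01-01T00:00:00"},
    {"user_id": "102", "event_ts": "2025-01-01T00:05:00"},
    {"event_ts": "2025-01-01T00:10:00"},  # user_id missing entirely
]

print(find_choice_fields(records))  # {'user_id': ['int', 'str']}
cleaned = cast_field(records, "user_id", str)
print(find_choice_fields(cleaned))  # {}
```

A plain DataFrame needs one type per column before processing can start, whereas a DynamicFrame can carry the ambiguity until the job resolves it explicitly.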

Glue Crawlers - Schema Discovery and Partition Management

Crawlers scan data sources and populate the Glue Data Catalog with table definitions, column types, and partition metadata. They run on demand, on a schedule you define, or as a step in a Glue workflow.

| Crawler Behavior | Details |
| --- | --- |
| Schema inference | Samples files and infers column names/types using classifiers |
| Partition detection | Detects Hive-style partitions (e.g. s3://bucket/year=2025/month=01/) |
| Schema change behavior | Add new columns, update existing, or ignore - configurable |
| Supported sources | S3, JDBC (RDS, Redshift), DynamoDB, Delta Lake, Iceberg |
| Custom classifiers | Define Grok patterns for custom formats (e.g. custom log files) |
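
Hive-style partition detection amounts to reading `key=value` segments out of the object path. The sketch below shows that idea in plain Python. It is not the crawler's actual implementation, and the function name is hypothetical. It assumes file names themselves do not contain `=`.

```python
from urllib.parse import urlparse


def parse_hive_partitions(s3_uri):
    """Extract Hive-style partition keys and values from an S3 URI, in path order."""
    path = urlparse(s3_uri).path
    partitions = {}
    for segment in path.split("/"):
        key, separator, value = segment.partition("=")
        # Keep only well-formed key=value segments.
        if separator and key and value:
            partitions[key] = value
    return partitions


print(parse_hive_partitions("s3://bucket/year=2025/month=01/part-0000.parquet"))
# {'year': '2025', 'month': '01'}
```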
💡

Running crawlers on every new data file is expensive and slow. For structured pipelines where you control the schema, skip crawlers entirely and define Glue tables manually with partition projection (Athena) or explicit DDL. Use crawlers only for discovery of unknown or evolving schemas.
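
As a sketch of the manual alternative, the function below builds a `TableInput` dictionary for boto3's `glue.create_table` with Athena partition projection enabled, so Athena can compute partition locations itself and no crawler run is needed. It assumes Parquet data laid out as `year=YYYY/month=MM/` under the table location. The helper name, table name, columns, and year range are hypothetical; the `projection.*` and `storage.location.template` keys are Athena's partition projection table properties.

```python
def build_projected_table_input(name, location, columns, year_range=(2020, 2030)):
    """Build a Glue TableInput for a Parquet table using Athena partition projection."""
    base = location.rstrip("/")
    return {
        "Name": name,
        "TableType": "EXTERNAL_TABLE",
        "Parameters": {
            "classification": "parquet",
            "projection.enabled": "true",
            "projection.year.type": "integer",
            "projection.year.range": f"{year_range[0]},{year_range[1]}",
            "projection.month.type": "integer",
            "projection.month.range": "1,12",
            "projection.month.digits": "2",  # month=01, not month=1
            "storage.location.template": base + "/year=${year}/month=${month}/",
        },
        "PartitionKeys": [
            {"Name": "year", "Type": "int"},
            {"Name": "month", "Type": "int"},
        ],
        "StorageDescriptor": {
            "Columns": [{"Name": n, "Type": t} for n, t in columns],
            "Location": base + "/",
            "InputFormat": "org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat",
            "OutputFormat": "org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat",
            "SerdeInfo": {
                "SerializationLibrary": "org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe"
            },
        },
    }


# Usage (placeholders; requires AWS credentials):
#   import boto3
#   table_input = build_projected_table_input(
#       "events", "s3://my-lake/processed", [("userId", "string"), ("eventTime", "timestamp")])
#   boto3.client("glue").create_table(DatabaseName="my_database", TableInput=table_input)
```

The trade-off is that the schema and the partition ranges now live in code you maintain, which suits pipelines whose layout you control.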

Glue Pricing Model

Glue charges per Data Processing Unit (DPU) hour, billed per second, for both ETL jobs and crawlers. The Data Catalog is billed separately by object count. Prices vary by Region.

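| Component | Price | Notes |
| --- | --- | --- |
| Glue ETL job (G.1X worker) | $0.44/DPU-hour | 1 DPU = 4 vCPU + 16 GB RAM; 1-minute minimum billing on Glue 2.0 and later (10-minute minimum on Glue 0.9/1.0) |
| Glue Python Shell job | $0.44/DPU-hour | 0.0625 or 1 DPU; good for lightweight scripts |
| Glue Crawler | $0.44/DPU-hour | Minimum 10 minutes per crawl run |
| Glue Data Catalog | $1/100,000 objects/month | First 1M objects free |
| Glue DataBrew | $0.48/node-hour for jobs | Separate product; interactive sessions are billed per session |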
💡

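Spark jobs on Glue 2.0 and later are billed per second with a 1-minute minimum; the 10-minute minimum applies only to legacy Glue 0.9/1.0 jobs and to crawler runs. Job startup overhead still makes Glue a poor fit for transforms that finish in seconds, so consider Lambda or a Python script on EC2 for those. Glue shines for jobs that run for several minutes or more.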
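
The effect of a billing minimum is simple arithmetic: billed time is the larger of the actual runtime and the minimum. The sketch below estimates job cost from worker type, worker count, and runtime. The function name is hypothetical, the $0.44 rate is the figure from the table above, and the minimum is passed in explicitly because it depends on the job type and Glue version.

```python
# DPUs per worker: G.1X is 1 DPU (4 vCPU, 16 GB), and each larger size doubles it.
DPU_PER_WORKER = {"G.1X": 1, "G.2X": 2, "G.4X": 4, "G.8X": 8}


def estimate_job_cost(worker_type, workers, runtime_seconds, minimum_seconds,
                      rate_per_dpu_hour=0.44):
    """Estimate the cost in USD of one job run, applying the billing minimum."""
    billed_seconds = max(runtime_seconds, minimum_seconds)
    total_dpus = DPU_PER_WORKER[worker_type] * workers
    return total_dpus * (billed_seconds / 3600) * rate_per_dpu_hour


# A 3-minute job on 10 G.1X workers under two different minimums:
print(round(estimate_job_cost("G.1X", 10, 180, minimum_seconds=60), 2))   # 0.22
print(round(estimate_job_cost("G.1X", 10, 180, minimum_seconds=600), 2))  # 0.73
```

With a 10-minute minimum, the same 3-minute run costs more than three times as much, which is why short jobs suffer most from a long minimum.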

Glue ETL vs EMR - Choosing the Right Tool

| Dimension | Glue ETL | EMR |
| --- | --- | --- |
| Infrastructure | Fully serverless - no cluster config | You choose instance types and cluster size |
| Language support | PySpark, Scala, Python Shell | Spark, Hive, Presto, HBase, Flink, and more |
| Startup time | 2-4 minutes per job | 5-15 minutes for cluster |
| Cost for large jobs | Higher per DPU-hour | Lower with Spot instances |
| Debugging | Glue Studio, CloudWatch, Spark UI | Full Spark UI, YARN UI, SSH access |
| Custom libraries | Supported via S3 wheel upload | Full control via bootstrap actions |
| Catalog integration | Native - built on Glue Catalog | Requires Glue Catalog configuration |
💡

Glue is better for teams that want managed ETL without Spark expertise. EMR is better for teams with complex Spark workloads, custom frameworks, or cost sensitivity at scale. Many organizations use both: Glue for simple ETL and EMR for heavy data engineering.

🎯

Interview Focus Points

  1. What is the Glue Data Catalog and why is it important for a multi-service data lake architecture?
  2. Explain the difference between a Glue DynamicFrame and a Spark DataFrame - when do you convert between them?
  3. What are the trade-offs between using Glue ETL and EMR for Spark jobs?
  4. When would you use a Glue Crawler vs manually defining a Glue table?
  5. How does Glue integrate with Athena, EMR, and Redshift Spectrum?
  6. Explain Glue job bookmarks - what problem do they solve and how do they work?
  7. How do you handle schema evolution in Glue ETL when upstream data adds new columns?
  8. What are Glue's billing minimums (1 minute for Spark jobs on Glue 2.0 and later, 10 minutes for legacy versions and crawler runs) and how do they affect Glue job design?