Athena

Serverless SQL query engine for data stored in S3 - pay per query, no infrastructure

Amazon Athena is a serverless, interactive query service that lets you run SQL directly against data stored in S3 without loading it into a database. You pay only for the data scanned per query (roughly $5 per TB), making it extremely cost-effective for ad-hoc analytics on large datasets. Athena uses Presto under the hood and integrates tightly with the Glue Data Catalog for schema management.

How Athena Executes Queries

Athena is a managed Presto cluster. When you submit a query, Athena resolves table schemas from the Glue Data Catalog, plans the query, and dispatches distributed read tasks directly against S3. Results are written to an S3 output location you configure.

Key aspects of query execution:

Aspect	Detail
Schema on read	Data stays in S3 in any format - schema is applied at query time
Formats supported	CSV, JSON, Parquet, ORC, Avro, TSV, Gzip/Snappy compressed
Concurrency	Default 20 concurrent queries per account (increase via support)
Query result cache	Results cached for 0-7 days - reuse avoids re-scanning S3
DML support	INSERT INTO (write to S3), CREATE TABLE AS SELECT (CTAS)
Federation	Query RDS, DynamoDB, Redshift, and custom sources via connectors

💡

Athena results are always written to an S3 output bucket you control. The results from the last query for a given query ID are reusable - Athena result reuse can eliminate repeated scans on unchanged data.

Performance and Cost Optimization

Athena charges $5 per TB scanned. The single most impactful optimization is reducing the amount of data scanned:

Optimization	Savings Potential	How It Works
Columnar format (Parquet/ORC)	60-90%	Only scans columns referenced in SELECT/WHERE
Partitioning	50-99%	Prunes partitions with WHERE on partition key (e.g. year/month/day)
Compression (Snappy/Zstd)	30-70%	Fewer bytes to read from S3
File size (128-512 MB)	10-40%	Avoids S3 request overhead from thousands of small files
Result reuse	Up to 100%	Identical queries within TTL return cached result, no scan
Projection pruning	20-60%	SELECT specific columns not SELECT *

⚠️

Partitioning only helps if your WHERE clause filters on the partition key. A query like WHERE event_time > '2025-01-01' does nothing if you partitioned by year/month/day but wrote the column as a timestamp string. Always use partition projection or Glue partitions with consistent naming.

bash

-- Create a partitioned external table
CREATE EXTERNAL TABLE events (
  user_id STRING,
  event_type STRING,
  event_data STRING
)
PARTITIONED BY (year STRING, month STRING, day STRING)
STORED AS PARQUET
LOCATION 's3://my-data-lake/events/';

-- Use partition projection to avoid MSCK REPAIR TABLE
TBLPROPERTIES (
  'projection.enabled' = 'true',
  'projection.year.type' = 'integer',
  'projection.year.range' = '2023,2030',
  'projection.month.type' = 'integer',
  'projection.month.range' = '01,12',
  'projection.month.digits' = '2',
  'projection.day.type' = 'integer',
  'projection.day.range' = '01,31',
  'projection.day.digits' = '2',
  'storage.location.template' = 's3://my-data-lake/events/${year}/${month}/${day}/'
)

Workgroups - Isolation, Cost Control, and Access

Workgroups let you isolate query environments, enforce spending limits, and separate output locations per team or use case.

Workgroup Setting	What It Controls
Data usage limit	Max bytes scanned per query or per workgroup per day - stops runaway queries
Output location override	Force all results to a specific S3 path (per-team buckets)
Encryption	SSE-S3, SSE-KMS, or CSE-KMS for query results
Query result reuse	Enable/disable result caching per workgroup
Engine version	Control Athena engine version (Athena engine v3 = latest)

💡

Setting a per-query data usage limit (e.g. 1 GB max scan) is the best safeguard against analysts accidentally running full table scans on petabyte datasets. Combine with IAM policies that restrict workgroup access per team.

Athena Federated Query - Querying Beyond S3

Athena Federated Query uses Lambda-based connectors to let you join S3 data with live data in other sources in a single SQL query.

Connector	Source
DynamoDB	NoSQL tables - useful for joining reference data
RDS/Aurora	MySQL, PostgreSQL relational data
Redshift	Data warehouse tables
CloudWatch Logs	Query log groups as SQL tables
DocumentDB	MongoDB-compatible JSON documents
Custom	Any source via the Athena Query Federation SDK

⚠️

Federated queries are slower than native S3 queries because data must pass through Lambda. Each Lambda connector has a memory limit and execution timeout. Do not use federation for high-frequency production queries - it is best suited for ad-hoc analysis.

Pricing Model and When Athena Beats Redshift

Athena pricing is simple: $5 per TB scanned (after compression and file format savings). There are no cluster costs, no idle costs, and no minimum fees.

Scenario	Athena	Redshift
Query frequency	Infrequent/ad-hoc	Frequent/scheduled
Data freshness	Works directly on S3 (real-time)	Requires COPY or Redshift Spectrum
Concurrency	Low-medium (20 concurrent queries)	High (hundreds of concurrent users)
Data volume	Works well at any scale if partitioned	Best for warm, curated datasets
ETL cost	None - no loading step	S3 staging + COPY cost
Cost model	Pay per query	Pay per hour (or Serverless)

💡

Athena is the right default for data lake analytics, log analysis, and teams that query infrequently. Redshift is better for BI dashboards with many concurrent users running complex joins on curated data.

🎯

Interview Focus Points

1How does Athena pricing work and what are the three most impactful ways to reduce cost?
2Why does Parquet format reduce Athena query cost and how does columnar storage enable this?
3Explain Athena partition projection - what problem does it solve vs MSCK REPAIR TABLE?
4How would you architect an S3 data lake that is optimized for Athena queries?
5When would you use Athena over Redshift, and when would you use Redshift over Athena?
6What is Athena Federated Query and what are its limitations for production use?
7How do Athena workgroups help with cost governance in a multi-team environment?
8A query that used to run in 30 seconds now takes 5 minutes - walk me through how you would diagnose and fix it.