AWS Analytics & Big Data
Athena
Serverless SQL query engine for data stored in S3 - pay per query, no infrastructure
Amazon Athena is a serverless, interactive query service that lets you run SQL directly against data stored in S3 without loading it into a database. You pay only for the data scanned per query (roughly $5 per TB), making it extremely cost-effective for ad-hoc analytics on large datasets. Athena uses Presto under the hood and integrates tightly with the Glue Data Catalog for schema management.
How Athena Executes Queries
Athena is a managed Presto cluster. When you submit a query, Athena resolves table schemas from the Glue Data Catalog, plans the query, and dispatches distributed read tasks directly against S3. Results are written to an S3 output location you configure.
Key aspects of query execution:
| Aspect | Detail |
|---|---|
| Schema on read | Data stays in S3 in any format - schema is applied at query time |
| Formats supported | CSV, JSON, Parquet, ORC, Avro, TSV, Gzip/Snappy compressed |
| Concurrency | Default 20 concurrent queries per account (increase via support) |
| Query result cache | Results cached for 0-7 days - reuse avoids re-scanning S3 |
| DML support | INSERT INTO (write to S3), CREATE TABLE AS SELECT (CTAS) |
| Federation | Query RDS, DynamoDB, Redshift, and custom sources via connectors |
Athena results are always written to an S3 output bucket you control. The results from the last query for a given query ID are reusable - Athena result reuse can eliminate repeated scans on unchanged data.
Performance and Cost Optimization
Athena charges $5 per TB scanned. The single most impactful optimization is reducing the amount of data scanned:
| Optimization | Savings Potential | How It Works |
|---|---|---|
| Columnar format (Parquet/ORC) | 60-90% | Only scans columns referenced in SELECT/WHERE |
| Partitioning | 50-99% | Prunes partitions with WHERE on partition key (e.g. year/month/day) |
| Compression (Snappy/Zstd) | 30-70% | Fewer bytes to read from S3 |
| File size (128-512 MB) | 10-40% | Avoids S3 request overhead from thousands of small files |
| Result reuse | Up to 100% | Identical queries within TTL return cached result, no scan |
| Projection pruning | 20-60% | SELECT specific columns not SELECT * |
Partitioning only helps if your WHERE clause filters on the partition key. A query like WHERE event_time > '2025-01-01' does nothing if you partitioned by year/month/day but wrote the column as a timestamp string. Always use partition projection or Glue partitions with consistent naming.
-- Create a partitioned external table
CREATE EXTERNAL TABLE events (
user_id STRING,
event_type STRING,
event_data STRING
)
PARTITIONED BY (year STRING, month STRING, day STRING)
STORED AS PARQUET
LOCATION 's3://my-data-lake/events/';
-- Use partition projection to avoid MSCK REPAIR TABLE
TBLPROPERTIES (
'projection.enabled' = 'true',
'projection.year.type' = 'integer',
'projection.year.range' = '2023,2030',
'projection.month.type' = 'integer',
'projection.month.range' = '01,12',
'projection.month.digits' = '2',
'projection.day.type' = 'integer',
'projection.day.range' = '01,31',
'projection.day.digits' = '2',
'storage.location.template' = 's3://my-data-lake/events/${year}/${month}/${day}/'
)Workgroups - Isolation, Cost Control, and Access
Workgroups let you isolate query environments, enforce spending limits, and separate output locations per team or use case.
| Workgroup Setting | What It Controls |
|---|---|
| Data usage limit | Max bytes scanned per query or per workgroup per day - stops runaway queries |
| Output location override | Force all results to a specific S3 path (per-team buckets) |
| Encryption | SSE-S3, SSE-KMS, or CSE-KMS for query results |
| Query result reuse | Enable/disable result caching per workgroup |
| Engine version | Control Athena engine version (Athena engine v3 = latest) |
Setting a per-query data usage limit (e.g. 1 GB max scan) is the best safeguard against analysts accidentally running full table scans on petabyte datasets. Combine with IAM policies that restrict workgroup access per team.
Athena Federated Query - Querying Beyond S3
Athena Federated Query uses Lambda-based connectors to let you join S3 data with live data in other sources in a single SQL query.
| Connector | Source |
|---|---|
| DynamoDB | NoSQL tables - useful for joining reference data |
| RDS/Aurora | MySQL, PostgreSQL relational data |
| Redshift | Data warehouse tables |
| CloudWatch Logs | Query log groups as SQL tables |
| DocumentDB | MongoDB-compatible JSON documents |
| Custom | Any source via the Athena Query Federation SDK |
Federated queries are slower than native S3 queries because data must pass through Lambda. Each Lambda connector has a memory limit and execution timeout. Do not use federation for high-frequency production queries - it is best suited for ad-hoc analysis.
Pricing Model and When Athena Beats Redshift
Athena pricing is simple: $5 per TB scanned (after compression and file format savings). There are no cluster costs, no idle costs, and no minimum fees.
| Scenario | Athena | Redshift |
|---|---|---|
| Query frequency | Infrequent/ad-hoc | Frequent/scheduled |
| Data freshness | Works directly on S3 (real-time) | Requires COPY or Redshift Spectrum |
| Concurrency | Low-medium (20 concurrent queries) | High (hundreds of concurrent users) |
| Data volume | Works well at any scale if partitioned | Best for warm, curated datasets |
| ETL cost | None - no loading step | S3 staging + COPY cost |
| Cost model | Pay per query | Pay per hour (or Serverless) |
Athena is the right default for data lake analytics, log analysis, and teams that query infrequently. Redshift is better for BI dashboards with many concurrent users running complex joins on curated data.
Interview Focus Points
- 1How does Athena pricing work and what are the three most impactful ways to reduce cost?
- 2Why does Parquet format reduce Athena query cost and how does columnar storage enable this?
- 3Explain Athena partition projection - what problem does it solve vs MSCK REPAIR TABLE?
- 4How would you architect an S3 data lake that is optimized for Athena queries?
- 5When would you use Athena over Redshift, and when would you use Redshift over Athena?
- 6What is Athena Federated Query and what are its limitations for production use?
- 7How do Athena workgroups help with cost governance in a multi-team environment?
- 8A query that used to run in 30 seconds now takes 5 minutes - walk me through how you would diagnose and fix it.