Ace Cloud Interviews
📈

AWS Analytics & Big Data

Athena

Serverless SQL query engine for data stored in S3 - pay per query, no infrastructure

Amazon Athena is a serverless, interactive query service that lets you run SQL directly against data stored in S3 without loading it into a database. You pay only for the data scanned per query (roughly $5 per TB), making it extremely cost-effective for ad-hoc analytics on large datasets. Athena uses Presto under the hood and integrates tightly with the Glue Data Catalog for schema management.

How Athena Executes Queries

Athena is a managed Presto cluster. When you submit a query, Athena resolves table schemas from the Glue Data Catalog, plans the query, and dispatches distributed read tasks directly against S3. Results are written to an S3 output location you configure.

Key aspects of query execution:

AspectDetail
Schema on readData stays in S3 in any format - schema is applied at query time
Formats supportedCSV, JSON, Parquet, ORC, Avro, TSV, Gzip/Snappy compressed
ConcurrencyDefault 20 concurrent queries per account (increase via support)
Query result cacheResults cached for 0-7 days - reuse avoids re-scanning S3
DML supportINSERT INTO (write to S3), CREATE TABLE AS SELECT (CTAS)
FederationQuery RDS, DynamoDB, Redshift, and custom sources via connectors
💡

Athena results are always written to an S3 output bucket you control. The results from the last query for a given query ID are reusable - Athena result reuse can eliminate repeated scans on unchanged data.

Performance and Cost Optimization

Athena charges $5 per TB scanned. The single most impactful optimization is reducing the amount of data scanned:

OptimizationSavings PotentialHow It Works
Columnar format (Parquet/ORC)60-90%Only scans columns referenced in SELECT/WHERE
Partitioning50-99%Prunes partitions with WHERE on partition key (e.g. year/month/day)
Compression (Snappy/Zstd)30-70%Fewer bytes to read from S3
File size (128-512 MB)10-40%Avoids S3 request overhead from thousands of small files
Result reuseUp to 100%Identical queries within TTL return cached result, no scan
Projection pruning20-60%SELECT specific columns not SELECT *
⚠️

Partitioning only helps if your WHERE clause filters on the partition key. A query like WHERE event_time > '2025-01-01' does nothing if you partitioned by year/month/day but wrote the column as a timestamp string. Always use partition projection or Glue partitions with consistent naming.

bash
-- Create a partitioned external table
CREATE EXTERNAL TABLE events (
  user_id STRING,
  event_type STRING,
  event_data STRING
)
PARTITIONED BY (year STRING, month STRING, day STRING)
STORED AS PARQUET
LOCATION 's3://my-data-lake/events/';

-- Use partition projection to avoid MSCK REPAIR TABLE
TBLPROPERTIES (
  'projection.enabled' = 'true',
  'projection.year.type' = 'integer',
  'projection.year.range' = '2023,2030',
  'projection.month.type' = 'integer',
  'projection.month.range' = '01,12',
  'projection.month.digits' = '2',
  'projection.day.type' = 'integer',
  'projection.day.range' = '01,31',
  'projection.day.digits' = '2',
  'storage.location.template' = 's3://my-data-lake/events/${year}/${month}/${day}/'
)

Workgroups - Isolation, Cost Control, and Access

Workgroups let you isolate query environments, enforce spending limits, and separate output locations per team or use case.

Workgroup SettingWhat It Controls
Data usage limitMax bytes scanned per query or per workgroup per day - stops runaway queries
Output location overrideForce all results to a specific S3 path (per-team buckets)
EncryptionSSE-S3, SSE-KMS, or CSE-KMS for query results
Query result reuseEnable/disable result caching per workgroup
Engine versionControl Athena engine version (Athena engine v3 = latest)
💡

Setting a per-query data usage limit (e.g. 1 GB max scan) is the best safeguard against analysts accidentally running full table scans on petabyte datasets. Combine with IAM policies that restrict workgroup access per team.

Athena Federated Query - Querying Beyond S3

Athena Federated Query uses Lambda-based connectors to let you join S3 data with live data in other sources in a single SQL query.

ConnectorSource
DynamoDBNoSQL tables - useful for joining reference data
RDS/AuroraMySQL, PostgreSQL relational data
RedshiftData warehouse tables
CloudWatch LogsQuery log groups as SQL tables
DocumentDBMongoDB-compatible JSON documents
CustomAny source via the Athena Query Federation SDK
⚠️

Federated queries are slower than native S3 queries because data must pass through Lambda. Each Lambda connector has a memory limit and execution timeout. Do not use federation for high-frequency production queries - it is best suited for ad-hoc analysis.

Pricing Model and When Athena Beats Redshift

Athena pricing is simple: $5 per TB scanned (after compression and file format savings). There are no cluster costs, no idle costs, and no minimum fees.

ScenarioAthenaRedshift
Query frequencyInfrequent/ad-hocFrequent/scheduled
Data freshnessWorks directly on S3 (real-time)Requires COPY or Redshift Spectrum
ConcurrencyLow-medium (20 concurrent queries)High (hundreds of concurrent users)
Data volumeWorks well at any scale if partitionedBest for warm, curated datasets
ETL costNone - no loading stepS3 staging + COPY cost
Cost modelPay per queryPay per hour (or Serverless)
💡

Athena is the right default for data lake analytics, log analysis, and teams that query infrequently. Redshift is better for BI dashboards with many concurrent users running complex joins on curated data.

🎯

Interview Focus Points

  • 1How does Athena pricing work and what are the three most impactful ways to reduce cost?
  • 2Why does Parquet format reduce Athena query cost and how does columnar storage enable this?
  • 3Explain Athena partition projection - what problem does it solve vs MSCK REPAIR TABLE?
  • 4How would you architect an S3 data lake that is optimized for Athena queries?
  • 5When would you use Athena over Redshift, and when would you use Redshift over Athena?
  • 6What is Athena Federated Query and what are its limitations for production use?
  • 7How do Athena workgroups help with cost governance in a multi-team environment?
  • 8A query that used to run in 30 seconds now takes 5 minutes - walk me through how you would diagnose and fix it.