AWS Analytics & Big Data
Lake Formation
Build, secure, and manage data lakes on Amazon S3 with fine-grained access control
AWS Lake Formation simplifies building, securing, and managing data lakes on Amazon S3 by providing a central place to define fine-grained data permissions and audit access across integrated analytics services. It acts as a permissions layer on top of S3 and the Glue Data Catalog, enabling column-level and row-level access control that is enforced uniformly whether users query via Athena, EMR, or Redshift Spectrum. This makes it a common building block for enterprise data governance and compliance scenarios.
How Lake Formation Controls Data Access
Lake Formation sits between IAM/S3 and analytics services. Instead of managing S3 bucket policies and IAM policies separately per service, you grant permissions on Glue Catalog objects (databases, tables, columns) in Lake Formation. The service then enforces these across all integrated consumers.
| Without Lake Formation | With Lake Formation |
|---|---|
| S3 bucket policies + IAM roles per user/service | Single permission grant in Lake Formation console or API |
| No column-level or row-level control in S3 | Column inclusion/exclusion and row filter expressions (data filters) |
| Athena, EMR, Redshift each need their own IAM config | One policy enforced across all three |
| Audit trail scattered across per-service S3 and IAM events | Data access requests logged to CloudTrail through Lake Formation, in one place to review |
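Lake Formation offers LF-Tags, its form of attribute-based access control, for scalable permission management. Instead of granting access table by table (named resource grants), you attach key-value tags to databases, tables, and columns, then grant principals permissions on a tag expression such as domain=sales. Any resource that carries matching tags is covered automatically, so new tables need only the right tags rather than new grants. This scales much better than named resource grants for large data lakes with hundreds of tables.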
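The tag-matching idea can be sketched in a few lines of Python. This is a simplified local model, not Lake Formation's implementation: the tag keys (`domain`, `sensitivity`), their values, and the table names are hypothetical, and tag inheritance from database to table to column is not modeled. The `grant_request` helper builds the keyword arguments for boto3's `grant_permissions` call with an `LFTagPolicy` resource; it is never sent here.

```python
"""Sketch: LF-Tag based access control (tag keys, values, and tables are hypothetical)."""

# Tags attached to catalog tables (in AWS this is done with add_lf_tags_to_resource).
TABLE_TAGS = {
    "sales_db.orders": {"domain": "sales", "sensitivity": "internal"},
    "sales_db.customers": {"domain": "sales", "sensitivity": "pii"},
    "hr_db.salaries": {"domain": "hr", "sensitivity": "pii"},
}

# One grant per principal: clauses are ANDed, values inside a clause are ORed.
ANALYST_EXPRESSION = [
    {"TagKey": "domain", "TagValues": ["sales"]},
    {"TagKey": "sensitivity", "TagValues": ["public", "internal"]},
]


def matches(tags, expression):
    """Return True when the resource's tags satisfy every clause of the expression."""
    return all(tags.get(clause["TagKey"]) in clause["TagValues"] for clause in expression)


def visible_tables(table_tags, expression):
    """List the tables a principal holding this tag grant would be able to see."""
    return sorted(name for name, tags in table_tags.items() if matches(tags, expression))


def grant_request(role_arn, expression):
    """Build keyword arguments for boto3's lakeformation grant_permissions call."""
    return {
        "Principal": {"DataLakePrincipalIdentifier": role_arn},
        "Resource": {"LFTagPolicy": {"ResourceType": "TABLE", "Expression": expression}},
        "Permissions": ["SELECT"],
    }


if __name__ == "__main__":
    print(visible_tables(TABLE_TAGS, ANALYST_EXPRESSION))  # ['sales_db.orders']
    # To apply for real (needs boto3 and AWS credentials):
    #   import boto3
    #   boto3.client("lakeformation").grant_permissions(**grant_request(role_arn, ANALYST_EXPRESSION))
```

Adding a fourth table tagged domain=sales and sensitivity=internal would make it visible to the analyst without touching the grant, which is the scaling benefit described above.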
Permission Types - SUPER, SELECT, Column, and Row Filters
Lake Formation has a hierarchy of permission types:
| Permission | Level | What It Allows |
|---|---|---|
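| CREATE_DATABASE | Catalog | Create databases in the catalog |
| CREATE_TABLE | Database | Create tables in a database |
| ALTER, DROP | Database or table | Modify or delete catalog metadata |
| SELECT | Table or column subset | Query data - can be scoped to specific columns |
| SUPER (ALL in the API) | Database or table | Every permission that applies to the resource |
| DATA_LOCATION_ACCESS | Registered S3 path | Required to create or alter tables that point at that location, for example from ETL jobs |
| Row filter (data filter) | Table | Restrict visible rows with a filter expression, granted per principal |
| Column filter (data filter) | Column | Include or exclude columns for specific principals; excluded columns are hidden rather than masked |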
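As a rough illustration of the row and column filter rows in the table, the sketch below builds the request bodies for boto3's `create_data_cells_filter` and the matching `grant_permissions` call. The `region` column, the filter name, and the account ID are assumptions made for the example; the functions only build dictionaries and never call AWS.

```python
"""Sketch: a data filter combining a row filter with excluded columns."""


def data_cells_filter_request(account_id, region_value):
    """Build kwargs for lakeformation.create_data_cells_filter (hypothetical table layout)."""
    if not region_value.isalnum():
        # Keep the example safe from quoting problems in the filter expression.
        raise ValueError("region_value must be alphanumeric")
    return {
        "TableData": {
            "TableCatalogId": account_id,
            "DatabaseName": "sales_db",
            "TableName": "orders",
            "Name": f"orders_{region_value.lower()}_no_pii",
            # Row-level security: only rows for one region are visible.
            "RowFilter": {"FilterExpression": f"region = '{region_value}'"},
            # Column-level security: all columns except the sensitive ones.
            "ColumnWildcard": {"ExcludedColumnNames": ["email", "credit_card_number"]},
        }
    }


def filter_grant_request(role_arn, filter_request):
    """Build kwargs for grant_permissions that grants SELECT through the data filter."""
    table_data = filter_request["TableData"]
    return {
        "Principal": {"DataLakePrincipalIdentifier": role_arn},
        "Resource": {
            "DataCellsFilter": {
                "TableCatalogId": table_data["TableCatalogId"],
                "DatabaseName": table_data["DatabaseName"],
                "TableName": table_data["TableName"],
                "Name": table_data["Name"],
            }
        },
        "Permissions": ["SELECT"],
    }


if __name__ == "__main__":
    request = data_cells_filter_request("123456789012", "EMEA")
    print(request["TableData"]["RowFilter"]["FilterExpression"])
    print(filter_grant_request("arn:aws:iam::123456789012:role/AnalystRole", request))
```

Granting SELECT on the filter rather than on the table means the same table can expose different rows and columns to different principals.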
```bash
# Grant column-level SELECT to an IAM role.
# email and credit_card_number are left out of ColumnNames, so the role cannot read them.
aws lakeformation grant-permissions \
  --principal DataLakePrincipalIdentifier=arn:aws:iam::123456789012:role/AnalystRole \
  --permissions SELECT \
  --resource '{
    "TableWithColumns": {
      "DatabaseName": "sales_db",
      "Name": "orders",
      "ColumnNames": ["order_id", "product", "quantity", "created_at"]
    }
  }'
```

Lake Formation permissions work in addition to IAM - both must allow the action. A common mistake is granting Lake Formation SELECT but forgetting that the IAM role also needs lakeformation:GetDataAccess, plus the Glue and Athena API actions it calls. The role does not need s3:GetObject on the registered location: Lake Formation vends temporary credentials scoped to the granted data, using the role the location was registered with, so access does not depend on the principal's own S3 permissions.
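A minimal sketch of the IAM side for the analyst role is shown below, assuming the data location is registered with Lake Formation. The statement IDs and the helper name are illustrative. Athena actions and access to the Athena query-results bucket are deliberately left out and would be needed in practice.

```python
"""Sketch: IAM policy for a principal that reads data through Lake Formation."""
import json


def analyst_iam_policy():
    """Return an IAM policy document as a dict (hypothetical statement IDs)."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # Lets integrated services request temporary, scoped credentials.
                "Sid": "LakeFormationCredentialVending",
                "Effect": "Allow",
                "Action": ["lakeformation:GetDataAccess"],
                "Resource": "*",
            },
            {
                # Metadata lookups; Lake Formation still filters what is returned.
                "Sid": "CatalogMetadataRead",
                "Effect": "Allow",
                "Action": [
                    "glue:GetDatabase",
                    "glue:GetDatabases",
                    "glue:GetTable",
                    "glue:GetTables",
                    "glue:GetPartitions",
                ],
                "Resource": "*",
            },
            # No s3:GetObject on the data bucket: reads use vended credentials.
        ],
    }


def all_actions(policy):
    """Flatten every action named in the policy."""
    return [action for stmt in policy["Statement"] for action in stmt["Action"]]


if __name__ == "__main__":
    print(json.dumps(analyst_iam_policy(), indent=2))
```

Keeping direct S3 permissions out of the policy is what forces every read through Lake Formation's column and row controls.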
Cross-Account Data Sharing with Lake Formation
Lake Formation supports sharing Glue Catalog databases and tables with other AWS accounts or AWS Organizations without copying data. The data stays in the producer account's S3; the consumer account queries it via their own Athena or EMR.
| Step | Producer Account | Consumer Account |
|---|---|---|
| 1 | Register S3 location with Lake Formation | - |
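| 2 | Grant Lake Formation permissions on the database/table to the consumer account; this creates an AWS RAM resource share | - |
| 3 | - | Accept the RAM resource share (not needed when both accounts are in the same organization with RAM sharing enabled) |
| 4 | - | Create resource link to shared database in own catalog |
| 5 | - | Grant local principals access, then query via Athena using the resource link |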
Cross-account Lake Formation sharing is a key alternative to copying datasets between accounts. The consumer account sees only the columns and rows they have been granted access to - the fine-grained controls transfer across account boundaries.
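The producer grant and the consumer resource link from the steps above can be sketched as boto3 request bodies. The account IDs, the link name, and the local database name are placeholders, and this example links a single shared table rather than a whole database. The dictionaries are shaped for `lakeformation.grant_permissions` and `glue.create_table`; nothing is sent to AWS.

```python
"""Sketch: cross-account sharing requests (all identifiers are placeholders)."""

PRODUCER_ACCOUNT = "111111111111"
CONSUMER_ACCOUNT = "222222222222"


def producer_grant_request(with_grant_option=True):
    """Producer side: share sales_db.orders with the consumer account."""
    request = {
        "Principal": {"DataLakePrincipalIdentifier": CONSUMER_ACCOUNT},
        "Resource": {
            "Table": {
                "CatalogId": PRODUCER_ACCOUNT,
                "DatabaseName": "sales_db",
                "Name": "orders",
            }
        },
        "Permissions": ["SELECT", "DESCRIBE"],
    }
    if with_grant_option:
        # Lets the consumer's data lake administrator pass access on to local principals.
        request["PermissionsWithGrantOption"] = ["SELECT", "DESCRIBE"]
    return request


def resource_link_request(local_database="shared_views", link_name="orders_link"):
    """Consumer side: kwargs for glue.create_table that creates a resource link."""
    return {
        "DatabaseName": local_database,
        "TableInput": {
            "Name": link_name,
            # The link holds no data or schema; it points at the producer's table.
            "TargetTable": {
                "CatalogId": PRODUCER_ACCOUNT,
                "DatabaseName": "sales_db",
                "Name": "orders",
            },
        },
    }


if __name__ == "__main__":
    print(producer_grant_request())
    print(resource_link_request())
```

Because the link only references the producer's catalog entry, queries in the consumer account read the original S3 objects and no copy is made.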
Interview Focus Points
1. What problem does Lake Formation solve that plain S3 bucket policies and IAM cannot?
2. Explain column-level and row-level security in Lake Formation - give a real-world use case.
3. How does cross-account data sharing work in Lake Formation - does the data get copied?
4. What are LF-Tags and why are they better than named resource grants for large data lakes?
5. How does Lake Formation interact with Athena and EMR - does it replace IAM permissions or add to them?
6. What is the DATA_LOCATION_ACCESS permission and when is it required?
7. How does Lake Formation audit data access, and how would you integrate this with a compliance workflow?