AWS Analytics & Big Data
MSK
Fully managed Apache Kafka for real-time event streaming pipelines
Amazon MSK (Managed Streaming for Apache Kafka) is a fully managed service that runs Apache Kafka on AWS, so you do not have to install, configure, or patch Kafka brokers, ZooKeeper, or Kafka Raft (KRaft) controller nodes yourself. MSK handles broker provisioning and replacement, storage growth, security patching, and multi-AZ replication, while giving you full access to the native Kafka producer/consumer APIs. It is a strong choice when you need Kafka protocol compatibility or very high throughput, or when you are migrating an existing Kafka workload to AWS.
MSK Architecture - Brokers, ZooKeeper, and KRaft
An MSK cluster consists of Kafka brokers spread across multiple Availability Zones for high availability. MSK also runs the metadata quorum (ZooKeeper or KRaft) for you; you never operate those nodes directly.
| Component | MSK Managed? | Your Responsibility |
|---|---|---|
| Kafka broker EC2 instances | Yes - provisioned, patched | Choose instance type and count |
| ZooKeeper / KRaft quorum | Yes - fully managed | None |
| Broker storage (EBS) | Yes - auto-expand available | Set initial size; enable auto-expand |
| Kafka topics and partitions | No | You create, size, and manage topics |
| Producer/consumer clients | No | Your application code |
| MSK Connect (Kafka Connect) | Yes - managed connectors | Configure connector workers |
MSK Serverless removes capacity planning: there is no broker count or instance type to choose. You create a cluster and start producing, and capacity scales with traffic. Pricing is usage-based (cluster-hours, partition-hours, storage, and data in and out), which suits variable or unpredictable workloads. Provisioned MSK is better for steady high-throughput workloads where you can right-size brokers.
MSK vs Kinesis Data Streams - Detailed Comparison
MSK and Kinesis Data Streams solve similar problems. The choice depends on throughput, existing ecosystem, and operational preference.
| Dimension | MSK (Kafka) | Kinesis Data Streams |
|---|---|---|
| Protocol | Native Kafka API | AWS proprietary API |
| Throughput limit | No hard limit - add brokers/partitions | 1 MB/s write and 2 MB/s read per shard |
| Partition scaling | Add partitions to existing topics anytime | Shard split (minutes delay) |
| Consumer groups | Yes - Kafka consumer groups with offset commits | KCL checkpointing in DynamoDB |
| Retention | Configurable per topic (hours to unlimited with tiered storage) | 24h default, up to 365 days |
| Message ordering | Per partition | Per shard |
| Ecosystem | Kafka Connect, Kafka Streams, Flink, Spark | Lambda, Kinesis Data Analytics (KDA, now Amazon Managed Service for Apache Flink); smaller ecosystem |
| Operational burden | Medium - you manage topics, retention, ACLs | Low - fewer knobs |
| Cost at low volume | Higher - broker minimum ~$0.21/hr per broker | Lower - pay per shard-hour at $0.015/hr |
Choose MSK when: you need Kafka protocol compatibility, you are migrating from on-premises Kafka, you need Kafka Streams or Kafka Connect, or you need very high throughput (100 MB/s+). Choose Kinesis when: you are building a new AWS-native pipeline, throughput is moderate, and you want minimal operational overhead.
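The throughput and cost rows of the table can be turned into a rough back-of-the-envelope check. The sketch below is illustrative only: it uses the per-shard write limit and the hourly prices quoted in the table, ignores per-GB, storage, and request charges, and the function names and the three-broker cluster size are assumptions made for the example.

```python
import math

# Figures taken from the comparison table; real prices vary by region and instance type.
KINESIS_WRITE_MB_PER_SHARD = 1.0   # MB/s of ingest per shard
KINESIS_SHARD_HOUR_USD = 0.015     # price per shard-hour
MSK_BROKER_HOUR_USD = 0.21         # approximate minimum price per broker-hour


def kinesis_shards_needed(write_mb_per_s: float) -> int:
    """Smallest shard count whose combined write limit covers the ingest rate."""
    return max(1, math.ceil(write_mb_per_s / KINESIS_WRITE_MB_PER_SHARD))


def hourly_base_cost(write_mb_per_s: float, msk_brokers: int = 3) -> dict:
    """Compare shard-hour cost with broker-hour cost.

    msk_brokers=3 is a hypothetical small cluster (one broker per AZ);
    it is not sized for the requested throughput.
    """
    shards = kinesis_shards_needed(write_mb_per_s)
    return {
        "kinesis_shards": shards,
        "kinesis_usd_per_hour": round(shards * KINESIS_SHARD_HOUR_USD, 3),
        "msk_usd_per_hour": round(msk_brokers * MSK_BROKER_HOUR_USD, 3),
    }


if __name__ == "__main__":
    for rate in (2, 50, 100):
        print(rate, "MB/s ->", hourly_base_cost(rate))
```

At a few MB/s the shard-hour cost sits far below the broker floor, which matches the "cost at low volume" row. The gap narrows as the shard count grows, although a real comparison would also have to size the brokers for the target throughput.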
MSK Security - Encryption, Authentication, and Authorization
MSK supports multiple security layers that can be combined:
| Security Layer | Options | Notes |
|---|---|---|
| Encryption in transit | TLS (enforced or optional) | Enable TLS for all production clusters |
| Encryption at rest | AWS KMS (default or CMK) | Enabled by default |
| Client authentication | IAM, SASL/SCRAM, mTLS | IAM preferred for AWS-native clients |
| Authorization (ACLs) | Kafka ACLs or IAM policies | Kafka ACLs for per-topic control |
| Network access | VPC with security groups | MSK never exposes public endpoints by default |
MSK IAM authentication requires the AWS MSK IAM auth library in your producer/consumer clients. Standard Kafka CLI tools work with it only after you add that library to their classpath and supply a client properties file. For ops tools and migration tasks, SASL/SCRAM is often simpler. Many teams use IAM for application clients and SASL/SCRAM for Kafka Connect workers.
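To make the difference between the two mechanisms concrete, here is a small sketch that renders the client properties a Java-based Kafka client or CLI tool would typically need for each. The property names and class names are the ones used by the AWS MSK IAM auth library and by Kafka's SCRAM login module; the helper names (`scram_props`, `render_properties`) and the sample credentials are hypothetical, and real credentials should come from a secret store rather than code.

```python
# Properties for IAM authentication (requires the MSK IAM auth jar on the client classpath).
IAM_PROPS = {
    "security.protocol": "SASL_SSL",
    "sasl.mechanism": "AWS_MSK_IAM",
    "sasl.jaas.config": "software.amazon.msk.auth.iam.IAMLoginModule required;",
    "sasl.client.callback.handler.class":
        "software.amazon.msk.auth.iam.IAMClientCallbackHandler",
}


def scram_props(username: str, password: str) -> dict:
    """Properties for SASL/SCRAM; username and password are placeholders."""
    jaas = (
        "org.apache.kafka.common.security.scram.ScramLoginModule required "
        f'username="{username}" password="{password}";'
    )
    return {
        "security.protocol": "SASL_SSL",
        "sasl.mechanism": "SCRAM-SHA-512",
        "sasl.jaas.config": jaas,
    }


def render_properties(props: dict) -> str:
    """Render a dict as the text of a Java-style .properties file."""
    return "\n".join(f"{key}={value}" for key, value in props.items()) + "\n"


if __name__ == "__main__":
    print(render_properties(IAM_PROPS))
    print(render_properties(scram_props("example-user", "example-password")))
```

The rendered text can be saved as, say, `client.properties` and passed to tools such as `kafka-topics.sh` through their `--command-config` option. The SCRAM variant needs no extra jar, which is one reason it is often the easier route for ad hoc tooling.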
MSK Connect - Managed Kafka Connect Workers
MSK Connect is a managed Kafka Connect runtime. Instead of running Kafka Connect workers on EC2, you deploy connectors as MSK Connect workers that auto-scale.
| Connector Type | Use Case | Example |
|---|---|---|
| Source connector | Pull data from external systems into Kafka | Debezium CDC from RDS, S3 source connector |
| Sink connector | Push data from Kafka to a destination | S3 Sink (write to data lake), OpenSearch Sink |
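```bash
# Create an MSK Connect connector (S3 Sink).
# Values shown as ... or {...} are elided placeholders; supply your own.
# IAM and TLS follow the security recommendations above; adjust to your cluster.
aws kafkaconnect create-connector \
  --connector-name "s3-sink" \
  --kafkaconnect-version "..." \
  --kafka-cluster 'apacheKafkaCluster={bootstrapServers=...,vpc={...}}' \
  --kafka-cluster-client-authentication authenticationType=IAM \
  --kafka-cluster-encryption-in-transit encryptionType=TLS \
  --plugins 'customPlugin={...}' \
  --service-execution-role-arn "arn:aws:iam::..." \
  --connector-configuration \
"connector.class=io.confluent.connect.s3.S3SinkConnector,\
tasks.max=4,\
topics=user-events,\
s3.region=us-east-1,\
s3.bucket.name=my-data-lake,\
flush.size=1000,\
storage.class=io.confluent.connect.s3.storage.S3Storage,\
format.class=io.confluent.connect.s3.format.parquet.ParquetFormat" \
  --capacity 'autoScaling={...}'
```

Debezium running on MSK Connect is one of the most popular patterns for Change Data Capture (CDC): it reads database transaction logs (PostgreSQL WAL, MySQL binlog) and produces a Kafka event for every row change. This powers real-time data lake synchronization without polling.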
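To show what a consumer of those CDC events does with them, the sketch below applies Debezium-style change events to an in-memory copy of a table. It assumes the usual Debezium envelope with `op`, `before`, and `after` fields, JSON-encoded values, and a primary key column named `id`; the function name and the sample rows are hypothetical.

```python
import json


def apply_change_event(table: dict, raw_event, key_field: str = "id") -> None:
    """Apply one Debezium-style change event to an in-memory copy of a table."""
    if raw_event is None:
        return  # tombstone record that can follow a delete; nothing to apply
    event = json.loads(raw_event)
    # When the JSON converter embeds schemas, the envelope sits under "payload".
    envelope = event.get("payload", event)
    op = envelope["op"]  # c=create, u=update, d=delete, r=snapshot read
    if op in ("c", "u", "r"):
        row = envelope["after"]
        table[row[key_field]] = row
    elif op == "d":
        table.pop(envelope["before"][key_field], None)
    # Other operation types (for example truncate) are ignored in this sketch.


if __name__ == "__main__":
    users = {}
    events = [
        {"op": "c", "before": None, "after": {"id": 1, "email": "a@example.com"}},
        {"op": "u", "before": {"id": 1, "email": "a@example.com"},
         "after": {"id": 1, "email": "b@example.com"}},
        {"op": "d", "before": {"id": 1, "email": "b@example.com"}, "after": None},
    ]
    for event in events:
        apply_change_event(users, json.dumps(event))
        print(event["op"], users)
```

Because Kafka preserves order per partition and Debezium keys events by primary key, replaying a topic from the beginning in this way rebuilds the current state of each row.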
MSK Tiered Storage and Cost Optimization
MSK Tiered Storage automatically offloads older log segments to S3, reducing broker EBS storage costs significantly while maintaining consumer access to historical data.
| Storage Tier | Location | Cost | Latency |
|---|---|---|---|
| Local (hot) | Broker EBS volumes | $0.10-0.16/GB-month | Very low (sub-millisecond to a few milliseconds) |
| Tiered (S3) | S3 Standard | $0.023/GB-month | Milliseconds (on first read) |
Enable tiered storage when topic retention is longer than a few days. The broker EBS cost dominates MSK spend at high retention - tiered storage can reduce storage costs by 80%+ for topics with weeks or months of retention. Consumers do not need to change - they use the same offset-based API.
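The savings claim can be checked with simple arithmetic. The sketch below uses the per-GB prices from the table and makes two loudly stated assumptions: every replica of local data occupies EBS (replication factor 3), while tiered data is billed as a single copy. It also ignores retrieval charges. The function name and the example workload are hypothetical. On the Kafka side, tiering is controlled per topic through settings such as `remote.storage.enable` and `local.retention.ms`.

```python
EBS_USD_PER_GB_MONTH = 0.10      # low end of the local-tier range in the table
TIERED_USD_PER_GB_MONTH = 0.023  # tiered-storage figure from the table


def monthly_storage_cost(ingest_gb_per_day: float,
                         retention_days: int,
                         local_retention_days=None,
                         replication_factor: int = 3) -> float:
    """Estimated monthly storage cost in USD.

    local_retention_days=None (or >= retention_days) means no tiering:
    all retained data stays on broker EBS volumes.
    """
    if local_retention_days is None or local_retention_days >= retention_days:
        local_days, tiered_days = retention_days, 0
    else:
        local_days = local_retention_days
        tiered_days = retention_days - local_retention_days
    # Assumption: each replica of local data consumes EBS space.
    local_gb = ingest_gb_per_day * local_days * replication_factor
    # Assumption: tiered data is billed once, not per replica.
    tiered_gb = ingest_gb_per_day * tiered_days
    return local_gb * EBS_USD_PER_GB_MONTH + tiered_gb * TIERED_USD_PER_GB_MONTH


if __name__ == "__main__":
    all_local = monthly_storage_cost(100, 30)
    tiered = monthly_storage_cost(100, 30, local_retention_days=2)
    print(f"all local: ${all_local:.2f}/month")
    print(f"tiered:    ${tiered:.2f}/month")
    print(f"savings:   {1 - tiered / all_local:.0%}")
```

For 100 GB/day with 30 days of retention and 2 days kept locally, this model gives roughly $900 versus $124 per month, a reduction of about 86%, which is in line with the 80%+ figure above. Shorter retention or a longer local window shrinks the benefit.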
Interview Focus Points
1. When would you choose MSK over Kinesis Data Streams for a new event streaming pipeline on AWS?
2. Explain MSK Serverless - how does it differ from provisioned MSK and what workloads suit it?
3. What is Change Data Capture (CDC) and how would you implement it using MSK and Debezium?
4. How does MSK handle high availability across Availability Zones?
5. Compare IAM authentication vs SASL/SCRAM for MSK client authentication - when do you use each?
6. What is MSK Tiered Storage and how does it affect producer/consumer behavior?
7. How do Kafka ACLs work in MSK and how do you grant per-topic access to specific consumers?
8. Walk me through sizing an MSK cluster for a workload that produces 500 MB/s peak throughput.