MSK

Fully managed Apache Kafka for real-time event streaming pipelines

Amazon MSK (Managed Streaming for Apache Kafka) is a fully managed service that runs Apache Kafka on AWS without requiring you to install, operate, or patch Kafka brokers, ZooKeeper, or Kafka Raft (KRaft) nodes. MSK handles broker provisioning, security patching, storage expansion, and multi-AZ replication, while giving you full access to the native Kafka producer/consumer APIs. It is a strong choice when you need Kafka protocol compatibility or very high throughput, or when you are migrating an existing Kafka workload to AWS.

MSK Architecture - Brokers, ZooKeeper, and KRaft

An MSK cluster consists of Kafka brokers spread across multiple Availability Zones for high availability. MSK runs the metadata quorum (ZooKeeper or KRaft) for you, so you never provision or patch those nodes.

Component	MSK Managed?	Your Responsibility
Kafka broker EC2 instances	Yes - provisioned, patched	Choose instance type and count
ZooKeeper / KRaft quorum	Yes - fully managed	None
Broker storage (EBS)	Yes - auto-expand available	Set initial size; enable auto-expand
Kafka topics and partitions	No	You create, size, and manage topics
Producer/consumer clients	No	Your application code
MSK Connect (Kafka Connect)	Yes - managed connectors	Configure connector workers
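
As a rough illustration of the "set initial size" responsibility, the sketch below estimates per-broker EBS storage from ingress rate, retention, and replication factor. The function name broker_storage_gib, the replication factor of 3, and the example workload are assumptions for illustration, not MSK defaults or AWS sizing guidance, and a real estimate should leave headroom on top.

```bash
#!/usr/bin/env bash
# Estimate the storage each broker needs, in GiB, rounded up.
# Back-of-envelope rule (an assumption, not an official MSK formula):
#   ingress MB/s * 3600 s * retention hours * replication factor / 1024 / brokers
broker_storage_gib() {
  local mb_per_sec=$1 retention_hours=$2 replication=$3 brokers=$4
  awk -v t="$mb_per_sec" -v h="$retention_hours" -v r="$replication" -v b="$brokers" \
    'BEGIN { v = t * 3600 * h * r / 1024 / b; c = int(v); if (c < v) c++; print c }'
}

# Assumed workload: 50 MB/s ingress, 24 h retention, replication factor 3, 6 brokers.
broker_storage_gib 50 24 3 6   # prints 2110
```

With storage auto-expand enabled, an estimate like this only needs to be a sensible starting point rather than a hard ceiling.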

💡

MSK Serverless is a hands-off mode - you choose no broker count and no instance type. You create a cluster and start producing. MSK Serverless is priced per cluster-hour, per partition-hour, and per GB of data written, read, and stored, making it a good fit for variable or unpredictable workloads. Provisioned MSK is better for steady high-throughput workloads where you can right-size brokers.

MSK vs Kinesis Data Streams - Detailed Comparison

MSK and Kinesis Data Streams solve similar problems. The choice depends on throughput, existing ecosystem, and operational preference.

Dimension	MSK (Kafka)	Kinesis Data Streams
Protocol	Native Kafka API	AWS proprietary API
Throughput limit	No hard limit - add brokers/partitions	1 MB/s write and 2 MB/s read per shard - add shards to scale
Partition scaling	Add partitions to existing topics anytime	Shard split (minutes delay)
Consumer groups	Yes - Kafka consumer groups with offset commits	KCL checkpointing in DynamoDB
Retention	Configurable per topic (hours to unlimited with tiered storage)	24h default, up to 365 days
Message ordering	Per partition	Per shard
Ecosystem	Kafka Connect, Kafka Streams, Flink, Spark	Lambda, Kinesis Data Analytics (KDA, now Amazon Managed Service for Apache Flink); smaller ecosystem
Operational burden	Medium - you manage topics, retention, ACLs	Low - fewer knobs
Cost at low volume	Higher - for example ~$0.21/hr per kafka.m5.large broker, with at least one broker per AZ	Lower - pay per shard-hour at $0.015/hr

💡

Choose MSK when: you need Kafka protocol compatibility, you are migrating from on-premises Kafka, you need Kafka Streams or Kafka Connect, or you need very high throughput (100 MB/s+). Choose Kinesis when: you are building a new AWS-native pipeline, throughput is moderate, and you want minimal operational overhead.
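
To make the low-volume cost row concrete, here is a small sketch that uses the 1 MB/s per-shard write limit and the hourly prices quoted in the table. The function names, the 2.5 MB/s workload, and the three-broker cluster are assumptions for illustration; the result ignores storage, data transfer, and Kinesis PUT charges.

```bash
#!/usr/bin/env bash
# Shards needed for a given write throughput, at 1 MB/s of writes per shard.
# Rounds up to a whole shard, with a minimum of one.
shards_needed() {
  local mb_per_sec=$1
  awk -v t="$mb_per_sec" 'BEGIN { s = int(t); if (s < t) s++; if (s < 1) s = 1; print s }'
}

# Hourly cost of N units (shards or brokers) at a given hourly price.
hourly_cost() {
  local units=$1 price=$2
  awk -v n="$units" -v p="$price" 'BEGIN { printf "%.3f\n", n * p }'
}

shards=$(shards_needed 2.5)     # 3 shards for an assumed 2.5 MB/s of writes
hourly_cost "$shards" 0.015     # Kinesis: prints 0.045
hourly_cost 3 0.21              # MSK, assumed 3 brokers: prints 0.630
```

At this assumed volume the broker floor dominates; the gap narrows as throughput grows, because shard count scales linearly with write rate while a fixed broker fleet absorbs more traffic.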

MSK Security - Encryption, Authentication, and Authorization

MSK supports multiple security layers that can be combined:

Security Layer	Options	Notes
Encryption in transit	TLS (enforced or optional)	Enable TLS for all production clusters
Encryption at rest	AWS KMS (default or CMK)	Enabled by default
Client authentication	IAM, SASL/SCRAM, mTLS	IAM preferred for AWS-native clients
Authorization (ACLs)	Kafka ACLs or IAM policies	IAM policies with IAM auth; Kafka ACLs with SASL/SCRAM or mTLS - both allow per-topic control
Network access	VPC with security groups	MSK never exposes public endpoints by default

⚠️

MSK IAM authentication requires an IAM-aware SASL mechanism in your producer/consumer clients (for Java, the AWS MSK IAM auth library). Standard Kafka CLI tools cannot use it without additional configuration. For ops tools and migration tasks, SASL/SCRAM is often simpler. Many teams use IAM for application clients and SASL/SCRAM for third-party tools or self-managed Kafka Connect workers; MSK Connect itself supports only IAM or unauthenticated connections to the cluster.
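
For the Kafka CLI tools, the additional configuration is typically a client properties file like the one sketched below. It assumes the aws-msk-iam-auth JAR is already on the tools' classpath; the file name client.properties and the BOOTSTRAP variable are placeholders, not values from this guide.

```bash
#!/usr/bin/env bash
# Write a client properties file that points the Kafka CLI tools at the
# IAM SASL mechanism. Property names follow the aws-msk-iam-auth library.
cat > client.properties <<'EOF'
security.protocol=SASL_SSL
sasl.mechanism=AWS_MSK_IAM
sasl.jaas.config=software.amazon.msk.auth.iam.IAMLoginModule required;
sasl.client.callback.handler.class=software.amazon.msk.auth.iam.IAMClientCallbackHandler
EOF

# Hypothetical usage (BOOTSTRAP is a placeholder for the cluster's IAM bootstrap string):
# kafka-topics.sh --bootstrap-server "$BOOTSTRAP" --command-config client.properties --list
```

The credentials themselves come from the usual AWS credential chain, so the file contains no secrets.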

MSK Connect - Managed Kafka Connect Workers

MSK Connect is a managed Kafka Connect runtime. Instead of running Kafka Connect workers on EC2 yourself, you deploy connectors onto MSK Connect workers, which can auto-scale between a minimum and maximum worker count.

Connector Type	Use Case	Example
Source connector	Pull data from external systems into Kafka	Debezium CDC from RDS, S3 source connector
Sink connector	Push data from Kafka to a destination	S3 Sink (write to data lake), OpenSearch Sink

bash

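# Create an MSK Connect connector (S3 Sink).
# Values written as ... or {...} are elided; fill them in for your cluster.
# IAM auth and TLS are assumptions; MSK Connect also accepts NONE and PLAINTEXT.
aws kafkaconnect create-connector \
  --connector-name "s3-sink" \
  --kafka-connect-version "..." \
  --plugins "customPlugin={customPluginArn=...,revision=...}" \
  --service-execution-role-arn "arn:aws:iam::..." \
  --kafka-cluster "apacheKafkaCluster={bootstrapServers=...,vpc={...}}" \
  --kafka-cluster-client-authentication "authenticationType=IAM" \
  --kafka-cluster-encryption-in-transit "encryptionType=TLS" \
  --connector-configuration \
    "connector.class=io.confluent.connect.s3.S3SinkConnector,\
tasks.max=4,\
topics=user-events,\
s3.region=us-east-1,\
s3.bucket.name=my-data-lake,\
flush.size=1000,\
storage.class=io.confluent.connect.s3.storage.S3Storage,\
format.class=io.confluent.connect.s3.format.parquet.ParquetFormat" \
  --capacity "autoScaling={...}"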

💡

Debezium running on MSK Connect is one of the most popular patterns for Change Data Capture (CDC) - it reads database transaction logs (PostgreSQL WAL, MySQL binlog) and produces Kafka events for every row change. This powers real-time data lake synchronization without polling.
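
A minimal sketch of what such a CDC connector's configuration might look like for PostgreSQL is shown below. The property keys are Debezium PostgreSQL connector settings (topic.prefix is the Debezium 2.x name); the file name, the appdb prefix, the public.orders table, and the angle-bracket values are hypothetical placeholders.

```bash
#!/usr/bin/env bash
# Write a Debezium PostgreSQL source connector configuration.
# The same key=value pairs would be passed as the MSK Connect connector configuration.
cat > debezium-postgres.properties <<'EOF'
connector.class=io.debezium.connector.postgresql.PostgresConnector
tasks.max=1
plugin.name=pgoutput
database.hostname=<rds-endpoint>
database.port=5432
database.user=<replication-user>
database.password=<password>
database.dbname=<database>
topic.prefix=appdb
table.include.list=public.orders
EOF
```

With these assumed values, row changes on public.orders would land on a topic named appdb.public.orders. The connector runs a single task because it reads one replication stream.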

MSK Tiered Storage and Cost Optimization

MSK Tiered Storage automatically offloads older log segments to S3, reducing broker EBS storage costs significantly while maintaining consumer access to historical data.

Storage Tier	Location	Cost	Latency
Local (hot)	Broker EBS volumes	$0.10-0.16/GB-month	Microseconds
Tiered (S3)	S3 Standard	$0.023/GB-month	Milliseconds (on first read)

💡

Enable tiered storage when topic retention is longer than a few days. The broker EBS cost dominates MSK spend at high retention - tiered storage can reduce storage costs by 80%+ for topics with weeks or months of retention. Consumers do not need to change - they use the same offset-based API.
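
The effect on the storage bill can be sketched with the per-GB prices from the table above. The function name storage_cost and the 10,000 GB topic with 1,000 GB of hot data are assumptions for illustration; the calculation ignores replication and any retrieval charges.

```bash
#!/usr/bin/env bash
# Monthly storage cost in dollars for a hot/cold split.
storage_cost() {
  local hot_gb=$1 cold_gb=$2 hot_price=$3 cold_price=$4
  awk -v h="$hot_gb" -v c="$cold_gb" -v hp="$hot_price" -v cp="$cold_price" \
    'BEGIN { printf "%.2f\n", h * hp + c * cp }'
}

# Assumed topic: 10,000 GB retained, of which 1,000 GB is recent (hot) data.
storage_cost 10000 0 0.10 0.023     # everything on EBS: prints 1000.00
storage_cost 1000 9000 0.10 0.023   # tiered:            prints 307.00
```

In this assumed split the saving is about 69%; it grows as the hot share shrinks and as the EBS price moves toward the upper end of the quoted range.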

🎯

Interview Focus Points

1. When would you choose MSK over Kinesis Data Streams for a new event streaming pipeline on AWS?
2. Explain MSK Serverless - how does it differ from provisioned MSK and what workloads suit it?
3. What is Change Data Capture (CDC) and how would you implement it using MSK and Debezium?
4. How does MSK handle high availability across Availability Zones?
5. Compare IAM authentication vs SASL/SCRAM for MSK client authentication - when do you use each?
6. What is MSK Tiered Storage and how does it affect producer/consumer behavior?
7. How do Kafka ACLs work in MSK and how do you grant per-topic access to specific consumers?
8. Walk me through sizing an MSK cluster for a workload that produces 500 MB/s peak throughput.