CloudWatch

Collect and track metrics, logs, and traces; create alarms and automated responses

Amazon CloudWatch is the central observability service for AWS, collecting metrics, logs, and traces from virtually every AWS service and your own applications. It enables you to monitor infrastructure health, set alarms on thresholds, and trigger automated remediation - making it the foundation of any production operations strategy on AWS.

How CloudWatch Collects and Stores Observability Data

CloudWatch is organized around four core data types, each with its own ingestion path, storage model, and retention behavior:

Data Type	What It Is	Granularity	Default Retention
Metrics	Numeric time-series data points	1-minute (standard), 1-second (high-res)	15 months (aggregated over time)
Logs	Structured or unstructured text events	Sub-second ingestion	Configurable (never expires by default)
Traces	Distributed request flows (X-Ray)	Per-request spans	30 days
Events (EventBridge)	State-change notifications from services	Near real-time	Routed, not stored in CW

Metrics are the most critical concept. AWS services publish metrics automatically into namespaces like AWS/EC2 or AWS/RDS. You can also publish custom metrics using the PutMetricData API or the CloudWatch agent.
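
As a rough illustration of publishing a custom metric, the sketch below builds one entry in the shape the PutMetricData API accepts and hands it to boto3. The metric name, dimension, and namespace are hypothetical, and the boto3 call assumes the library is installed and AWS credentials are configured.

```python
from datetime import datetime, timezone


def build_metric_datum(name, value, unit="None", dimensions=None, high_resolution=False):
    """Build one MetricData entry in the shape PutMetricData expects."""
    return {
        "MetricName": name,
        "Dimensions": [{"Name": k, "Value": v} for k, v in (dimensions or {}).items()],
        "Timestamp": datetime.now(timezone.utc),
        "Value": float(value),
        "Unit": unit,
        # 1 = high-resolution (1-second), 60 = standard (1-minute)
        "StorageResolution": 1 if high_resolution else 60,
    }


def publish(namespace, data):
    """Send the data points to CloudWatch. Requires boto3 and AWS credentials."""
    import boto3  # imported lazily so the builder above works without it

    boto3.client("cloudwatch").put_metric_data(Namespace=namespace, MetricData=data)


# Hypothetical metric and dimension, for illustration only.
datum = build_metric_datum(
    "CheckoutLatency", 182.5, unit="Milliseconds", dimensions={"Service": "checkout"}
)
# publish("MyApp/Checkout", [datum])  # hypothetical namespace; uncomment to send
```

Custom namespaces must not start with AWS/, which is reserved for metrics published by AWS services.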

💡

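Standard-resolution metrics have 1-minute granularity, although EC2 basic monitoring publishes only every 5 minutes unless detailed monitoring is enabled. High-resolution custom metrics support 1-second granularity, but alarms on them are priced higher than standard alarms, and publishing more often means more PutMetricData calls. For most operational alarms, 1-minute is sufficient. Reserve high resolution for latency-sensitive workloads.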

CloudWatch aggregates metric data automatically as it ages: sub-minute (high-resolution) data points are available for 3 hours, 1-minute data points for 15 days, 5-minute data points for 63 days, and 1-hour data points for 455 days (15 months).
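
The retention tiers can be read as a lookup from data age to the finest period still queryable. The sketch below models that lookup. It assumes the 3-hour tier applies to sub-minute (high-resolution) data, and the function name and the published_period parameter are illustrative, not part of any AWS SDK.

```python
from datetime import timedelta

# (maximum age, finest period in seconds still retrievable at that age)
RETENTION_TIERS = [
    (timedelta(hours=3), 1),
    (timedelta(days=15), 60),
    (timedelta(days=63), 300),
    (timedelta(days=455), 3600),
]


def finest_available_period(age, published_period=60):
    """Return the smallest period (seconds) queryable for data of this age,
    or None if the data has aged out entirely."""
    for max_age, tier_period in RETENTION_TIERS:
        if age <= max_age:
            # Data is never finer than the resolution it was published at.
            return max(tier_period, published_period)
    return None
```

For example, finest_available_period(timedelta(days=30)) returns 300: a month-old data point can only be read back as a 5-minute aggregate, which matters when you investigate an incident weeks after the fact.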

Alarms, Composite Alarms, and Automated Actions

A CloudWatch Alarm watches a single metric or a math expression and transitions between three states: OK, ALARM, and INSUFFICIENT_DATA. When state changes occur, alarms can trigger SNS notifications, Auto Scaling actions, EC2 actions (reboot, stop, terminate, recover), or Systems Manager OpsItems.

Alarm Type	Description	Use Case
Static threshold	Triggers when metric crosses a fixed value	CPU > 80% for 5 minutes
Anomaly detection	ML-based band around expected value	Detect unusual traffic patterns without knowing exact thresholds
Metric math alarm	Alarm on calculated expression of multiple metrics	Error rate = errors/requests > 1%
Composite alarm	Combine multiple alarms with AND/OR logic	Reduce alert noise - only page if CPU high AND memory high

Composite alarms are underused but powerful. Their rule expression references the states of other alarms, so you can leave the underlying alarms without notification actions and page only when several conditions hold at the same time. This cuts false positives in production alerting.
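
To make the AND/OR logic concrete, here is a minimal sketch that builds a composite alarm rule expression and evaluates the same logic locally. The helper names and the high-memory alarm name are hypothetical. The ALARM("name") syntax is the form a composite alarm's rule expression takes.

```python
def composite_rule(*alarm_names, operator="AND"):
    """Build an AlarmRule expression such as: ALARM("a") AND ALARM("b")."""
    if operator not in ("AND", "OR"):
        raise ValueError("operator must be AND or OR")
    return f" {operator} ".join(f'ALARM("{name}")' for name in alarm_names)


def would_fire(states, *alarm_names, operator="AND"):
    """Evaluate the same rule locally against a dict of alarm name -> state."""
    in_alarm = [states.get(name) == "ALARM" for name in alarm_names]
    return all(in_alarm) if operator == "AND" else any(in_alarm)


rule = composite_rule("high-cpu-web-server", "high-memory-web-server")
print(rule)  # ALARM("high-cpu-web-server") AND ALARM("high-memory-web-server")
```

The resulting string is what you would pass as the alarm rule when creating the composite alarm, for example via aws cloudwatch put-composite-alarm --alarm-rule.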

⚠️

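An alarm that has never received data stays in INSUFFICIENT_DATA state, not ALARM. If you alarm on a metric that is not always published (like Lambda Errors while the function receives no invocations), set the TreatMissingData parameter to notBreaching (--treat-missing-data in the CLI, treat_missing_data in Terraform). Missing periods are then treated as within the threshold, so the alarm stays OK instead of flipping to INSUFFICIENT_DATA.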

bash

# Alarm when average CPU stays above 80% for two consecutive 5-minute periods
aws cloudwatch put-metric-alarm \
  --alarm-name "high-cpu-web-server" \
  --metric-name CPUUtilization \
  --namespace AWS/EC2 \
  --dimensions Name=InstanceId,Value=i-0123456789abcdef0 \
  --statistic Average \
  --period 300 \
  --threshold 80 \
  --comparison-operator GreaterThanThreshold \
  --evaluation-periods 2 \
  --treat-missing-data notBreaching \
  --alarm-actions arn:aws:sns:us-east-1:123456789012:ops-alerts
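
The interaction between evaluation periods and missing data is easier to see in a small model. The sketch below is a deliberately simplified stand-in, not CloudWatch's actual algorithm: it assumes a GreaterThanThreshold alarm in which every evaluation period must breach, and for the missing and ignore modes it simply judges the periods that did report data rather than reproducing CloudWatch's evaluation-range logic.

```python
MODES = ("missing", "notBreaching", "breaching", "ignore")


def evaluate_alarm(datapoints, threshold, treat_missing_data="missing", current_state="OK"):
    """Simplified alarm model. `datapoints` has one value per evaluation
    period, oldest first; None means the period reported no data."""
    if treat_missing_data not in MODES:
        raise ValueError(f"unknown mode: {treat_missing_data}")
    if not datapoints:
        raise ValueError("need at least one evaluation period")

    present = [v for v in datapoints if v is not None]
    if not present and treat_missing_data == "missing":
        return "INSUFFICIENT_DATA"
    if not present and treat_missing_data == "ignore":
        return current_state  # keep whatever state the alarm was in

    if treat_missing_data == "breaching":
        flags = [True if v is None else v > threshold for v in datapoints]
    elif treat_missing_data == "notBreaching":
        flags = [False if v is None else v > threshold for v in datapoints]
    else:
        # Simplification: only the periods that reported data are judged.
        flags = [v > threshold for v in present]
    return "ALARM" if all(flags) else "OK"
```

With the settings from the CLI example above (two periods, threshold 80, notBreaching), evaluate_alarm([None, None], 80, "notBreaching") returns "OK", whereas the default mode would return "INSUFFICIENT_DATA".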

CloudWatch Logs, Logs Insights, and the CW Agent

CloudWatch Logs stores log events in Log Groups, each containing multiple Log Streams. A Log Group typically corresponds to one application or service. A Log Stream corresponds to one source within it (one Lambda execution environment, one EC2 instance).

CloudWatch Logs Insights provides a query language to search and analyze log data. It is far more powerful than basic filter patterns: it can parse fields, compute statistics, and run a single query across multiple log groups.

bash

# Logs Insights: find the top 10 slowest Lambda invocations
fields @timestamp, @duration, @requestId
| filter @type = "REPORT"
| sort @duration desc
| limit 10

# Count errors per minute (run as a separate query)
filter @message like /ERROR/
| stats count(*) as error_count by bin(1m) as minute
| sort minute asc
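
For readers unfamiliar with bin(), the sketch below reproduces the per-minute error count locally on a list of events. It only illustrates what the aggregation does and is not how Logs Insights is implemented. The function name and input shape (epoch-millisecond timestamp plus message) are assumptions chosen for this example.

```python
import re
from collections import Counter
from datetime import datetime, timezone

ERROR_PATTERN = re.compile(r"ERROR")


def errors_per_minute(events):
    """Count events matching /ERROR/ per 1-minute bucket.

    `events` is an iterable of (epoch_milliseconds, message) pairs.
    Returns (ISO-8601 bucket start, count) tuples in ascending time order.
    """
    counts = Counter()
    for timestamp_ms, message in events:
        if ERROR_PATTERN.search(message):
            bucket_ms = timestamp_ms - (timestamp_ms % 60_000)  # floor to the minute
            counts[bucket_ms] += 1
    return [
        (datetime.fromtimestamp(ms / 1000, tz=timezone.utc).isoformat(), n)
        for ms, n in sorted(counts.items())
    ]
```

Each event is assigned to the start of its minute, the non-matching events are dropped, and the buckets come back oldest first, mirroring the filter, stats, and sort stages of the query.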

The CloudWatch Agent must be installed on EC2 instances (and on-premises servers) to collect system-level metrics like memory utilization, disk usage, and swap - these are not available from the hypervisor. The agent also ships log files to CloudWatch Logs.

Metric	Available Without Agent	Available With Agent
CPU Utilization	Yes (AWS/EC2)	Yes (CWAgent namespace)
Network In/Out	Yes (AWS/EC2)	Yes
Memory Utilization	No	Yes (CWAgent)
Disk Space Used %	No	Yes (CWAgent)
Disk I/O	Partial (instance store volumes only, e.g. DiskReadOps)	Full detail
Custom app log files	No	Yes
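
A minimal agent configuration covering the agent-only rows above might look like the following sketch for a Linux instance. It is generated from Python so it can be validated as JSON. The log file path and log group name are hypothetical placeholders, and the measurement names should be checked against the agent documentation for your OS.

```python
import json

agent_config = {
    "metrics": {
        "namespace": "CWAgent",
        "metrics_collected": {
            "mem": {"measurement": ["mem_used_percent"]},
            "disk": {"measurement": ["used_percent"], "resources": ["/"]},
            "swap": {"measurement": ["swap_used_percent"]},
        },
    },
    "logs": {
        "logs_collected": {
            "files": {
                "collect_list": [
                    {
                        "file_path": "/var/log/myapp/app.log",   # hypothetical path
                        "log_group_name": "/myapp/application",  # hypothetical name
                        "log_stream_name": "{instance_id}",      # one stream per instance
                    }
                ]
            }
        }
    },
}

config_json = json.dumps(agent_config, indent=2)
print(config_json)
```

Using {instance_id} as the stream name gives each EC2 instance its own Log Stream within the shared Log Group.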

⚠️

Memory and disk utilization are among the most common interview questions about CloudWatch. Interviewers frequently ask why you cannot see memory usage for EC2 without the agent. The answer: the hypervisor can observe CPU and network activity from outside the VM, but how much of its allocated memory the guest is actually using, and how full its file systems are, is known only to the operating system inside the instance.

Dashboards, Container Insights, and Application Insights

CloudWatch Dashboards provide customizable views across metrics and alarms. They can include graphs, number widgets, alarm status widgets, and Logs Insights query results. Dashboards can be shared publicly (read-only) or accessed via cross-account sharing.

Container Insights is a specialized feature for ECS and EKS that collects CPU, memory, network, and disk metrics at the cluster, service, task, and container level using a containerized version of the CloudWatch agent. It also includes pre-built dashboards.

Application Insights automatically discovers application components (EC2, RDS, ELB, etc.), establishes baselines, and creates alarms for anomalies. It's particularly useful for .NET and SQL Server workloads on EC2.

Feature	Best For	Extra Cost?
Standard dashboards	Custom operational views	Yes - $3/dashboard/month
Container Insights	ECS/EKS container-level metrics	Yes - pay per metric and log
Application Insights	Auto-configured monitoring for known app stacks	Yes - per resource monitored
Anomaly Detection	Adaptive alarms without fixed thresholds	Yes - per metric per month
Contributor Insights	Find top contributors in log data	Yes - per rule per million events

Pricing Model and Cost Optimization

CloudWatch pricing has multiple dimensions. Costs can grow unexpectedly if you're not careful about custom metrics, log volume, and API call frequency.

Component	Free Tier	Paid Rate (us-east-1)
Metrics (custom)	First 10 metrics free	$0.30/metric/month
API requests	First 1M/month free (excludes GetMetricData)	$0.01 per 1,000 requests; GetMetricData: $0.01 per 1,000 metrics requested
Alarms	First 10 alarms free	$0.10/alarm/month (standard)
Log data ingestion	5 GB/month free (shared across ingestion, storage, and Logs Insights scans)	$0.50/GB ingested
Log data storage	Included in the shared 5 GB	$0.03/GB/month
Logs Insights queries	Included in the shared 5 GB	$0.005/GB data scanned
Dashboards	First 3 free	$3/dashboard/month
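
To see how these dimensions add up, the sketch below estimates a monthly bill from the rates in the table. It is a back-of-the-envelope model under stated assumptions: it uses the first-tier us-east-1 rates shown above, applies the free allowances for metrics, alarms, and dashboards, and ignores the free tier for logs. Actual prices vary by region and change over time.

```python
from decimal import Decimal

# (rate in USD, free allowance) per item, copied from the table above.
RATES = {
    "custom_metrics": (Decimal("0.30"), 10),        # per metric-month
    "alarms": (Decimal("0.10"), 10),                # per standard alarm-month
    "dashboards": (Decimal("3.00"), 3),             # per dashboard-month
    "log_ingest_gb": (Decimal("0.50"), 0),          # per GB ingested
    "log_storage_gb": (Decimal("0.03"), 0),         # per GB-month stored
    "insights_scanned_gb": (Decimal("0.005"), 0),   # per GB scanned
}


def estimate_monthly_cost(**usage):
    """Return (total, per-item breakdown) in USD for the given monthly usage."""
    breakdown = {}
    for item, quantity in usage.items():
        rate, free = RATES[item]
        billable = max(Decimal(quantity) - free, Decimal(0))
        breakdown[item] = billable * rate
    return sum(breakdown.values(), Decimal(0)), breakdown


total, breakdown = estimate_monthly_cost(
    custom_metrics=110, alarms=60, dashboards=5, log_ingest_gb=200, log_storage_gb=1000
)
print(total)  # 171.00
```

In this hypothetical workload, log ingestion accounts for $100 of the $171 total, more than all metrics, alarms, and dashboards combined.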

💡

Log costs are often the largest CloudWatch expense. Set log retention policies on every Log Group, because the default is to never expire. For most debug logs, 7-30 days is sufficient. For compliance logs, deliver them to S3 with a subscription filter (through Firehose) or an export task, and let an S3 lifecycle rule transition them to S3 Glacier after 90 days.

bash

# Set retention policy on a log group (never let logs accumulate indefinitely)
aws logs put-retention-policy \
  --log-group-name /aws/lambda/my-function \
  --retention-in-days 30

# Find log groups with no retention policy set
# (single quotes keep an interactive shell from treating ! as history expansion)
aws logs describe-log-groups \
  --query 'logGroups[?!retentionInDays].logGroupName' \
  --output text
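
One practical wrinkle when automating retention: the API accepts only a fixed set of day values, so an arbitrary number such as 45 is rejected. The sketch below rounds a desired retention up to the next accepted value. The list reflects the API reference as best recalled at the time of writing and should be verified against current documentation. The helper name is illustrative.

```python
import bisect

# Values accepted for retention in days; verify against the current API reference.
ALLOWED_RETENTION_DAYS = [
    1, 3, 5, 7, 14, 30, 60, 90, 120, 150, 180, 365, 400, 545,
    731, 1096, 1827, 2192, 2557, 2922, 3288, 3653,
]


def nearest_allowed_retention(desired_days):
    """Round a desired retention up to the next value the API accepts."""
    if desired_days < 1:
        raise ValueError("retention must be at least 1 day")
    index = bisect.bisect_left(ALLOWED_RETENTION_DAYS, desired_days)
    if index == len(ALLOWED_RETENTION_DAYS):
        raise ValueError("longer than the maximum; leave the group set to never expire")
    return ALLOWED_RETENTION_DAYS[index]
```

Rounding up rather than down is a deliberate choice here: keeping logs slightly longer than requested is usually safer than deleting them early.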

🎯

Interview Focus Points

1. Why can't you see memory utilization for EC2 instances in CloudWatch by default, and how do you fix it?
2. What is the difference between a CloudWatch metric, a log, and a trace - when would you use each?
3. Explain composite alarms and give a real scenario where they reduce alert noise.
4. How would you detect an anomalous spike in API error rate without knowing the expected baseline?
5. What happens to a CloudWatch alarm when it has no data - what is treat_missing_data and why does it matter?
6. How do you control CloudWatch Logs costs in a high-throughput microservices architecture?
7. What is the difference between CloudWatch Logs Insights and CloudWatch Metrics Insights?
8. How would you set up a unified operational dashboard for a multi-tier application (ALB, EC2, RDS)?
9. Explain how CloudWatch integrates with Auto Scaling to scale EC2 instances based on a custom application metric.