AWS Monitoring & Management
CloudWatch
Collect and track metrics, logs, and traces; create alarms and automated responses
Amazon CloudWatch is the central observability service for AWS, collecting metrics, logs, and traces from virtually every AWS service and your own applications. It enables you to monitor infrastructure health, set alarms on thresholds, and trigger automated remediation - making it the foundation of any production operations strategy on AWS.
How CloudWatch Collects and Stores Observability Data
CloudWatch is organized around four core data types, each with its own ingestion path, storage model, and retention behavior:
| Data Type | What It Is | Granularity | Default Retention |
|---|---|---|---|
| Metrics | Numeric time-series data points | 1-minute (standard), 1-second (high-res) | 15 months (aggregated over time) |
| Logs | Structured or unstructured text events | Sub-second ingestion | Configurable (never expires by default) |
| Traces | Distributed request flows (X-Ray) | Per-request spans | 30 days |
| Events (EventBridge) | State-change notifications from services | Near real-time | Routed, not stored in CW |
Metrics are the most critical concept. AWS services publish metrics automatically into namespaces like AWS/EC2 or AWS/RDS. You can also publish custom metrics using the PutMetricData API or the CloudWatch agent.
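As a rough illustration of publishing a custom metric, the sketch below builds a data point in the shape the PutMetricData API accepts and sends it with boto3. The namespace `MyApp/Checkout`, the metric name `CheckoutLatency`, and the `Service` dimension are hypothetical names chosen for the example, and the `publish` call is left commented out because it needs boto3 and AWS credentials.

```python
from datetime import datetime, timezone


def build_metric_datum(name, value, unit="Count", dimensions=None, high_resolution=False):
    """Build one MetricData entry in the shape PutMetricData expects."""
    return {
        "MetricName": name,
        "Dimensions": [{"Name": k, "Value": v} for k, v in (dimensions or {}).items()],
        "Timestamp": datetime.now(timezone.utc),
        "Value": float(value),
        "Unit": unit,
        # 1 = high resolution (1-second), 60 = standard resolution (1-minute)
        "StorageResolution": 1 if high_resolution else 60,
    }


def publish(namespace, data):
    """Send data points to CloudWatch. Requires boto3 and AWS credentials."""
    import boto3  # imported lazily so the builder above works without boto3

    boto3.client("cloudwatch").put_metric_data(Namespace=namespace, MetricData=data)


datum = build_metric_datum(
    "CheckoutLatency",                  # hypothetical metric name
    182.5,
    unit="Milliseconds",
    dimensions={"Service": "checkout"},  # hypothetical dimension
)
# publish("MyApp/Checkout", [datum])    # hypothetical namespace; uncomment to send
```

Keeping the payload builder separate from the network call makes it easy to unit-test what would be sent before any data reaches CloudWatch.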
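Standard-resolution metrics have 1-minute granularity, although some services publish less often by default (EC2 basic monitoring reports every 5 minutes unless detailed monitoring is enabled). High-resolution custom metrics support 1-second resolution but typically cost more to publish and alarm on. For most operational alarms, 1-minute data is sufficient; reserve high resolution for latency-sensitive workloads.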
CloudWatch rolls metrics up into coarser tiers as they age: data points with a period shorter than 60 seconds are available for 3 hours, 1-minute data points for 15 days, 5-minute data points for 63 days, and 1-hour data points for 455 days (about 15 months).
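To make the retention tiers concrete, here is a small sketch that answers "what is the finest period I can still query for a data point of a given age?". It only encodes the four tiers listed above; the function name and the `published_period` parameter are illustrative and not part of any AWS API.

```python
from datetime import timedelta

# (maximum age, finest period in seconds still retained), from the tiers above
RETENTION_TIERS = [
    (timedelta(hours=3), 1),       # sub-minute (high-resolution) data points
    (timedelta(days=15), 60),      # 1-minute data points
    (timedelta(days=63), 300),     # 5-minute data points
    (timedelta(days=455), 3600),   # 1-hour data points
]


def finest_period_seconds(age, published_period=60):
    """Return the smallest queryable period for a data point of this age, or None if expired."""
    for max_age, period in RETENTION_TIERS:
        if age <= max_age:
            # Data is never available at a finer period than it was published at.
            return max(period, published_period)
    return None


print(finest_period_seconds(timedelta(hours=1), published_period=1))
print(finest_period_seconds(timedelta(days=20)))
print(finest_period_seconds(timedelta(days=500)))
```

The practical consequence is that a dashboard looking back 30 days cannot show 1-minute detail, even for a metric that was published every minute.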
Alarms, Composite Alarms, and Automated Actions
A CloudWatch Alarm watches a single metric or a math expression and transitions between three states: OK, ALARM, and INSUFFICIENT_DATA. When state changes occur, alarms can trigger SNS notifications, Auto Scaling actions, EC2 actions (reboot, stop, terminate, recover), or Systems Manager OpsItems.
| Alarm Type | Description | Use Case |
|---|---|---|
| Static threshold | Triggers when metric crosses a fixed value | CPU > 80% for 5 minutes |
| Anomaly detection | ML-based band around expected value | Detect unusual traffic patterns without knowing exact thresholds |
| Metric math alarm | Alarm on calculated expression of multiple metrics | Error rate = errors/requests > 1% |
| Composite alarm | Combine multiple alarms with AND/OR logic | Reduce alert noise - only page if CPU high AND memory high |
Composite alarms are underused but powerful. They let you suppress noisy individual alarms by only triggering when multiple conditions are true simultaneously. This dramatically reduces false positives in production alerting.
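A composite alarm is defined by a rule expression that references other alarms by name. The sketch below builds an "all of these must be in ALARM" rule and shows how it might be passed to boto3's `put_composite_alarm`. The alarm name `high-memory-web-server` is a hypothetical companion to the CPU alarm used elsewhere in this section, and `create_composite_alarm` is not called here because it needs boto3 and AWS credentials.

```python
def all_in_alarm(*alarm_names):
    """Build an AlarmRule that is true only when every named alarm is in ALARM."""
    return " AND ".join(f'ALARM("{name}")' for name in alarm_names)


def create_composite_alarm(name, rule, sns_topic_arn):
    """Create the composite alarm. Requires boto3 and AWS credentials."""
    import boto3  # imported lazily so the rule builder works without boto3

    boto3.client("cloudwatch").put_composite_alarm(
        AlarmName=name,
        AlarmRule=rule,
        AlarmActions=[sns_topic_arn],
    )


# "high-memory-web-server" is a hypothetical second alarm.
rule = all_in_alarm("high-cpu-web-server", "high-memory-web-server")
print(rule)
```

With this arrangement the underlying CPU and memory alarms can have no notification actions of their own, so only the composite alarm pages the on-call engineer.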
An alarm that has never received data stays in INSUFFICIENT_DATA, not ALARM. If you alarm on a metric that is only emitted intermittently (for example, the Lambda Errors metric while a function receives no invocations), set the missing-data behavior to notBreaching (TreatMissingData in the API, --treat-missing-data in the CLI, treat_missing_data in Terraform) so the alarm stays OK instead of flipping to INSUFFICIENT_DATA.
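The effect of the missing-data setting is easier to see with a toy model. The sketch below is a deliberately simplified evaluator, not CloudWatch's actual algorithm (the real service also looks further back through an evaluation range and supports M-out-of-N alarms). It takes the most recent evaluation periods, with `None` standing for a period that has no data, and requires every period to breach.

```python
def evaluate_alarm(window, threshold, treat_missing_data="missing",
                   current_state="INSUFFICIENT_DATA"):
    """Simplified alarm evaluation; window holds recent values, None = no data."""
    if all(value is None for value in window):
        # No data at all in the window: the setting alone decides the state.
        return {
            "missing": "INSUFFICIENT_DATA",
            "ignore": current_state,
            "notBreaching": "OK",
            "breaching": "ALARM",
        }[treat_missing_data]

    breaching = []
    for value in window:
        if value is None:
            if treat_missing_data == "breaching":
                breaching.append(True)
            elif treat_missing_data == "notBreaching":
                breaching.append(False)
            # "missing" and "ignore": skip the gap in this simplified model
        else:
            breaching.append(value > threshold)
    return "ALARM" if all(breaching) else "OK"


quiet_function = [None, None]  # e.g. a metric that emitted nothing for two periods
print(evaluate_alarm(quiet_function, threshold=0))
print(evaluate_alarm(quiet_function, threshold=0, treat_missing_data="notBreaching"))
```

In this model the same silent metric lands in INSUFFICIENT_DATA under the default setting and in OK under notBreaching, which is the difference the paragraph above describes.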
# Create an alarm on CPU utilization
aws cloudwatch put-metric-alarm \
--alarm-name "high-cpu-web-server" \
--metric-name CPUUtilization \
--namespace AWS/EC2 \
--dimensions Name=InstanceId,Value=i-0123456789abcdef0 \
--statistic Average \
--period 300 \
--threshold 80 \
--comparison-operator GreaterThanThreshold \
--evaluation-periods 2 \
--treat-missing-data notBreaching \
--alarm-actions arn:aws:sns:us-east-1:123456789012:ops-alerts
CloudWatch Logs, Logs Insights, and the CW Agent
CloudWatch Logs stores log events in Log Groups, each containing multiple Log Streams. A Log Group typically corresponds to one application or service. A Log Stream corresponds to one source within it (one Lambda execution environment, one EC2 instance).
CloudWatch Logs Insights provides a query language to search and analyze log data. It's far more powerful than the basic filter and can aggregate across log groups.
# Logs Insights: find the top 10 slowest Lambda invocations
fields @timestamp, @duration, @requestId
| filter @type = "REPORT"
| sort @duration desc
| limit 10
# Count errors per minute
filter @message like /ERROR/
| stats count(*) as error_count by bin(1m)
| sort @timestamp asc
The CloudWatch Agent must be installed on EC2 instances (and on-premises servers) to collect system-level metrics like memory utilization, disk usage, and swap - these are not available from the hypervisor. The agent also ships log files to CloudWatch Logs.
| Metric | Available Without Agent | Available With Agent |
|---|---|---|
| CPU Utilization | Yes (AWS/EC2) | Yes (CWAgent namespace) |
| Network In/Out | Yes (AWS/EC2) | Yes |
| Memory Utilization | No | Yes (CWAgent) |
| Disk Space Used % | No | Yes (CWAgent) |
| Disk I/O | Partial (DiskReadOps) | Full detail |
| Custom app log files | No | Yes |
The absence of memory and disk-space metrics is one of the most common CloudWatch interview topics. Interviewers frequently ask why you cannot see memory usage for EC2 without the agent. The answer: the AWS hypervisor can observe CPU and network activity from outside the VM, but memory and filesystem usage are visible only from inside the guest OS.
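To show what "installing the agent" involves in practice, the sketch below generates a minimal agent configuration that collects memory, disk, and swap usage and ships one log file. It is a starting point under stated assumptions rather than a complete configuration: the file path `/var/log/myapp/app.log` and the log group `/myapp/production` are hypothetical, and where the resulting JSON is saved depends on how the agent is installed.

```python
import json


def agent_config(log_file, log_group, interval_seconds=60):
    """Build a minimal CloudWatch agent config for memory, disk, swap, and one log file."""
    return {
        "metrics": {
            "namespace": "CWAgent",
            "metrics_collected": {
                "mem": {
                    "measurement": ["mem_used_percent"],
                    "metrics_collection_interval": interval_seconds,
                },
                "disk": {
                    "measurement": ["used_percent"],
                    "resources": ["/"],  # mount points to report on
                    "metrics_collection_interval": interval_seconds,
                },
                "swap": {"measurement": ["swap_used_percent"]},
            },
        },
        "logs": {
            "logs_collected": {
                "files": {
                    "collect_list": [
                        {
                            "file_path": log_file,
                            "log_group_name": log_group,
                            # the agent substitutes the EC2 instance ID here
                            "log_stream_name": "{instance_id}",
                        }
                    ]
                }
            }
        },
    }


# Hypothetical path and log group, for illustration only.
config = agent_config("/var/log/myapp/app.log", "/myapp/production")
print(json.dumps(config, indent=2))
```

Using the instance ID as the stream name gives one Log Stream per EC2 instance inside the shared Log Group.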
Dashboards, Container Insights, and Application Insights
CloudWatch Dashboards provide customizable views across metrics and alarms. They can include graphs, number widgets, alarm status widgets, and Logs Insights query results. Dashboards can be shared publicly (read-only) or accessed via cross-account sharing.
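Dashboards are defined by a JSON body, so they can be generated from code and kept in version control. The sketch below assembles a two-widget body that could be passed to boto3's `put_dashboard`. The dashboard name `ops-overview` and the RDS identifier `my-db` are hypothetical, and the API call is commented out because it needs boto3 and AWS credentials.

```python
import json


def metric_widget(title, metrics, x, y, region="us-east-1", stat="Average", period=300):
    """Build one metric graph widget positioned on the 24-column dashboard grid."""
    return {
        "type": "metric",
        "x": x,
        "y": y,
        "width": 12,
        "height": 6,
        "properties": {
            "title": title,
            "metrics": metrics,  # each entry: [namespace, metric, dimension name, dimension value]
            "stat": stat,
            "period": period,
            "region": region,
            "view": "timeSeries",
        },
    }


def dashboard_body(widgets):
    """Serialize the widgets into the JSON string the dashboard API expects."""
    return json.dumps({"widgets": widgets})


body = dashboard_body([
    metric_widget("Web CPU",
                  [["AWS/EC2", "CPUUtilization", "InstanceId", "i-0123456789abcdef0"]],
                  x=0, y=0),
    metric_widget("DB connections",  # "my-db" is a hypothetical RDS instance identifier
                  [["AWS/RDS", "DatabaseConnections", "DBInstanceIdentifier", "my-db"]],
                  x=12, y=0),
])
# import boto3
# boto3.client("cloudwatch").put_dashboard(DashboardName="ops-overview", DashboardBody=body)
```

The two widgets sit side by side because each is 12 columns wide and the second starts at column 12.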
Container Insights is a specialized feature for ECS and EKS that collects CPU, memory, network, and disk metrics at the cluster, service, task, and container level using an embedded agent. It also includes pre-built dashboards.
Application Insights automatically discovers application components (EC2, RDS, ELB, etc.), establishes baselines, and creates alarms for anomalies. It's particularly useful for .NET and SQL Server workloads on EC2.
| Feature | Best For | Extra Cost? |
|---|---|---|
| Standard dashboards | Custom operational views | Yes - $3/dashboard/month |
| Container Insights | ECS/EKS container-level metrics | Yes - pay per metric and log |
| Application Insights | Auto-configured monitoring for known app stacks | Yes - per resource monitored |
| Anomaly Detection | Adaptive alarms without fixed thresholds | Yes - per metric per month |
| Contributor Insights | Find top contributors in log data | Yes - per rule per million events |
Pricing Model and Cost Optimization
CloudWatch pricing has multiple dimensions. Costs can grow unexpectedly if you're not careful about custom metrics, log volume, and API call frequency.
| Component | Free Tier | Paid Rate (us-east-1) |
|---|---|---|
| Metrics (custom) | First 10 metrics free | $0.30/metric/month |
| API requests | First 1M/month free (does not cover GetMetricData) | $0.01 per 1,000 requests; GetMetricData is $0.01 per 1,000 metrics requested |
| Alarms | First 10 alarms free | $0.10/alarm/month (standard) |
| Log data ingestion | 5 GB/month free | $0.50/GB ingested |
| Log data storage | 5 GB/month free | $0.03/GB/month |
| Logs Insights queries | N/A | $0.005/GB data scanned |
| Dashboards | First 3 free | $3/dashboard/month |
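A back-of-the-envelope estimate shows how log volume and retention drive the bill. The sketch below uses only the ingestion and storage rates and the free allowances from the table above. It assumes a steady daily volume, applies the 5 GB allowance to ingestion and storage separately as the table lists them, and ignores compression of stored logs, so treat the output as an upper-bound approximation rather than a billing calculation.

```python
# Rates and free allowances copied from the table above (us-east-1).
INGEST_PER_GB = 0.50
STORAGE_PER_GB_MONTH = 0.03
FREE_GB = 5


def monthly_log_cost(daily_ingest_gb, retention_days, days_in_month=30):
    """Return (ingestion_cost, storage_cost) in USD for one month at steady state."""
    ingested_gb = daily_ingest_gb * days_in_month
    # At steady state the log group holds roughly retention_days worth of data.
    stored_gb = daily_ingest_gb * retention_days
    ingest_cost = max(ingested_gb - FREE_GB, 0) * INGEST_PER_GB
    storage_cost = max(stored_gb - FREE_GB, 0) * STORAGE_PER_GB_MONTH
    return round(ingest_cost, 2), round(storage_cost, 2)


for days in (7, 30, 365):
    ingest, storage = monthly_log_cost(daily_ingest_gb=10, retention_days=days)
    print(f"retention={days:>3}d  ingestion=${ingest:.2f}  storage=${storage:.2f}")
```

Under these assumptions ingestion dominates the bill, so reducing what is logged saves more than shortening retention, while retention mainly controls how storage cost accumulates over time.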
Log costs are often the largest CloudWatch expense. Set a retention policy on every Log Group - the default is to never expire. For most debug logs, 7-30 days is sufficient. For compliance logs, deliver a copy to S3 (for example, through a subscription filter feeding Kinesis Data Firehose, or an export task) and let an S3 lifecycle rule transition the objects to Glacier after 90 days.
# Set retention policy on a log group (never let logs accumulate indefinitely)
aws logs put-retention-policy \
--log-group-name /aws/lambda/my-function \
--retention-in-days 30
# Find log groups with no retention policy set
aws logs describe-log-groups \
--query 'logGroups[?!retentionInDays].logGroupName' \
--output text
Interview Focus Points
1. Why can't you see memory utilization for EC2 instances in CloudWatch by default, and how do you fix it?
2. What is the difference between a CloudWatch metric, a log, and a trace - when would you use each?
3. Explain composite alarms and give a real scenario where they reduce alert noise.
4. How would you detect an anomalous spike in API error rate without knowing the expected baseline?
5. What happens to a CloudWatch alarm when it has no data - what is treat_missing_data and why does it matter?
6. How do you control CloudWatch Logs costs in a high-throughput microservices architecture?
7. What is the difference between CloudWatch Logs Insights and CloudWatch Metrics Insights?
8. How would you set up a unified operational dashboard for a multi-tier application (ALB, EC2, RDS)?
9. Explain how CloudWatch integrates with Auto Scaling to scale EC2 instances based on a custom application metric.