AWS Monitoring & Management

CloudWatch

Collect and track metrics, logs, and traces; create alarms and automated responses

Amazon CloudWatch is the central observability service for AWS, collecting metrics, logs, and traces from virtually every AWS service and your own applications. It enables you to monitor infrastructure health, set alarms on thresholds, and trigger automated remediation - making it the foundation of any production operations strategy on AWS.

How CloudWatch Collects and Stores Observability Data

CloudWatch is organized around four core data types, each with its own ingestion path, storage model, and retention behavior:

| Data Type | What It Is | Granularity | Default Retention |
| --- | --- | --- | --- |
| Metrics | Numeric time-series data points | 1-minute (standard), 1-second (high-res) | 15 months (aggregated over time) |
| Logs | Structured or unstructured text events | Sub-second ingestion | Configurable (never expires by default) |
| Traces | Distributed request flows (X-Ray) | Per-request spans | 30 days |
| Events (EventBridge) | State-change notifications from services | Near real-time | Routed, not stored in CW |

Metrics are the most critical concept. AWS services publish metrics automatically into namespaces like AWS/EC2 or AWS/RDS. You can also publish custom metrics using the PutMetricData API or the CloudWatch agent.
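
As a rough illustration of what a custom metric looks like on the wire, the sketch below builds one entry in the shape that PutMetricData accepts. The metric name, namespace, and dimension values are hypothetical, and the helper `build_metric_datum` is not part of any SDK. Nothing is sent to AWS here; the comment at the end shows how the entry could be passed to boto3 if it is installed and credentials are configured.

```python
from datetime import datetime, timezone


def build_metric_datum(name, value, unit="Count", dimensions=None, high_resolution=False):
    """Return one MetricData entry in the shape PutMetricData expects."""
    return {
        "MetricName": name,
        # Dimensions are name/value pairs that identify the metric stream.
        "Dimensions": [
            {"Name": key, "Value": val}
            for key, val in sorted((dimensions or {}).items())
        ],
        "Timestamp": datetime.now(timezone.utc),
        "Value": float(value),
        "Unit": unit,
        # 1 = high-resolution (1-second), 60 = standard (1-minute).
        "StorageResolution": 1 if high_resolution else 60,
    }


# Hypothetical application metric, for illustration only.
datum = build_metric_datum(
    "CheckoutLatency",
    182.5,
    unit="Milliseconds",
    dimensions={"Service": "checkout", "Stage": "prod"},
)
print(datum["MetricName"], datum["StorageResolution"])

# With boto3 available, the entry could be published like this:
#   boto3.client("cloudwatch").put_metric_data(
#       Namespace="MyApp/Checkout", MetricData=[datum])
```

Each unique combination of namespace, metric name, and dimension values counts as a separate custom metric, so adding high-cardinality dimensions multiplies the number of metrics you pay for.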

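💡 Standard-resolution metrics have 1-minute granularity (EC2 basic monitoring publishes every 5 minutes unless detailed monitoring is enabled). High-resolution custom metrics support 1-second granularity, but they generate more PutMetricData traffic, and alarms that evaluate them at sub-minute periods are billed at a higher rate. For most operational alarms, 1-minute is sufficient. Use high-resolution only for latency-sensitive workloads.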

Metric retention is tiered by period, and CloudWatch aggregates data as it ages: data points with a period shorter than 60 seconds (high-resolution) are available for 3 hours, 1-minute data points for 15 days, 5-minute data points for 63 days, and 1-hour data points for 455 days (about 15 months).
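
The retention tiers can be read as a lookup from data age to the finest period you can still query. The following is a minimal sketch of that lookup; the function name and the `published_period` parameter are illustrative and do not correspond to any CloudWatch API.

```python
from datetime import timedelta

# (maximum age, finest period in seconds still retained at that age)
RETENTION_TIERS = [
    (timedelta(hours=3), 1),
    (timedelta(days=15), 60),
    (timedelta(days=63), 300),
    (timedelta(days=455), 3600),
]


def finest_available_period(age, published_period=60):
    """Return the smallest period (seconds) retrievable for data of this age.

    Returns None once the data is older than the longest tier.
    A metric can never be finer than the period it was published at.
    """
    for max_age, period in RETENTION_TIERS:
        if age <= max_age:
            return max(period, published_period)
    return None


print(finest_available_period(timedelta(hours=1), published_period=1))  # 1
print(finest_available_period(timedelta(days=30)))                      # 300
print(finest_available_period(timedelta(days=500)))                     # None
```

In practice this means a dashboard that looks back 30 days cannot show 1-minute detail, even if the metric was originally published every minute.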

Alarms, Composite Alarms, and Automated Actions

A CloudWatch Alarm watches a single metric or a math expression and transitions between three states: OK, ALARM, and INSUFFICIENT_DATA. When state changes occur, alarms can trigger SNS notifications, Auto Scaling actions, EC2 actions (reboot, stop, terminate, recover), or Systems Manager OpsItems.

| Alarm Type | Description | Use Case |
| --- | --- | --- |
| Static threshold | Triggers when metric crosses a fixed value | CPU > 80% for 5 minutes |
| Anomaly detection | ML-based band around expected value | Detect unusual traffic patterns without knowing exact thresholds |
| Metric math alarm | Alarm on calculated expression of multiple metrics | Error rate = errors/requests > 1% |
| Composite alarm | Combine multiple alarms with AND/OR logic | Reduce alert noise: only page if CPU high AND memory high |

Composite alarms are underused but powerful. They let you suppress noisy individual alarms by only triggering when multiple conditions are true simultaneously. This dramatically reduces false positives in production alerting.
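
To make the AND logic concrete, here is a small sketch that builds a composite alarm rule expression and evaluates the same rule locally against a set of child alarm states. The alarm names `cpu-high` and `memory-high` are hypothetical, and the local evaluator is only a model of the behaviour, not how CloudWatch implements it.

```python
def and_rule(*alarm_names):
    """Build a rule expression that fires only when all named alarms are in ALARM."""
    return " AND ".join(f'ALARM("{name}")' for name in alarm_names)


def evaluate_and_rule(alarm_names, states):
    """Evaluate the AND rule locally against a dict of alarm name -> state."""
    all_firing = all(states.get(name) == "ALARM" for name in alarm_names)
    return "ALARM" if all_firing else "OK"


children = ("cpu-high", "memory-high")  # hypothetical child alarms
rule = and_rule(*children)
print(rule)  # ALARM("cpu-high") AND ALARM("memory-high")

# Only CPU is high: the composite stays quiet, so nobody gets paged.
print(evaluate_and_rule(children, {"cpu-high": "ALARM", "memory-high": "OK"}))
# Both conditions hold: the composite fires.
print(evaluate_and_rule(children, {"cpu-high": "ALARM", "memory-high": "ALARM"}))
```

The generated string is the kind of expression you would supply as the alarm rule when creating a composite alarm. The child alarms then typically have no notification actions of their own, so only the composite pages the on-call engineer.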

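⚠️ An alarm that has never received data stays in INSUFFICIENT_DATA, not ALARM. With the default missing-data setting (missing), an alarm also moves to INSUFFICIENT_DATA when no data points arrive in its evaluation window. If you alarm on a metric that is not always published (for example, a Lambda function's Errors metric while the function is idle, or a custom error counter emitted only when an error occurs), set the missing-data treatment to notBreaching so the alarm stays OK instead of flipping to INSUFFICIENT_DATA. The setting is called --treat-missing-data in the CLI, TreatMissingData in the API, and treat_missing_data in Terraform.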

```bash
# Create an alarm on CPU utilization
aws cloudwatch put-metric-alarm \
  --alarm-name "high-cpu-web-server" \
  --metric-name CPUUtilization \
  --namespace AWS/EC2 \
  --dimensions Name=InstanceId,Value=i-0123456789abcdef0 \
  --statistic Average \
  --period 300 \
  --threshold 80 \
  --comparison-operator GreaterThanThreshold \
  --evaluation-periods 2 \
  --treat-missing-data notBreaching \
  --alarm-actions arn:aws:sns:us-east-1:123456789012:ops-alerts
```
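
The effect of the missing-data setting is easier to see in a toy model. The sketch below is a deliberately simplified evaluator, assuming a GreaterThanThreshold comparison where every period in the window must breach. Real CloudWatch evaluation also looks further back in time and supports "M out of N" data points, which this sketch ignores. `None` stands for a period with no data.

```python
def evaluate_alarm(datapoints, threshold, evaluation_periods,
                   treat_missing_data="missing",
                   current_state="INSUFFICIENT_DATA"):
    """Simplified alarm evaluation over the most recent periods."""
    # Pad with None so a short history counts as missing data.
    padded = [None] * evaluation_periods + list(datapoints)
    window = padded[-evaluation_periods:]
    present = [value for value in window if value is not None]
    missing = len(window) - len(present)

    if missing and treat_missing_data == "ignore":
        return current_state  # keep whatever state the alarm was in
    if not present and treat_missing_data == "missing":
        return "INSUFFICIENT_DATA"

    breaching = sum(value > threshold for value in present)
    if treat_missing_data == "breaching":
        breaching += missing  # missing periods count as bad
    # With notBreaching, missing periods count as good, so they add nothing.
    return "ALARM" if breaching == len(window) else "OK"


print(evaluate_alarm([85, 92], threshold=80, evaluation_periods=2))
print(evaluate_alarm([None, None], threshold=80, evaluation_periods=2))
print(evaluate_alarm([None, None], threshold=80, evaluation_periods=2,
                     treat_missing_data="notBreaching"))
```

Under these assumptions the three calls yield ALARM, INSUFFICIENT_DATA, and OK respectively, which mirrors the difference between the default setting and notBreaching.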

CloudWatch Logs, Logs Insights, and the CW Agent

CloudWatch Logs stores log events in Log Groups, each containing multiple Log Streams. A Log Group typically corresponds to one application or service. A Log Stream is a single source within that group, such as one EC2 instance or one Lambda execution environment.

CloudWatch Logs Insights provides a query language to search and analyze log data. It's far more powerful than the basic filter and can aggregate across log groups.

```
# Logs Insights: find the top 10 slowest Lambda invocations
fields @timestamp, @duration, @requestId
| filter @type = "REPORT"
| sort @duration desc
| limit 10

# Count errors per minute
filter @message like /ERROR/
| stats count(*) as error_count by bin(1m)
```
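
For readers less familiar with the query language, the "count errors per minute" query does roughly what the following Python does over a list of events: keep the messages that contain ERROR, truncate each timestamp to the minute, and count per bucket. The sample events are made up for illustration.

```python
from collections import Counter
from datetime import datetime


def errors_per_minute(events):
    """Count ERROR messages per 1-minute bucket, ordered by time.

    events: iterable of (ISO-8601 timestamp string, message string).
    """
    counts = Counter()
    for timestamp, message in events:
        if "ERROR" in message:
            # Truncate to the start of the minute, like bin(1m).
            minute = datetime.fromisoformat(timestamp).replace(second=0, microsecond=0)
            counts[minute] += 1
    return dict(sorted(counts.items()))


sample_events = [
    ("2024-05-01T12:00:05", "ERROR timeout calling payments"),
    ("2024-05-01T12:00:40", "INFO request ok"),
    ("2024-05-01T12:00:59", "ERROR db connection reset"),
    ("2024-05-01T12:01:10", "ERROR timeout calling payments"),
]

for minute, error_count in errors_per_minute(sample_events).items():
    print(minute.isoformat(), error_count)
```

Logs Insights runs the equivalent aggregation server-side across whole log groups, and you pay for the volume of data scanned rather than for compute time.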

The CloudWatch Agent must be installed on EC2 instances (and on-premises servers) to collect system-level metrics like memory utilization, disk usage, and swap - these are not available from the hypervisor. The agent also ships log files to CloudWatch Logs.

| Metric | Available Without Agent | Available With Agent |
| --- | --- | --- |
| CPU Utilization | Yes (AWS/EC2) | Yes (CWAgent namespace) |
| Network In/Out | Yes (AWS/EC2) | Yes |
| Memory Utilization | No | Yes (CWAgent) |
| Disk Space Used % | No | Yes (CWAgent) |
| Disk I/O | Partial (DiskReadOps) | Full detail |
| Custom app log files | No | Yes |

⚠️ Memory and disk utilization are among the most common CloudWatch interview topics. Interviewers frequently ask why you cannot see memory usage for EC2 without the agent. The answer: the hypervisor can measure CPU and network activity from outside the VM, but it only knows how much memory is allocated to the instance. How much of that memory the guest OS is actually using is visible only from inside the OS, which is where the agent runs.
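
As a sketch of how the agent is told what to collect, the snippet below generates a small agent configuration covering memory, disk, swap, and one log file. The key names follow the agent's JSON configuration format as I understand it (Linux metric names), so verify them against the agent documentation before use. The log file path and log group name are hypothetical.

```python
import json


def build_agent_config(log_file, log_group):
    """Return a minimal CloudWatch agent configuration as a dict."""
    return {
        "metrics": {
            "namespace": "CWAgent",
            "metrics_collected": {
                # In-guest metrics the hypervisor cannot see.
                "mem": {"measurement": ["mem_used_percent"]},
                "disk": {"measurement": ["used_percent"], "resources": ["/"]},
                "swap": {"measurement": ["swap_used_percent"]},
            },
        },
        "logs": {
            "logs_collected": {
                "files": {
                    "collect_list": [
                        {
                            "file_path": log_file,
                            "log_group_name": log_group,
                            # One log stream per instance.
                            "log_stream_name": "{instance_id}",
                        }
                    ]
                }
            }
        },
    }


# Hypothetical path and log group, for illustration only.
config = build_agent_config("/var/log/myapp/app.log", "/myapp/production")
print(json.dumps(config, indent=2))
```

The resulting JSON would be saved to a file on the instance and loaded by the agent at startup. Generating it from code makes it easier to keep the same metric set across a fleet.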

Dashboards, Container Insights, and Application Insights

CloudWatch Dashboards provide customizable views across metrics and alarms. They can include graphs, number widgets, alarm status widgets, and Logs Insights query results. Dashboards can be shared publicly (read-only) or accessed via cross-account sharing.
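
Dashboards are defined by a JSON body made of widgets placed on a grid. The sketch below assembles such a body for two metric widgets; the helper `metric_widget`, the instance ID, and the database identifier are illustrative assumptions, and the region is a placeholder.

```python
import json


def metric_widget(title, metrics, x, y, region="us-east-1", stat="Average", period=300):
    """Return one metric widget occupying half the dashboard width."""
    return {
        "type": "metric",
        "x": x,
        "y": y,
        "width": 12,
        "height": 6,
        "properties": {
            "title": title,
            # Each entry: [namespace, metric name, dimension name, dimension value]
            "metrics": metrics,
            "stat": stat,
            "period": period,
            "region": region,
            "view": "timeSeries",
        },
    }


dashboard_body = {
    "widgets": [
        metric_widget(
            "Web server CPU",
            [["AWS/EC2", "CPUUtilization", "InstanceId", "i-0123456789abcdef0"]],
            x=0, y=0,
        ),
        metric_widget(
            "Database CPU",
            [["AWS/RDS", "CPUUtilization", "DBInstanceIdentifier", "my-database"]],
            x=12, y=0,
        ),
    ]
}

print(json.dumps(dashboard_body, indent=2))
```

The printed JSON could then be supplied as the dashboard body when creating the dashboard, for example through `aws cloudwatch put-dashboard`. Keeping the body in version control lets you recreate the same dashboard in another account or region.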

Container Insights is a specialized feature for ECS and EKS that collects CPU, memory, network, and disk metrics at the cluster, service, task, and container level using an embedded agent. It also includes pre-built dashboards.

Application Insights automatically discovers application components (EC2, RDS, ELB, etc.), establishes baselines, and creates alarms for anomalies. It's particularly useful for .NET and SQL Server workloads on EC2.

| Feature | Best For | Extra Cost? |
| --- | --- | --- |
| Standard dashboards | Custom operational views | Yes, $3/dashboard/month |
| Container Insights | ECS/EKS container-level metrics | Yes, pay per metric and log |
| Application Insights | Auto-configured monitoring for known app stacks | Yes, per resource monitored |
| Anomaly Detection | Adaptive alarms without fixed thresholds | Yes, per metric per month |
| Contributor Insights | Find top contributors in log data | Yes, per rule per million events |

Pricing Model and Cost Optimization

CloudWatch pricing has multiple dimensions. Costs can grow unexpectedly if you're not careful about custom metrics, log volume, and API call frequency.

| Component | Free Tier | Paid Rate (us-east-1) |
| --- | --- | --- |
| Metrics (custom) | First 10 metrics free | $0.30/metric/month |
| API requests | First 1M/month free (GetMetricData is excluded) | $0.01 per 1,000 requests; GetMetricData is $0.01 per 1,000 metrics requested |
| Alarms | First 10 alarms free | $0.10/alarm/month (standard) |
| Log data ingestion | 5 GB/month free | $0.50/GB ingested |
| Log data storage | 5 GB/month free | $0.03/GB/month |
| Logs Insights queries | N/A | $0.005/GB data scanned |
| Dashboards | First 3 free | $3/dashboard/month |
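
To see how these dimensions add up, here is a back-of-the-envelope estimator that applies the free tier and paid rates from the table. It is a sketch only: it ignores API requests, Logs Insights scans, and volume discounts, and the usage figures in the example are invented.

```python
# (free allowance per month, paid rate in USD per unit), taken from the table
RATES = {
    "custom_metrics": (10, 0.30),
    "alarms": (10, 0.10),
    "log_ingestion_gb": (5, 0.50),
    "log_storage_gb": (5, 0.03),
    "dashboards": (3, 3.00),
}


def estimate_monthly_cost(usage):
    """Return a per-component cost breakdown plus a total, in USD."""
    breakdown = {}
    for item, quantity in usage.items():
        free_allowance, rate = RATES[item]
        billable = max(quantity - free_allowance, 0)
        breakdown[item] = round(billable * rate, 2)
    breakdown["total"] = round(sum(breakdown.values()), 2)
    return breakdown


# Invented usage for a mid-sized microservices environment.
usage = {
    "custom_metrics": 200,
    "alarms": 50,
    "log_ingestion_gb": 500,
    "log_storage_gb": 1000,
    "dashboards": 5,
}

for item, cost in estimate_monthly_cost(usage).items():
    print(f"{item:18s} ${cost:,.2f}")
```

With these example numbers, log ingestion accounts for well over half of the total, which is typical of the pattern where log volume rather than metrics or alarms drives the bill.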
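
💡 Log costs are often the largest CloudWatch expense. Set a retention policy on every Log Group, because the default is to never expire. For most debug logs, 7-30 days is sufficient. For compliance logs, copy them to S3 (with a subscription filter that streams to S3 through Amazon Data Firehose, or with an export task) and use an S3 lifecycle rule to transition the objects to S3 Glacier after 90 days.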

```bash
# Set retention policy on a log group (never let logs accumulate indefinitely)
aws logs put-retention-policy \
  --log-group-name /aws/lambda/my-function \
  --retention-in-days 30

# Find log groups with no retention policy set
# (single quotes keep interactive bash from treating "!" as history expansion)
aws logs describe-log-groups \
  --query 'logGroups[?!retentionInDays].logGroupName' \
  --output text
```

🎯 Interview Focus Points

  1. Why can't you see memory utilization for EC2 instances in CloudWatch by default, and how do you fix it?
  2. What is the difference between a CloudWatch metric, a log, and a trace, and when would you use each?
  3. Explain composite alarms and give a real scenario where they reduce alert noise.
  4. How would you detect an anomalous spike in API error rate without knowing the expected baseline?
  5. What happens to a CloudWatch alarm when it has no data? What is treat_missing_data and why does it matter?
  6. How do you control CloudWatch Logs costs in a high-throughput microservices architecture?
  7. What is the difference between CloudWatch Logs Insights and CloudWatch Metrics Insights?
  8. How would you set up a unified operational dashboard for a multi-tier application (ALB, EC2, RDS)?
  9. Explain how CloudWatch integrates with Auto Scaling to scale EC2 instances based on a custom application metric.