AWS Developer Tools & CI/CD

X-Ray

Distributed tracing to analyze latency and debug microservices and serverless apps

AWS X-Ray is a distributed tracing service that helps you analyze and debug distributed applications, including microservices and serverless architectures. It collects data about requests as they travel through your application, maps service dependencies automatically, and helps you identify performance bottlenecks, errors, and throttling in complex multi-service systems.

Core Concepts: Traces, Segments, and Subsegments

X-Ray organizes telemetry data into a hierarchy:

| Concept | Description | Generated By |
|---|---|---|
| Trace | End-to-end request journey through all services | X-Ray SDK (trace ID header) |
| Segment | Work done by a single service for one request | X-Ray SDK in each service |
| Subsegment | Granular unit within a segment (DB call, HTTP request) | SDK auto-instrumentation or manual |
| Annotation | Key-value pair indexed for search/filtering | Developer adds to segment |
| Metadata | Non-indexed key-value data for debugging | Developer adds to segment |
| Service graph | Visual map of services and their connections | Generated from trace data |
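
To make the hierarchy concrete, here is a minimal sketch of a segment document built as a plain Python dict. The field names follow the X-Ray segment document format as I understand it. The `make_segment` helper, the service name, and the sample annotation and metadata values are illustrative assumptions, not SDK code.

```python
import json
import os
import time


def new_trace_id(now=None):
    """Trace ID: version 1, 8 hex digits of epoch time, 24 random hex digits."""
    epoch = int(now if now is not None else time.time())
    return f"1-{epoch:08x}-{os.urandom(12).hex()}"


def new_entity_id():
    """Segment and subsegment IDs are 16 hex digits."""
    return os.urandom(8).hex()


def make_segment(service, trace_id, start, end):
    """Hypothetical helper: one segment holding one subsegment, an annotation, and metadata."""
    return {
        "name": service,
        "id": new_entity_id(),
        "trace_id": trace_id,
        "start_time": start,
        "end_time": end,
        "annotations": {"order_id": "ORD-12345"},                 # indexed, filterable
        "metadata": {"default": {"order_detail": {"items": 3}}},  # not indexed
        "subsegments": [
            {
                "name": "DynamoDB",
                "id": new_entity_id(),
                "start_time": start + 0.010,
                "end_time": end - 0.005,
                "namespace": "aws",  # marks a downstream AWS call
            }
        ],
    }


segment = make_segment("checkout-service", new_trace_id(), 1700000000.0, 1700000000.120)
print(json.dumps(segment, indent=2))
```

Every segment that shares the same `trace_id` belongs to one trace, which is what lets X-Ray stitch the per-service segments into a single timeline.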

X-Ray uses trace IDs in HTTP headers to correlate requests across services. The header name is X-Amzn-Trace-Id.

bash
# X-Ray trace ID header example
X-Amzn-Trace-Id: Root=1-5e1b4d3e-fb1234567890abcdef012345;Parent=53995c3f42cd8ad8;Sampled=1

# Root: trace ID (version 1, 8-hex-digit epoch timestamp, 24-hex-digit random identifier)
# Parent: parent segment ID
# Sampled: 1 = send to X-Ray, 0 = do not send
💡

If a service calls a downstream service without propagating the X-Amzn-Trace-Id header, the downstream service starts a new trace with its own root ID. The two traces are not linked, and the service map shows the services as disconnected. Always propagate the header in HTTP clients.
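
The SDK's patched HTTP clients handle propagation for you. For a hand-rolled client, the idea can be sketched as follows: keep the incoming `Root`, replace `Parent` with the ID of the current service's segment, and pass the sampling decision through unchanged. The function names and the example segment ID are assumptions for illustration.

```python
def parse_trace_header(value):
    """Split 'Root=...;Parent=...;Sampled=...' into a dict of its fields."""
    fields = {}
    for part in value.split(";"):
        key, sep, val = part.strip().partition("=")
        if sep:
            fields[key] = val
    return fields


def outgoing_trace_header(incoming, current_segment_id):
    """Header for a downstream call: same Root, Parent set to this service's segment ID."""
    fields = parse_trace_header(incoming)
    parts = [f"Root={fields['Root']}", f"Parent={current_segment_id}"]
    if "Sampled" in fields:
        parts.append(f"Sampled={fields['Sampled']}")  # keep the upstream sampling decision
    return ";".join(parts)


incoming = "Root=1-5e1b4d3e-fb1234567890abcdef012345;Parent=53995c3f42cd8ad8;Sampled=1"
print(outgoing_trace_header(incoming, "a1b2c3d4e5f60718"))
```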

SDK Instrumentation: Automatic and Manual

X-Ray SDKs are available for Java, Python, Go, Node.js, Ruby, and .NET. They provide automatic instrumentation for popular frameworks and manual instrumentation via the capture API.

python
# Python - instrument a Flask app and supported client libraries
from flask import Flask
from aws_xray_sdk.core import xray_recorder, patch_all
from aws_xray_sdk.ext.flask.middleware import XRayMiddleware

app = Flask(__name__)
xray_recorder.configure(service='my-flask-app')
XRayMiddleware(app, xray_recorder)  # one segment per incoming request
patch_all()  # auto-instrument supported libraries (boto3/botocore, requests, ...)

# capture() wraps the function in a subsegment named 'process_order'
@xray_recorder.capture('process_order')
def process_order(order_id, order_data):
    subsegment = xray_recorder.current_subsegment()
    subsegment.put_annotation('order_id', order_id)      # indexed, filterable
    subsegment.put_metadata('order_detail', order_data)  # not indexed
    # ... business logic

javascript
// Node.js - Lambda instrumentation
const AWSXRay = require('aws-xray-sdk-core')
const AWS = AWSXRay.captureAWS(require('aws-sdk'))
// All AWS SDK (v2) client calls are now traced as subsegments

| Integration | What It Auto-Instruments | SDK/Method |
|---|---|---|
| AWS SDK calls | All API calls to AWS services | patch_all() or captureAWS() |
| Outgoing HTTP | Calls to external APIs and services | patch requests/http.client |
| SQL databases | Queries to MySQL, PostgreSQL | patch sqlalchemy/pg8000 |
| Flask/Django | Incoming HTTP requests as segments | XRayMiddleware |
| Express.js | Incoming HTTP requests as segments | AWSXRay.express.openSegment() / closeSegment() |
| Lambda | Function invocation as segment | Enable active tracing in function config |
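
Conceptually, a capture decorator wraps a call in a timed subsegment and flags it if the call raises. The toy version below is only a sketch of that idea in plain Python. It is not the SDK's implementation, and the `recorded` list is a stand-in for the subsegments a real recorder would attach to the current segment.

```python
import functools
import time

recorded = []  # stand-in for subsegments attached to the current segment


def capture(name):
    """Toy decorator: time the wrapped call and record it as a subsegment-like dict."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            entry = {"name": name, "start_time": time.time(), "fault": False}
            try:
                return func(*args, **kwargs)
            except Exception:
                entry["fault"] = True  # flag the failure, then let it propagate
                raise
            finally:
                entry["end_time"] = time.time()
                recorded.append(entry)
        return wrapper
    return decorator


@capture("process_order")
def process_order(order_id):
    if not order_id:
        raise ValueError("missing order id")
    return f"processed {order_id}"


print(process_order("ORD-12345"))
print(recorded[0]["name"], recorded[0]["fault"])
```

Recording in `finally` means a subsegment is closed whether the function returns or raises, which is why failed calls still appear on the trace timeline.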

Sampling Rules: Controlling Trace Volume

X-Ray samples requests to avoid tracing 100% of traffic (which would be expensive and noisy). Sampling rules determine which requests get traced.

| Rule Component | Description | Example |
|---|---|---|
| Fixed rate | Fraction of matching requests sampled after the reservoir is used up | 5% = 0.05 |
| Reservoir | Number of matching requests traced each second before the fixed rate applies | 5 requests/second always traced |
| Priority | Lower number = higher priority when rules overlap | 1 = highest priority |
| Match criteria | Filter by service name, service type, host, HTTP method, URL path, resource ARN | Path=/api/health/* -> 0% (ignore health checks) |

bash
# Custom sampling rule via CLI - trace all POST /api/checkout requests
# ResourceARN is a required field; "*" matches any resource
aws xray create-sampling-rule --sampling-rule '{
  "RuleName": "checkout-100pct",
  "ResourceARN": "*",
  "Priority": 1,
  "FixedRate": 1.0,
  "ReservoirSize": 50,
  "ServiceName": "checkout-service",
  "ServiceType": "*",
  "Host": "*",
  "HTTPMethod": "POST",
  "URLPath": "/api/checkout",
  "Version": 1
}'
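
How overlapping rules resolve can be sketched in a few lines: sort by priority and take the first rule whose criteria all match. This is an approximation that uses `fnmatchcase` for the `*` and `?` wildcards. The `ignore-health` rule is hypothetical, and the `Default` entry assumes the built-in default rule has priority 10000, a reservoir of 1, and a 5% rate.

```python
from fnmatch import fnmatchcase

rules = [
    {"RuleName": "checkout-100pct", "Priority": 1, "FixedRate": 1.0, "ReservoirSize": 50,
     "ServiceName": "checkout-service", "HTTPMethod": "POST", "URLPath": "/api/checkout"},
    {"RuleName": "ignore-health", "Priority": 10, "FixedRate": 0.0, "ReservoirSize": 0,
     "ServiceName": "*", "HTTPMethod": "GET", "URLPath": "/api/health/*"},
    {"RuleName": "Default", "Priority": 10000, "FixedRate": 0.05, "ReservoirSize": 1,
     "ServiceName": "*", "HTTPMethod": "*", "URLPath": "*"},
]


def matches(rule, service, method, path):
    """A rule matches only if every criterion matches (wildcards allowed)."""
    return (fnmatchcase(service, rule["ServiceName"])
            and fnmatchcase(method, rule["HTTPMethod"])
            and fnmatchcase(path, rule["URLPath"]))


def select_rule(rules, service, method, path):
    """Return the matching rule with the lowest priority number, or None."""
    for rule in sorted(rules, key=lambda r: r["Priority"]):
        if matches(rule, service, method, path):
            return rule
    return None


for request in [("checkout-service", "POST", "/api/checkout"),
                ("users-service", "GET", "/api/health/live"),
                ("users-service", "GET", "/api/users/42")]:
    print(request, "->", select_rule(rules, *request)["RuleName"])
```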
💡

The default sampling rule traces the first request each second plus 5% of additional requests. For high-traffic services, the default rule may miss intermittent errors. Create higher-rate rules for critical paths (payments, authentication) and lower-rate rules for health check endpoints.
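
A small simulation shows why the default rule can miss rare failures. It is a simplified model of one service instance: the first `reservoir` requests in each second are always sampled, and the rest are sampled with probability `fixed_rate`. The traffic figure of 1,000 requests per second is an assumption.

```python
import random


def simulate_second(requests, reservoir, fixed_rate, rng):
    """Return how many of `requests` arriving in one second would be sampled."""
    sampled = 0
    for i in range(requests):
        if i < reservoir:
            sampled += 1  # reservoir: first N matching requests each second
        elif rng.random() < fixed_rate:
            sampled += 1  # fixed rate applies to everything after the reservoir
    return sampled


rng = random.Random(42)  # seeded so repeated runs give the same count
total = sum(simulate_second(1000, reservoir=1, fixed_rate=0.05, rng=rng)
            for _ in range(60))
print(f"sampled {total} of 60000 requests in one minute")
```

At this volume roughly 95% of requests leave no trace, so an error that affects one request in a thousand is usually not captured. Raising `fixed_rate` for the critical path changes that without tracing everything else.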

Service Map, Traces Console, and Analytics

The X-Ray console provides several views for analyzing application behavior:

| View | What It Shows | Use Case |
|---|---|---|
| Service Map | Visual graph of all services with latency/error metrics | Identify which service is causing errors |
| Traces | List of individual trace records with timeline | Debug a specific slow or failed request |
| Trace Analytics | Aggregate statistics, percentiles, histograms | Latency trends, error rates over time |
| Insights | ML-detected anomalies and root cause analysis | Proactive issue detection |
| Groups | Filtered subsets of traces with their own service maps | Isolate traces for one customer or feature |
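
The service map is derived from trace data, and the derivation can be sketched in simplified form. Each downstream segment carries a `parent_id` that points at the upstream subsegment that made the call, so resolving that ID to its owning service yields a caller-to-callee edge. The service names and shortened IDs below are made up, and nested subsegments are ignored for brevity.

```python
from collections import Counter


def service_edges(segments):
    """Count caller -> callee edges by resolving each segment's parent_id to its service."""
    owner = {}
    for seg in segments:
        owner[seg["id"]] = seg["name"]
        for sub in seg.get("subsegments", []):
            owner[sub["id"]] = seg["name"]  # a subsegment belongs to its segment's service
    edges = Counter()
    for seg in segments:
        parent = seg.get("parent_id")
        if parent in owner:
            edges[(owner[parent], seg["name"])] += 1
    return edges


segments = [
    {"name": "api-gateway", "id": "a1",
     "subsegments": [{"id": "a2", "name": "orders-service"}]},
    {"name": "orders-service", "id": "b1", "parent_id": "a2",
     "subsegments": [{"id": "b2", "name": "payment-service"}]},
    {"name": "payment-service", "id": "c1", "parent_id": "b2"},
    {"name": "orphan-service", "id": "d1"},  # header not propagated: no parent, no edge
]

for (caller, callee), count in service_edges(segments).items():
    print(f"{caller} -> {callee} ({count})")
```

The `orphan-service` entry shows the effect of a missing trace header: with no parent to resolve, the service appears on the map with no incoming edge.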

Trace filter expressions let you search for specific traces:

bash
# X-Ray filter expressions
# Traces with faults (5xx server errors); use error = true for 4xx client errors
fault = true

# Slow requests over 2 seconds
duration > 2

# Traces touching a specific service
service("users-service")

# Filter by annotation
annotation.order_id = "ORD-12345"

# Combine conditions
service("payment-service") AND fault = true AND duration > 1
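
The same expressions can be used programmatically. The sketch below passes a filter expression to the X-Ray `GetTraceSummaries` API and follows `NextToken` until all pages are read. The client is injected so the function does not depend on boto3 directly. `find_traces` and the 15-minute window are illustrative choices.

```python
from datetime import datetime, timedelta, timezone


def find_traces(client, filter_expression, minutes=15):
    """Collect trace summaries matching a filter expression over the last `minutes`."""
    end = datetime.now(timezone.utc)
    start = end - timedelta(minutes=minutes)
    kwargs = {"StartTime": start, "EndTime": end, "FilterExpression": filter_expression}
    summaries = []
    while True:
        page = client.get_trace_summaries(**kwargs)
        summaries.extend(page.get("TraceSummaries", []))
        token = page.get("NextToken")
        if not token:
            return summaries
        kwargs["NextToken"] = token  # request the next page


# Usage (requires boto3 and AWS credentials):
#   import boto3
#   expr = 'service("payment-service") AND fault = true AND duration > 1'
#   for summary in find_traces(boto3.client("xray"), expr):
#       print(summary["Id"], summary.get("Duration"))
```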

Pricing, Limits, and Integration with CloudWatch

| Dimension | Free Tier | Paid Price |
|---|---|---|
| Traces recorded | 100,000 traces/month | $5.00 per 1 million traces |
| Traces retrieved (console/API) | 1 million per month (shared with scanned) | $0.50 per 1 million traces |
| Traces scanned (analytics) | 1 million per month (shared with retrieved) | $0.50 per 1 million traces scanned |
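
A rough monthly estimate is simple arithmetic once the sampled volume is known. In this sketch the rates are parameters because published prices change. The values in the usage lines ($5.00 per million recorded, $0.50 per million retrieved or scanned, one shared free allowance for retrieval and scanning) are assumptions to check against the current AWS pricing page. The traffic figures are made up.

```python
def monthly_trace_cost(recorded, retrieved_or_scanned,
                       price_recorded_per_million, price_read_per_million,
                       free_recorded=100_000, free_read=1_000_000):
    """Estimate the monthly bill in dollars after subtracting the free tier."""
    billable_recorded = max(0, recorded - free_recorded)
    billable_read = max(0, retrieved_or_scanned - free_read)
    return (billable_recorded * price_recorded_per_million
            + billable_read * price_read_per_million) / 1_000_000


# Assumed workload: 200 requests/second sampled at 5% over a 30-day month
requests_per_second = 200
sample_percent = 5
recorded = requests_per_second * sample_percent * 86_400 * 30 // 100

cost = monthly_trace_cost(recorded, 2_000_000, 5.00, 0.50)
print(f"{recorded} traces recorded, estimated ${cost:.2f}/month")
```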

X-Ray integrates with CloudWatch ServiceLens, which combines X-Ray traces, CloudWatch metrics, and CloudWatch Logs into a unified observability view. The ServiceLens service map is built from X-Ray trace data, so enabling active tracing on Lambda, or instrumenting services on ECS/EKS with the X-Ray SDK, populates it. Enabling CloudWatch Container Insights adds container-level metrics for the ECS/EKS workloads shown on that map.

⚠️

X-Ray retains trace data for 30 days and then deletes it automatically. If you need longer retention for compliance or trend analysis, retrieve traces with the X-Ray APIs (GetTraceSummaries, BatchGetTraces) and store them yourself, for example in S3, or use CloudWatch Logs Insights on your application logs instead.
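
One way to keep traces beyond the retention window is sketched below. It fetches full traces with the `BatchGetTraces` API in small batches and appends them to a JSON Lines file that can then be uploaded to S3. The batch size of 5 is an assumption about the per-call limit. `archive_traces`, the file name, and the bucket in the usage comment are hypothetical.

```python
import json


def archive_traces(client, trace_ids, out_path, batch_size=5):
    """Fetch full traces in small batches and append them to a JSON Lines file."""
    written = 0
    with open(out_path, "a", encoding="utf-8") as out:
        for i in range(0, len(trace_ids), batch_size):
            kwargs = {"TraceIds": trace_ids[i:i + batch_size]}
            while True:
                page = client.batch_get_traces(**kwargs)
                for trace in page.get("Traces", []):
                    out.write(json.dumps(trace, default=str) + "\n")
                    written += 1
                token = page.get("NextToken")
                if not token:
                    break
                kwargs["NextToken"] = token  # more segments for this batch
    return written


# Usage (requires boto3 and AWS credentials; bucket name is hypothetical):
#   import boto3
#   ids = ["1-5e1b4d3e-fb1234567890abcdef012345"]
#   archive_traces(boto3.client("xray"), ids, "traces.jsonl")
#   boto3.client("s3").upload_file("traces.jsonl", "my-trace-archive", "traces.jsonl")
```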

🎯

Interview Focus Points

  1. What is a trace vs a segment vs a subsegment in X-Ray?
  2. How does X-Ray correlate requests across multiple services?
  3. What is sampling in X-Ray and why is it important?
  4. How do you add custom annotations and metadata to X-Ray traces?
  5. How does X-Ray active tracing work with Lambda functions?
  6. What is a service map and what information does it show?
  7. How would you debug a latency issue using X-Ray?
  8. What is the X-Ray daemon and when is it needed?
  9. How does X-Ray integrate with CloudWatch ServiceLens?
  10. What are the data retention limits for X-Ray traces?