AWS Analytics & Big Data
OpenSearch Service
Managed OpenSearch and Elasticsearch clusters for log analytics and full-text search
Amazon OpenSearch Service is a managed service for deploying, operating, and scaling OpenSearch (the open-source fork of Elasticsearch) and Kibana/OpenSearch Dashboards clusters on AWS. It is the go-to service for log analytics, full-text search, real-time application monitoring, and security analytics (SIEM). OpenSearch Service handles cluster provisioning, patching, backups, and cross-cluster replication so you focus on indexing and querying data.
Cluster Architecture - Node Types and Roles
An OpenSearch cluster is made up of nodes with different roles. Properly sizing and separating these roles is critical for production performance.
| Node Type | Role | Recommendation |
|---|---|---|
| Data nodes | Store shards and serve queries | Use storage-optimized (OR1/I3) for hot data |
| Dedicated master nodes | Cluster state management only | Use 3 masters for production (quorum) |
| UltraWarm nodes | Read-only warm tier backed by S3 | Cost-effective for 7-90 day data |
| Cold storage | Long-term retention in S3 (query on demand) | For compliance/audit data |
| Coordinator nodes (OR1) | Route queries, aggregate results | Add when query concurrency is high |
Never skip dedicated master nodes in production. Without them, data nodes also handle cluster state, so under heavy indexing load the cluster can become unstable and, in the worst case, suffer split-brain. Use 3 dedicated master nodes with Multi-AZ enabled.
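As a rough illustration of why 3 masters is the usual recommendation, the sketch below computes a simple majority quorum for a given number of master-eligible nodes. The function names are hypothetical, and the model is a plain majority vote, not a description of the service's internal election logic.

```python
def quorum(master_eligible: int) -> int:
    """Smallest strict majority of master-eligible nodes."""
    return master_eligible // 2 + 1


def tolerated_failures(master_eligible: int) -> int:
    """How many master-eligible nodes can be lost while a majority remains."""
    return master_eligible - quorum(master_eligible)


if __name__ == "__main__":
    for n in (1, 2, 3):
        print(f"masters={n} quorum={quorum(n)} tolerated_failures={tolerated_failures(n)}")
```

With 2 masters the quorum is also 2, so losing either node stalls the cluster. 3 is the smallest count that survives the loss of one node.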
OpenSearch Service uses a primary/replica shard model. Each primary shard can have one or more replica shards for redundancy. The number of primary shards is fixed at index creation and can only be changed by reindexing or by shrinking/splitting into a new index, so plan it carefully. Typical recommendation: keep shard size between 10 and 50 GB.
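The shard-size guideline can be turned into a back-of-the-envelope calculation. The sketch below is a minimal example that assumes a 30 GB target shard size (a midpoint picked for illustration from the 10-50 GB range) and ignores indexing overhead and growth headroom. The function names are hypothetical.

```python
import math


def primary_shard_count(index_size_gb: float, target_shard_gb: float = 30.0) -> int:
    """Number of primary shards needed to keep each shard near the target size."""
    if index_size_gb <= 0 or target_shard_gb <= 0:
        raise ValueError("sizes must be positive")
    return max(1, math.ceil(index_size_gb / target_shard_gb))


def total_shards(primaries: int, replicas: int = 1) -> int:
    """Total shards in the cluster for one index: primaries plus all replica copies."""
    return primaries * (1 + replicas)


if __name__ == "__main__":
    primaries = primary_shard_count(400)   # a 400 GB index (assumed size)
    print(primaries, total_shards(primaries, replicas=1))
```

For a 400 GB index this gives 14 primaries of about 29 GB each, or 28 shards in total with one replica.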
Index State Management - Lifecycle, Rollover, and Tiering
For log workloads, indices grow continuously. Index State Management (ISM) automates the lifecycle: rollover when an index hits a size or age threshold, move to UltraWarm, then cold storage, then delete.
| ISM Action | When to Use |
|---|---|
| Rollover | Create new index when current hits X GB or Y days (use with index aliases) |
| Force merge | Reduce segment count on read-only indices to save memory |
| Move to UltraWarm | When hot queries are no longer expected (typically 7-30 days) |
| Move to cold | Infrequently queried data (30-365 days) |
| Delete | When retention period expires |
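The rollover action in the table only works when writes go through an alias and the index knows which alias to roll over. The sketch below builds a request body for a first ("bootstrap") index. The alias name `logs` and the helper name are assumptions for illustration. The body would be sent as `PUT /<index_name>` with whatever HTTP client and authentication the domain uses.

```python
import json
from typing import Dict, Tuple


def bootstrap_index_request(alias: str) -> Tuple[str, Dict]:
    """Return (index_name, body) for the first index behind a rollover alias.

    The index name ends in a numeric suffix so rollover can increment it.
    """
    index_name = f"{alias}-000001"
    body = {
        "settings": {
            # Tells ISM which alias the rollover action should operate on.
            "plugins.index_state_management.rollover_alias": alias,
        },
        "aliases": {
            # Writes sent to the alias land in this index until the next rollover.
            alias: {"is_write_index": True},
        },
    }
    return index_name, body


if __name__ == "__main__":
    name, body = bootstrap_index_request("logs")  # "logs" is a hypothetical alias
    print(f"PUT /{name}")
    print(json.dumps(body, indent=2))
```

Applications then index into the alias rather than a concrete index name, so each rollover is transparent to writers.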
The following ISM policy rolls over at 50 GB or 7 days, moves the index to UltraWarm at 30 days, and deletes it at 90 days (PUT to /_plugins/_ism/policies/log-policy):
```json
{
  "policy": {
    "default_state": "hot",
    "states": [
      {
        "name": "hot",
        "actions": [{
          "rollover": {
            "min_index_age": "7d",
            "min_size": "50gb"
          }
        }],
        "transitions": [{"state_name": "warm", "conditions": {"min_index_age": "30d"}}]
      },
      {
        "name": "warm",
        "actions": [{"warm_migration": {}}],
        "transitions": [{"state_name": "delete", "conditions": {"min_index_age": "90d"}}]
      },
      {
        "name": "delete",
        "actions": [{"delete": {}}],
        "transitions": []
      }
    ]
  }
}
```
Ingestion Patterns - Logstash, Fluent Bit, Kinesis, and Direct API
OpenSearch accepts data via its HTTP REST API. Several ingestion patterns are common in production:
| Ingestion Method | Best For | Notes |
|---|---|---|
| OpenSearch Ingestion (managed) | Logs from S3, Kinesis, CloudWatch | Fully managed pipeline - replaces self-hosted Logstash |
| Amazon Data Firehose (formerly Kinesis Data Firehose) | High-volume event streams | Built-in buffering and retry; Firehose handles backpressure |
| Fluent Bit DaemonSet (K8s) | Container logs from EKS | Lightweight, low CPU; plugin for OpenSearch HTTP |
| Logstash | Complex transformations before index | More resource-intensive than Fluent Bit |
| Direct _bulk API | Custom applications, batch loaders | Batch 500-5000 docs per request for throughput |
OpenSearch Ingestion (the managed pipeline service) removes the need to run self-hosted Logstash or Fluentd on EC2. It scales automatically and integrates with IAM for authentication. Prefer it for greenfield deployments.
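For the direct `_bulk` row in the table, a minimal sketch of batching documents into newline-delimited JSON bodies might look like the following. The helper names, the index name `logs`, and the batch size of 500 (the low end of the range above) are assumptions for illustration. Sending the body (a `POST /_bulk` with `Content-Type: application/x-ndjson`) and request signing are left out.

```python
import json
from typing import Dict, Iterable, Iterator, List


def chunked(docs: Iterable[Dict], size: int) -> Iterator[List[Dict]]:
    """Yield successive lists of at most `size` documents."""
    batch: List[Dict] = []
    for doc in docs:
        batch.append(doc)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch


def bulk_body(index: str, docs: Iterable[Dict]) -> str:
    """Build a _bulk payload: one action line plus one source line per document."""
    lines: List[str] = []
    for doc in docs:
        lines.append(json.dumps({"index": {"_index": index}}))
        lines.append(json.dumps(doc))
    # The bulk API requires the payload to end with a newline.
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    events = [{"message": f"event {i}"} for i in range(1200)]
    for batch in chunked(events, 500):
        payload = bulk_body("logs", batch)
        print(len(batch), "docs,", len(payload), "bytes")
```

In practice the right batch size depends on document size, so it is worth measuring payload bytes as well as document counts.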
Pricing - Instances, Storage, and UltraWarm
| Component | Cost Driver | Optimization |
|---|---|---|
| Data node instances | Instance type x hours | Reserved instances for 30-60% savings on baseline nodes |
| EBS storage (gp3) | $0.135/GB-month | Use gp3 - cheaper and faster than gp2 |
| UltraWarm storage | $0.024/GB-month | 5x cheaper than EBS for warm data |
| Cold storage | $0.01/GB-month | For compliance/audit data rarely queried |
| Data transfer out | Standard AWS rates | Keep consumers in same region |
| Dedicated master nodes | Instance type x hours | Required for production; size down from data nodes |
UltraWarm storage is roughly 5x cheaper than hot EBS storage, but queries are slower (seconds vs milliseconds). For log data older than 30 days that is only queried during incidents, UltraWarm cuts storage cost substantially while keeping the data searchable.
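To make the storage comparison concrete, the sketch below applies the per-GB prices from the table to an assumed 10,000 GB of warm-eligible log data. It covers storage only. Instance hours for data and UltraWarm nodes, replica copies, and regional price differences are deliberately left out, so treat the result as an estimate.

```python
# Per-GB monthly prices as listed in the pricing table above.
EBS_GP3_PER_GB_MONTH = 0.135
ULTRAWARM_PER_GB_MONTH = 0.024


def monthly_storage_cost(size_gb: float, price_per_gb_month: float) -> float:
    """Storage-only monthly cost in USD."""
    return size_gb * price_per_gb_month


if __name__ == "__main__":
    size_gb = 10_000  # assumed volume of data older than 30 days
    hot = monthly_storage_cost(size_gb, EBS_GP3_PER_GB_MONTH)
    warm = monthly_storage_cost(size_gb, ULTRAWARM_PER_GB_MONTH)
    print(f"hot EBS:   ${hot:,.2f}/month")
    print(f"UltraWarm: ${warm:,.2f}/month")
    print(f"ratio:     {hot / warm:.1f}x")
```

At these list prices the ratio works out to about 5.6x, which is where the "roughly 5x" figure comes from.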
OpenSearch vs CloudWatch Logs Insights for Log Analytics
| Dimension | OpenSearch Service | CloudWatch Logs Insights |
|---|---|---|
| Query language | Lucene + SQL + PPL | CloudWatch Insights query language |
| Full-text search | Excellent - inverted index | Limited - pattern matching only |
| Cost for high volume | Lower at scale with UltraWarm | Expensive - charged per GB ingested + scanned |
| Dashboards | OpenSearch Dashboards (Kibana fork) | CloudWatch Dashboards (simpler) |
| Alerting | Built-in alerting + anomaly detection | CloudWatch Alarms + Insights scheduled queries |
| Setup complexity | Higher - cluster sizing, ISM | Zero - fully serverless |
| AWS service logs | Requires Firehose/delivery pipeline | Native - one-click subscriptions |
For AWS service logs that can be delivered to CloudWatch Logs (for example VPC Flow Logs and CloudTrail), CloudWatch Logs is simpler to set up. For custom application logs at high volume, or when you need full-text search and complex aggregations, OpenSearch is typically cheaper at scale and more capable.
Interview Focus Points
1. Why do you need dedicated master nodes in an OpenSearch production cluster, and what happens without them?
2. Explain the UltraWarm tier - how does it work and what workloads justify it?
3. How does Index State Management work and how would you configure a log lifecycle policy?
4. Compare OpenSearch Service to CloudWatch Logs Insights for log analytics - when do you choose each?
5. What is the impact of shard count on OpenSearch performance and how do you size shards correctly?
6. How would you ingest Kubernetes application logs from an EKS cluster into OpenSearch?
7. How does OpenSearch fine-grained access control work with IAM and internal users?
8. A search query that used to return in 100ms now takes 5 seconds - walk me through diagnosing the issue.