OpenSearch Service

Managed OpenSearch and Elasticsearch clusters for log analytics and full-text search

Amazon OpenSearch Service is a managed service for deploying, operating, and scaling OpenSearch (the open-source fork of Elasticsearch) clusters, together with OpenSearch Dashboards (the Kibana fork), on AWS. It is the go-to service for log analytics, full-text search, real-time application monitoring, and security analytics (SIEM). OpenSearch Service handles cluster provisioning, patching, and backups, and supports cross-cluster replication, so you can focus on indexing and querying data.

Cluster Architecture - Node Types and Roles

An OpenSearch cluster is made up of nodes with different roles. Properly sizing and separating these roles is critical for production performance.

Node Type	Role	Recommendation
Data nodes	Store shards and serve queries	Use storage-optimized (I3) or OpenSearch-optimized (OR1) instances for hot data
Dedicated master nodes	Cluster state management only	Use 3 masters for production (quorum)
UltraWarm nodes	Read-only warm tier backed by S3	Cost-effective for 7-90 day data
Cold storage	Long-term retention in S3 (query on demand)	For compliance/audit data
Dedicated coordinator nodes	Route queries, aggregate results	Add when query concurrency is high

⚠️

Never skip dedicated master nodes in production. Without them, data nodes also handle cluster state, so under heavy indexing load the cluster can become unstable and is more exposed to split-brain conditions. Use 3 dedicated master nodes with Multi-AZ enabled.
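The recommendation of 3 dedicated masters follows from majority-quorum arithmetic. The sketch below is illustrative only; the function names are hypothetical and it models nothing beyond the majority rule.

```python
def quorum(master_nodes: int) -> int:
    """Smallest number of master-eligible nodes that forms a majority."""
    if master_nodes < 1:
        raise ValueError("need at least one master-eligible node")
    return master_nodes // 2 + 1


def tolerated_failures(master_nodes: int) -> int:
    """How many masters can be lost while a majority is still reachable."""
    return master_nodes - quorum(master_nodes)


for n in (1, 2, 3, 4, 5):
    print(f"{n} masters: quorum={quorum(n)}, tolerated failures={tolerated_failures(n)}")
```

Two masters tolerate no failures and four tolerate no more than three do, which is why an odd count (3 in most cases) is the usual choice.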

💡

OpenSearch Service uses a primary/replica shard model. Each primary shard has one or more replica shards for redundancy. The number of primary shards is fixed at index creation, and changing it later means reindexing or using the shrink/split APIs, so plan it carefully. Typical recommendation: keep shard size between 10 and 50 GB.
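As a rough sizing aid, the 10-50 GB guideline can be turned into a primary shard count. This is a minimal sketch: the 30 GB default target and the function names are assumptions chosen for illustration, not service defaults.

```python
import math


def primary_shard_count(index_size_gb: float, target_shard_gb: float = 30.0) -> int:
    """Estimate primary shards so each lands near target_shard_gb.

    target_shard_gb defaults to 30 GB, an assumed midpoint of the
    10-50 GB range mentioned in the text.
    """
    if index_size_gb <= 0 or target_shard_gb <= 0:
        raise ValueError("sizes must be positive")
    return max(1, math.ceil(index_size_gb / target_shard_gb))


def resulting_shard_size_gb(index_size_gb: float, primaries: int) -> float:
    """Average primary shard size for a given shard count."""
    return index_size_gb / primaries


size_gb = 300  # hypothetical index size
shards = primary_shard_count(size_gb)
print(shards, "primaries of about", resulting_shard_size_gb(size_gb, shards), "GB each")
```

For rolling log indices, apply the estimate to the size of one index at rollover rather than to total retained data.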

Index State Management - Lifecycle, Rollover, and Tiering

For log workloads, indices grow continuously. Index State Management (ISM) automates the lifecycle: rollover when an index hits a size or age threshold, move to UltraWarm, then cold storage, then delete.

ISM Action	When to Use
Rollover	Create new index when current hits X GB or Y days (use with index aliases)
Force merge	Reduce segment count on read-only indices to save memory
Move to UltraWarm	When hot queries are no longer expected (typically 7-30 days)
Move to cold	Infrequently queried data (30-365 days)
Delete	When retention period expires

bash

# Create an ISM policy that rolls over at 50GB or 7 days,
# moves to UltraWarm at 30 days, deletes at 90 days
# (PUT to /_plugins/_ism/policies/log-policy)
{
  "policy": {
    "default_state": "hot",
    "states": [
      {
        "name": "hot",
        "actions": [{
          "rollover": {
            "min_index_age": "7d",
            "min_size": "50gb"
          }
        }],
        "transitions": [{"state_name": "warm", "conditions": {"min_index_age": "30d"}}]
      },
      {
        "name": "warm",
        "actions": [{"warm_migration": {}}],
        "transitions": [{"state_name": "delete", "conditions": {"min_index_age": "90d"}}]
      },
      {
        "name": "delete",
        "actions": [{"delete": {}}],
        "transitions": []
      }
    ]
  }
}
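The policy above hard-codes its thresholds. If several log types need the same shape with different retention, the body can be generated instead. The helper below is a hypothetical sketch that only builds and prints the JSON; sending it to the domain (with signed requests or basic auth) is left out.

```python
import json


def build_log_policy(rollover_age: str = "7d",
                     rollover_size: str = "50gb",
                     warm_after: str = "30d",
                     delete_after: str = "90d") -> dict:
    """Return an ISM policy body with hot -> warm -> delete states.

    Hypothetical helper: it mirrors the structure of the policy shown
    above and does not validate the threshold strings.
    """
    return {
        "policy": {
            "default_state": "hot",
            "states": [
                {
                    "name": "hot",
                    # Rollover fires when either condition is met.
                    "actions": [{"rollover": {"min_index_age": rollover_age,
                                              "min_size": rollover_size}}],
                    "transitions": [{"state_name": "warm",
                                     "conditions": {"min_index_age": warm_after}}],
                },
                {
                    "name": "warm",
                    "actions": [{"warm_migration": {}}],
                    "transitions": [{"state_name": "delete",
                                     "conditions": {"min_index_age": delete_after}}],
                },
                {"name": "delete", "actions": [{"delete": {}}], "transitions": []},
            ],
        }
    }


# Body for PUT /_plugins/_ism/policies/log-policy
print(json.dumps(build_log_policy(), indent=2))
```

Calling `build_log_policy(delete_after="180d")` would produce the same policy with a longer retention window.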

Ingestion Patterns - Logstash, Fluent Bit, Kinesis, and Direct API

OpenSearch accepts data via its HTTP REST API. Several ingestion patterns are common in production:

Ingestion Method	Best For	Notes
OpenSearch Ingestion (managed)	Logs from S3, Kinesis, CloudWatch	Fully managed pipeline - replaces self-hosted Logstash
Amazon Data Firehose (formerly Kinesis Data Firehose)	High-volume event streams	Built-in buffering and retry; Firehose handles backpressure
Fluent Bit DaemonSet (K8s)	Container logs from EKS	Lightweight, low CPU; plugin for OpenSearch HTTP
Logstash	Complex transformations before index	More resource-intensive than Fluent Bit
Direct _bulk API	Custom applications, batch loaders	Batch 500-5000 docs per request for throughput

💡

OpenSearch Ingestion (the managed pipeline service) removes the need to run self-hosted Logstash or Fluentd on EC2. It scales automatically and integrates with IAM for authentication. Use it for greenfield deployments.
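For the direct _bulk API row in the table above, batching comes down to building newline-delimited JSON bodies of bounded size. The sketch below only constructs the request bodies; the index name `logs-write` and the function name are assumptions, and the HTTP call and authentication are omitted.

```python
import json
from typing import Iterable, Iterator


def bulk_bodies(docs: Iterable[dict], index: str, batch_size: int = 1000) -> Iterator[str]:
    """Yield NDJSON bodies for POST /_bulk with at most batch_size docs each.

    Each document becomes two lines: an action line and the source line.
    The body must end with a trailing newline.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    lines = []
    count = 0
    for doc in docs:
        lines.append(json.dumps({"index": {"_index": index}}))
        lines.append(json.dumps(doc))
        count += 1
        if count == batch_size:
            yield "\n".join(lines) + "\n"
            lines, count = [], 0
    if lines:  # flush the final partial batch
        yield "\n".join(lines) + "\n"


# Hypothetical usage: 2,500 small documents split into batches of 1,000.
events = ({"message": f"event {i}"} for i in range(2500))
for body in bulk_bodies(events, index="logs-write"):
    print(len(body.splitlines()) // 2, "docs in this request")
```

The 1,000-document default sits inside the 500-5000 range from the table; in practice the right batch size also depends on document size, so it is worth tuning against the payload size of each request.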

Pricing - Instances, Storage, and UltraWarm

Component	Cost Driver	Optimization
Data node instances	Instance type x hours	Reserved instances for 30-60% savings on baseline nodes
EBS storage (gp3)	$0.135/GB-month	Use gp3 - cheaper and faster than gp2
UltraWarm storage	$0.024/GB-month	Roughly 5x cheaper than EBS for warm data
Cold storage	$0.01/GB-month	For compliance/audit data rarely queried
Data transfer out	Standard AWS rates	Keep consumers in same region
Dedicated master nodes	Instance type x hours	Required for production; size down from data nodes

💡

UltraWarm storage costs roughly one-fifth as much as hot EBS storage, but queries are slower (seconds vs milliseconds). For log data older than 30 days that is only queried during incidents, UltraWarm cuts storage cost substantially while keeping the data searchable.
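The storage prices in the table make the tiering trade-off easy to estimate. This sketch covers storage only, using the per-GB figures listed above; it ignores instance hours (including UltraWarm nodes), replicas, and data transfer, and the 10 TB workload is a made-up example.

```python
# Per-GB monthly storage prices, copied from the pricing table above.
HOT_EBS_GB_MONTH = 0.135
ULTRAWARM_GB_MONTH = 0.024
COLD_GB_MONTH = 0.01


def monthly_storage_cost(hot_gb: float, warm_gb: float = 0.0, cold_gb: float = 0.0) -> float:
    """Storage-only monthly cost in USD for data spread across tiers."""
    return (hot_gb * HOT_EBS_GB_MONTH
            + warm_gb * ULTRAWARM_GB_MONTH
            + cold_gb * COLD_GB_MONTH)


# Hypothetical workload: 10 TB of logs, all hot vs. 2 TB hot + 8 TB UltraWarm.
all_hot = monthly_storage_cost(hot_gb=10_000)
tiered = monthly_storage_cost(hot_gb=2_000, warm_gb=8_000)
print(f"all hot: ${all_hot:.2f}/month")
print(f"tiered:  ${tiered:.2f}/month")
print(f"saving:  ${all_hot - tiered:.2f}/month")
```

Because UltraWarm nodes are billed per instance hour on top of storage, the real saving is smaller than the storage-only figure for small warm datasets.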

OpenSearch vs CloudWatch Logs Insights for Log Analytics

Dimension	OpenSearch Service	CloudWatch Logs Insights
Query language	Query DSL (Lucene-based) + SQL + PPL	CloudWatch Insights query language
Full-text search	Excellent - inverted index	Limited - pattern matching only
Cost for high volume	Lower at scale with UltraWarm	Expensive - charged per GB ingested + scanned
Dashboards	OpenSearch Dashboards (Kibana fork)	CloudWatch Dashboards (simpler)
Alerting	Built-in alerting + anomaly detection	CloudWatch Alarms + Insights scheduled queries
Setup complexity	Higher - cluster sizing, ISM	Zero - fully serverless
AWS service logs	Requires Firehose/delivery pipeline	Native - one-click subscriptions

💡

For AWS service logs (VPC Flow Logs, CloudTrail, ALB access logs), CloudWatch Logs is simpler to set up. For custom application logs at high volume, or when you need full-text search and complex aggregations, OpenSearch is typically cheaper at scale and more capable.

🎯

Interview Focus Points

1. Why do you need dedicated master nodes in an OpenSearch production cluster, and what happens without them?
2. Explain the UltraWarm tier - how does it work and what workloads justify it?
3. How does Index State Management work and how would you configure a log lifecycle policy?
4. Compare OpenSearch Service to CloudWatch Logs Insights for log analytics - when do you choose each?
5. What is the impact of shard count on OpenSearch performance and how do you size shards correctly?
6. How would you ingest Kubernetes application logs from an EKS cluster into OpenSearch?
7. How does OpenSearch fine-grained access control work with IAM and internal users?
8. A search query that used to return in 100ms now takes 5 seconds - walk me through diagnosing the issue.