AWS Database
Neptune
Fully managed graph database for building applications with highly connected datasets
Amazon Neptune is a fully managed graph database service that supports two graph models: the Property Graph model (queried with Apache TinkerPop Gremlin and openCypher) and the RDF model (queried with SPARQL). It is purpose-built for workloads where relationships between data are as important as the data itself - social networks, fraud detection, knowledge graphs, recommendation engines, and network topology analysis. Neptune can store billions of relationships and is designed to query them with latency in the milliseconds.
Graph Models and Query Languages
Neptune supports two fundamentally different graph paradigms. Most applications use the Property Graph model because it maps naturally to application data and Gremlin/openCypher are more developer-friendly than SPARQL.
| Model | Query Language | Data Structure | Best For |
|---|---|---|---|
| Property Graph | Gremlin (TinkerPop) or openCypher | Vertices (nodes) and Edges with properties | Social networks, fraud detection, recommendations |
| RDF (Resource Description Framework) | SPARQL | Triples: subject-predicate-object | Knowledge graphs, semantic web, linked data |
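To make the two rows above concrete, here is a minimal Python sketch that stores one FOLLOWS relationship both ways. It uses plain in-memory structures, not a Neptune client. The `Vertex` and `Edge` classes, the `ex:` prefix, and the `to_triples` helper are illustrative assumptions. Edge properties such as `since` have no direct equivalent in a plain triple, so the sketch simply drops them.

```python
from dataclasses import dataclass, field


@dataclass
class Vertex:
    id: str
    label: str
    properties: dict = field(default_factory=dict)


@dataclass
class Edge:
    id: str
    label: str
    out_v: str  # id of the source vertex
    in_v: str   # id of the target vertex
    properties: dict = field(default_factory=dict)


# Property graph: labeled vertices and edges, each carrying key/value properties.
alice = Vertex("u1", "User", {"name": "alice"})
bob = Vertex("u2", "User", {"name": "bob"})
follows = Edge("e1", "FOLLOWS", out_v="u2", in_v="u1", properties={"since": 2021})


def to_triples(vertices, edges, prefix="ex:"):
    """Flatten a property graph into (subject, predicate, object) triples.

    Hypothetical helper: edge properties are dropped because a plain
    triple has no slot for them.
    """
    triples = []
    for v in vertices:
        triples.append((prefix + v.id, "rdf:type", prefix + v.label))
        for key, value in v.properties.items():
            triples.append((prefix + v.id, prefix + key, value))
    for e in edges:
        triples.append((prefix + e.out_v, prefix + e.label, prefix + e.in_v))
    return triples


triples = to_triples([alice, bob], [follows])
for triple in triples:
    print(triple)
```

The property graph keeps `since` on the edge itself, which is one reason application data often maps more directly onto that model.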
// Gremlin example: find all users who follow user "alice"
// and also follow user "bob"
g.V().has("User", "name", "alice")
.in("FOLLOWS")
.where(out("FOLLOWS").has("User", "name", "bob"))
.values("name")
// openCypher example: same query
MATCH (alice:User {name: "alice"})<-[:FOLLOWS]-(u:User)
-[:FOLLOWS]->(bob:User {name: "bob"})
RETURN u.name
# SPARQL example: find all papers citing paper X
# (the example.org namespace is a placeholder for your own prefix)
PREFIX : <http://example.org/>
SELECT ?paper ?title
WHERE {
  ?paper :cites :PaperX .
  ?paper :title ?title .
}
Neptune Cluster Architecture and Storage
Neptune uses the same distributed storage architecture as Amazon Aurora: six copies of data across three Availability Zones, automatic storage growth up to 128 TB, and a single cluster volume shared by the primary instance and up to 15 read replicas.
| Feature | Detail |
|---|---|
| Storage | Distributed cluster volume; 6 replicas across 3 AZs; auto-grows in 10 GB chunks |
| Max storage | 128 TB |
| Read replicas | Up to 15; share cluster volume (near-zero replica lag) |
| Failover | Automatic Multi-AZ failover in < 30 seconds |
| Instance types | r5, r6g, x2g families; memory-optimized for large graphs |
| Neptune Serverless | Auto-scales NCUs (Neptune Capacity Units); good for dev/variable workloads |
Graph databases are memory-intensive because traversals require holding graph structure in memory. Choose r5 or r6g (memory-optimized) instance types for production Neptune clusters and allocate enough memory to hold your working set.
Use Cases: When to Choose Neptune Over Relational or Document DBs
Graph databases excel when deep or complex relationship traversal is the core operation. Relational databases struggle with deep multi-hop joins (joining five or more tables), whereas the equivalent traversal is straightforward to express in a graph model.
| Use Case | Why Graph Is Better | Example Query |
|---|---|---|
| Fraud detection | Find shared devices/IPs/cards across account networks in real time | g.V(account).repeat(out()).times(4).has("flagged", true) |
| Social network | Friends of friends, mutual connections, influence scoring | MATCH (u:User)-[:FRIENDS*2..3]->(recommendation) |
| Recommendation engine | Collaborative filtering via relationship traversal | Users who bought X also bought Y |
| Knowledge graph | Entity relationships across domains | SPARQL over RDF triples |
| Network topology | Shortest path, connected components, dependency analysis | shortestPath() from source to target |
| Identity graph | Link user identities across devices and channels | Traverse SAME_AS edges to canonical identity |
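The fraud-detection row above uses `repeat(out()).times(4)`, that is, "follow outgoing edges exactly four times, then filter". A rough Python sketch of that idea over an in-memory adjacency list follows. The graph, the vertex names, and the `k_hop` helper are made up for illustration. Unlike Gremlin, which keeps one traverser per path, the sketch collapses duplicates into a set.

```python
def k_hop(adjacency, start, hops):
    """Return the set of vertices reached after exactly `hops` outgoing steps."""
    frontier = {start}
    for _ in range(hops):
        # Expand every vertex in the current frontier by one outgoing step.
        frontier = {nbr for v in frontier for nbr in adjacency.get(v, ())}
    return frontier


# Hypothetical directed graph: accounts linked through a shared device and card.
adjacency = {
    "acct1": ["device1"],
    "device1": ["acct2"],
    "acct2": ["card9"],
    "card9": ["acct3"],
    "acct3": [],
}
flagged = {"acct3"}

# Rough equivalent of g.V("acct1").repeat(out()).times(4).has("flagged", true)
suspicious = k_hop(adjacency, "acct1", 4) & flagged
print(suspicious)
```

Each extra hop here is one more loop iteration, whereas the relational version of the same question needs one more self-join per hop.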
Do not use Neptune as a general-purpose database. It has no support for ad-hoc SQL, limited aggregation capabilities compared to OLAP databases, and is significantly more expensive than relational databases for data that is not relationship-centric. Use it only when traversal queries are the dominant access pattern.
Data Loading: Neptune Loader and Streams
Neptune provides two main mechanisms for moving data in and out: the Neptune Loader API, which bulk-ingests files (such as CSV) from S3, and Neptune Streams, which exposes a change log of graph mutations for change data capture (CDC).
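Before calling the Loader, the data has to be staged in S3 in the Loader's CSV layout. As a minimal sketch, the Python below writes a vertex file with the `~id, ~label, name:String, age:Int` header used in this section, plus an edge file with the `~id, ~from, ~to, ~label` header that the Gremlin load format uses for edges. The helper names and the sample rows are assumptions for illustration. Uploading the result to S3 is left out.

```python
import csv
import io


def vertices_csv(rows):
    """Render vertex dicts as Loader-style CSV text."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    # System columns start with "~"; property columns carry a type suffix.
    writer.writerow(["~id", "~label", "name:String", "age:Int"])
    for row in rows:
        writer.writerow([row["id"], row["label"], row["name"], row["age"]])
    return buf.getvalue()


def edges_csv(rows):
    """Render edge dicts as Loader-style CSV text."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(["~id", "~from", "~to", "~label"])
    for row in rows:
        writer.writerow([row["id"], row["from"], row["to"], row["label"]])
    return buf.getvalue()


# Hypothetical sample data.
users = [
    {"id": "u1", "label": "User", "name": "alice", "age": 30},
    {"id": "u2", "label": "User", "name": "bob", "age": 41},
]
follows = [{"id": "e1", "from": "u2", "to": "u1", "label": "FOLLOWS"}]

print(vertices_csv(users))
print(edges_csv(follows))
```

Using the `csv` module rather than string concatenation means values containing commas or quotes are escaped correctly.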
# Bulk load vertices from S3 CSV
# vertex.csv format: ~id, ~label, name:String, age:Int
curl -X POST https://your-cluster.neptune.amazonaws.com:8182/loader \
-H 'Content-Type: application/json' \
-d '{
"source": "s3://my-bucket/graph-data/vertices.csv",
"format": "csv",
"iamRoleArn": "arn:aws:iam::123456789012:role/NeptuneLoadRole",
"region": "us-east-1",
"failOnError": "FALSE",
"parallelism": "MEDIUM"
}'
# Check load status
curl https://your-cluster.neptune.amazonaws.com:8182/loader/<loadId>
The Neptune Loader reads from S3 in parallel and is significantly faster than inserting vertices/edges one at a time via Gremlin. For initial data loads or large batch updates always use the Loader API with S3 as the staging area.
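This section shows the Loader but not the change-capture side, so here is a hedged sketch of how a Neptune Streams consumer might page through changes. The `/propertygraph/stream` path, the `iteratorType` values, and the `lastEventId`/`records` fields follow my reading of the Neptune Streams REST documentation and should be verified against the current docs. The `sample_page` response is hand-written, and the HTTP call and IAM request signing are deliberately omitted.

```python
from urllib.parse import urlencode


def stream_url(endpoint, checkpoint=None, limit=100):
    """Build a polling URL; `checkpoint` is a (commitNum, opNum) tuple or None."""
    params = {"limit": limit}
    if checkpoint is None:
        # No checkpoint yet: start from the oldest retained change.
        params["iteratorType"] = "TRIM_HORIZON"
    else:
        commit_num, op_num = checkpoint
        params.update(commitNum=commit_num, opNum=op_num,
                      iteratorType="AFTER_SEQUENCE_NUMBER")
    return f"https://{endpoint}:8182/propertygraph/stream?{urlencode(params)}"


def apply_page(page, handle):
    """Pass every change record to `handle` and return the next checkpoint."""
    for record in page["records"]:
        handle(record["op"], record["data"])
    last = page["lastEventId"]
    return (last["commitNum"], last["opNum"])


# Hand-written sample response (assumed shape), standing in for a real HTTP call.
sample_page = {
    "lastEventId": {"commitNum": 7, "opNum": 2},
    "records": [
        {"eventId": {"commitNum": 7, "opNum": 1}, "op": "ADD",
         "data": {"id": "u1", "type": "vl", "key": "label",
                  "value": {"value": "User", "dataType": "String"}}},
        {"eventId": {"commitNum": 7, "opNum": 2}, "op": "ADD",
         "data": {"id": "u1", "type": "vp", "key": "name",
                  "value": {"value": "alice", "dataType": "String"}}},
    ],
}

seen = []
checkpoint = apply_page(sample_page, lambda op, data: seen.append((op, data["type"], data["id"])))
next_url = stream_url("your-cluster.neptune.amazonaws.com", checkpoint)
print(next_url)
```

Persisting the returned checkpoint after each page lets a consumer resume where it left off, which is the usual shape of a CDC pipeline feeding a search index or cache.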
Pricing Model
| Component | Pricing Basis | Tip |
|---|---|---|
| Instance hours | Per hour by instance class | Reserve production instances for savings |
| Storage | Per GB-month (auto-grows) | No pre-provisioning needed |
| I/O requests | Per million I/Os | Optimize traversal depth to reduce I/O |
| Neptune Serverless | Per NCU-second (min 2.5 NCUs) | Good for dev and unpredictable workloads |
| Streams | Per GB read from stream | Enable only if you need CDC |
| Backup storage | Free up to cluster size | Standard Aurora-style backup pricing |
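The first three rows of the table combine into a simple additive estimate for a provisioned cluster. The sketch below shows only the arithmetic. Every rate in the example call is a made-up placeholder, not an AWS price, so substitute the current figures for your region and instance class.

```python
def monthly_cost(instance_hours, hourly_rate,
                 storage_gb, gb_month_rate,
                 io_millions, rate_per_million_io):
    """Sum the instance, storage, and I/O components of a monthly bill."""
    instance_cost = instance_hours * hourly_rate
    storage_cost = storage_gb * gb_month_rate
    io_cost = io_millions * rate_per_million_io
    return instance_cost + storage_cost + io_cost


# Placeholder numbers only: one instance running about 730 hours in a month.
estimate = monthly_cost(
    instance_hours=730, hourly_rate=0.50,       # hypothetical rate
    storage_gb=200, gb_month_rate=0.10,         # hypothetical rate
    io_millions=50, rate_per_million_io=0.20,   # hypothetical rate
)
print(f"Estimated monthly cost: {estimate:.2f}")
```

Because I/O is billed per request, the `io_millions` term is the one that traversal depth influences most directly.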
Interview Focus Points
1. What is a graph database and when would you choose Neptune over a relational database?
2. Explain the difference between the Property Graph model (Gremlin/openCypher) and RDF (SPARQL).
3. Describe a fraud detection use case using Neptune. What does the graph model look like?
4. How does Neptune's storage architecture compare to a standard relational database?
5. What are the limitations of Neptune? What types of workloads is it NOT suited for?
6. How do you bulk load data into Neptune efficiently?
7. What is Neptune Streams and what problem does it solve?
8. How does Neptune handle graph traversals at scale - what are the performance considerations?