Neptune

Fully managed graph database for building applications with highly connected datasets

Amazon Neptune is a fully managed graph database service that supports two graph models: the Property Graph model (queried with Apache TinkerPop Gremlin and openCypher) and the RDF model (queried with SPARQL). It is purpose-built for workloads where relationships between data are as important as the data itself: social networks, fraud detection, knowledge graphs, recommendation engines, and network topology analysis. Neptune is designed to store billions of relationships and query them with millisecond latency.

Graph Models and Query Languages

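Neptune supports two fundamentally different graph paradigms. Many applications choose the Property Graph model because it maps naturally to application data, and Gremlin and openCypher tend to be more approachable for application developers than SPARQL. Both property-graph languages query the same stored data, so a cluster is not locked into one of them.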

Model	Query Language	Data Structure	Best For
Property Graph	Gremlin (TinkerPop) or openCypher	Vertices (nodes) and Edges with properties	Social networks, fraud detection, recommendations
RDF (Resource Description Framework)	SPARQL	Triples: subject-predicate-object	Knowledge graphs, semantic web, linked data

bash

// Gremlin example: find all users who follow user "alice"
// and also follow user "bob"
g.V().has("User", "name", "alice")
  .in("FOLLOWS")
  .where(out("FOLLOWS").has("User", "name", "bob"))
  .values("name")

// openCypher example: same query
MATCH (alice:User {name: "alice"})<-[:FOLLOWS]-(u:User)
      -[:FOLLOWS]->(bob:User {name: "bob"})
RETURN u.name

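# SPARQL example: find all papers citing paper X
# (SPARQL comments start with "#"; the example.org prefix is a placeholder namespace)
PREFIX : <http://example.org/>
SELECT ?paper ?title
WHERE {
  ?paper :cites :PaperX .
  ?paper :title ?title .
}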
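
To make the difference between the two models concrete, the sketch below stores the same "follows" facts once as a tiny property graph and once as RDF-style triples, then answers the mutual-follower question from the Gremlin and openCypher examples in plain Python. It is an in-memory illustration only, not a description of how Neptune stores data; the sample users, the `ex:` prefix, and the helper names are all made up for this example.

```python
# Property graph: vertices and edges are records that carry their own properties.
vertices = {
    "u1": {"label": "User", "name": "alice"},
    "u2": {"label": "User", "name": "bob"},
    "u3": {"label": "User", "name": "carol"},
    "u4": {"label": "User", "name": "dave"},
}
edges = [
    {"label": "FOLLOWS", "from": "u3", "to": "u1", "since": 2021},
    {"label": "FOLLOWS", "from": "u3", "to": "u2", "since": 2022},
    {"label": "FOLLOWS", "from": "u4", "to": "u1", "since": 2023},
]

# RDF: every fact, including what would be a property, is one
# (subject, predicate, object) triple.
triples = [
    ("ex:u1", "ex:name", "alice"),
    ("ex:u2", "ex:name", "bob"),
    ("ex:u3", "ex:name", "carol"),
    ("ex:u3", "ex:follows", "ex:u1"),
    ("ex:u3", "ex:follows", "ex:u2"),
    ("ex:u4", "ex:follows", "ex:u1"),
]


def followers_of(name):
    """Property graph: ids of users with a FOLLOWS edge to the named user."""
    targets = {vid for vid, v in vertices.items() if v["name"] == name}
    return {e["from"] for e in edges
            if e["label"] == "FOLLOWS" and e["to"] in targets}


def followers_of_rdf(name):
    """RDF: subjects that have an ex:follows triple pointing at the named user."""
    targets = {s for s, p, o in triples if p == "ex:name" and o == name}
    return {s for s, p, o in triples if p == "ex:follows" and o in targets}


# Users who follow both alice and bob, computed against each model.
mutual = followers_of("alice") & followers_of("bob")
mutual_names = sorted(vertices[v]["name"] for v in mutual)
mutual_rdf = followers_of_rdf("alice") & followers_of_rdf("bob")

print(mutual_names)  # ['carol']
print(mutual_rdf)    # {'ex:u3'}
```

Note how the `since` value sits directly on the edge in the property graph, whereas plain triples have no place for edge properties without extra modelling. That is one practical reason application data often fits the property-graph model more directly.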

Neptune Cluster Architecture and Storage

Neptune uses the same distributed storage design as Amazon Aurora: six copies of data across three Availability Zones, automatic storage growth up to 128 TB, and a single cluster volume shared by the primary instance and up to 15 read replicas.

Feature	Detail
Storage	Distributed cluster volume; 6 replicas across 3 AZs; auto-grows in 10 GB chunks
Max storage	128 TB
Read replicas	Up to 15; share cluster volume (near-zero replica lag)
Failover	Automatic Multi-AZ failover to a read replica, typically in under 30 seconds
Instance types	r5, r6g, x2g families; memory-optimized for large graphs
Neptune Serverless	Auto-scales NCUs (Neptune Capacity Units); good for dev/variable workloads

💡

Graph databases are memory-intensive because traversals touch many vertices and edges, and they run fastest when that part of the graph is already cached in memory. Choose memory-optimized instance types such as r5 or r6g for production Neptune clusters, and size them so the working set fits in memory.

Use Cases: When to Choose Neptune Over Relational or Document DBs

Graph databases excel when deep, complex relationship traversal is the core operation. Relational databases struggle with multi-hop queries that join five or more tables, whereas the same question is a short traversal in a graph model.

Use Case	Why Graph Is Better	Example Query
Fraud detection	Find shared devices/IPs/cards across account networks in real time	g.V(account).repeat(out()).times(4).has("flagged", true)
Social network	Friends of friends, mutual connections, influence scoring	MATCH (u:User)-[:FRIENDS*2..3]->(recommendation) RETURN recommendation
Recommendation engine	Collaborative filtering via relationship traversal	Users who bought X also bought Y
Knowledge graph	Entity relationships across domains	SPARQL over RDF triples
Network topology	Shortest path, connected components, dependency analysis	g.V(source).repeat(out().simplePath()).until(hasId(target)).path().limit(1)
Identity graph	Link user identities across devices and channels	Traverse SAME_AS edges to canonical identity
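
The fraud-detection and network-topology rows both reduce to walking adjacency lists. The sketch below mirrors them in plain Python over a hypothetical in-memory graph: `k_hop` approximates `repeat(out()).times(4)` (with duplicates collapsed, which Gremlin would not do on its own), and `shortest_path` is an ordinary breadth-first search. The vertex ids and the `flagged` set are invented for illustration.

```python
from collections import deque

# Hypothetical adjacency list: each id maps to its outgoing neighbours.
graph = {
    "acct-1": ["device-9"],
    "device-9": ["acct-2"],
    "acct-2": ["card-7"],
    "card-7": ["acct-3"],
    "acct-3": [],
}
flagged = {"acct-3"}


def k_hop(graph, start, hops):
    """Vertices reached after exactly `hops` outgoing steps (deduplicated)."""
    frontier = {start}
    for _ in range(hops):
        frontier = {nbr for v in frontier for nbr in graph.get(v, [])}
    return frontier


def shortest_path(graph, source, target):
    """Breadth-first search; returns one shortest path as a list, or None."""
    queue = deque([[source]])
    seen = {source}
    while queue:
        path = queue.popleft()
        if path[-1] == target:
            return path
        for nbr in graph.get(path[-1], []):
            if nbr not in seen:
                seen.add(nbr)
                queue.append(path + [nbr])
    return None


flagged_at_4 = k_hop(graph, "acct-1", 4) & flagged
path = shortest_path(graph, "acct-1", "card-7")

print(flagged_at_4)  # {'acct-3'}
print(path)          # ['acct-1', 'device-9', 'acct-2', 'card-7']
```

In a relational schema each iteration of the loop in `k_hop` would correspond to another self-join, which is why query cost and complexity climb quickly with traversal depth.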

⚠️

Do not use Neptune as a general-purpose database. It does not support ad-hoc SQL, offers limited aggregation capabilities compared to OLAP databases, and is significantly more expensive than relational databases for data that is not relationship-centric. Use it only when traversal queries are the dominant access pattern.

Data Loading: Neptune Loader and Streams

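Neptune provides a bulk ingestion path and a change-capture path. The Neptune Loader API bulk-loads files from S3 (CSV for property graphs; RDF formats such as N-Triples and Turtle for RDF data). Neptune Streams is not an ingestion mechanism: it records changes made to the graph so that downstream consumers can read them as change data capture (CDC).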

bash

# Bulk load vertices from S3 CSV
# vertices.csv header row: ~id, ~label, name:String, age:Int
curl -X POST https://your-cluster.neptune.amazonaws.com:8182/loader \
  -H 'Content-Type: application/json' \
  -d '{
    "source": "s3://my-bucket/graph-data/vertices.csv",
    "format": "csv",
    "iamRoleArn": "arn:aws:iam::123456789012:role/NeptuneLoadRole",
    "region": "us-east-1",
    "failOnError": "FALSE",
    "parallelism": "MEDIUM"
  }'

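# Check load status (replace <loadId> with the id returned by the POST above;
# the URL is quoted so the shell does not treat < and > as redirections)
curl 'https://your-cluster.neptune.amazonaws.com:8182/loader/<loadId>'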
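
Load jobs run asynchronously, so a client usually polls the status endpoint until the job leaves its queued or running state. The helper below is a sketch of that check against an already-fetched response body. The sample response is abbreviated and illustrative, and the set of in-flight status strings is an assumption drawn from the Loader documentation, so confirm it against the current reference before relying on it.

```python
import json

# Statuses assumed to mean "still working"; anything else is treated as terminal.
IN_FLIGHT = {"LOAD_NOT_STARTED", "LOAD_IN_QUEUE", "LOAD_IN_PROGRESS"}


def load_state(response_body):
    """Return (status, finished, succeeded) from a loader status response body."""
    doc = json.loads(response_body)
    status = doc["payload"]["overallStatus"]["status"]
    finished = status not in IN_FLIGHT
    return status, finished, status == "LOAD_COMPLETED"


# Abbreviated, illustrative response; real responses carry more fields.
sample = json.dumps({
    "status": "200 OK",
    "payload": {
        "overallStatus": {"status": "LOAD_IN_PROGRESS", "totalRecords": 1200}
    },
})

print(load_state(sample))  # ('LOAD_IN_PROGRESS', False, False)
```

A polling loop would call the status URL, pass the body to `load_state`, and sleep between attempts until `finished` is true. Treating unknown statuses as terminal keeps the loop from spinning forever on an error state it does not recognise.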

💡

The Neptune Loader reads from S3 in parallel and is significantly faster than inserting vertices/edges one at a time via Gremlin. For initial data loads or large batch updates always use the Loader API with S3 as the staging area.
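
Before a bulk load, the data has to be written in the Loader's property-graph CSV layout: a header row with system columns (`~id`, `~label`, and for edges `~from` and `~to`) followed by typed property columns. A minimal sketch using Python's standard `csv` module follows. The rows, the edge id, and the `since:Int` property are invented for illustration, and the resulting text would still need to be uploaded to the S3 location named in the load request.

```python
import csv
import io


def to_csv(header, rows):
    """Serialize rows under the given header and return the CSV text."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)
    return buf.getvalue()


# Vertex file: ~id and ~label are system columns; the rest are typed properties.
vertex_csv = to_csv(
    ["~id", "~label", "name:String", "age:Int"],
    [["u1", "User", "alice", 34], ["u2", "User", "bob", 29]],
)

# Edge file: ~from and ~to reference vertex ids defined in the vertex file.
edge_csv = to_csv(
    ["~id", "~from", "~to", "~label", "since:Int"],
    [["e1", "u2", "u1", "FOLLOWS", 2022]],
)

print(vertex_csv)
print(edge_csv)
```

Keeping vertices and edges in separate files makes it easier to load vertices first, so that every `~from` and `~to` value in the edge file refers to an id that already exists.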

Pricing Model

Component	Pricing Basis	Tip
Instance hours	Per hour by instance class	Reserve production instances for savings
Storage	Per GB-month (auto-grows)	No pre-provisioning needed
I/O requests	Per million I/Os	Optimize traversal depth to reduce I/O
Neptune Serverless	Per NCU-hour, billed per second (minimum 1 NCU; earlier releases required 2.5)	Good for dev and unpredictable workloads
Streams	No separate fee; change records add to storage and I/O usage	Enable only if you need CDC
Backup storage	Free up to cluster size	Standard Aurora-style backup pricing
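
For a provisioned cluster, the first three rows combine into a simple monthly estimate. The function below is a sketch of that arithmetic only: every rate is a parameter, and the numbers in the example call are placeholders chosen for easy arithmetic, not real AWS prices. Look up current rates for your region and instance class before using it.

```python
def monthly_cost(instance_hours, hourly_rate, storage_gb, gb_month_rate,
                 io_requests, rate_per_million_io):
    """Sum instance, storage, and I/O charges for one month."""
    instance = instance_hours * hourly_rate
    storage = storage_gb * gb_month_rate
    io = (io_requests / 1_000_000) * rate_per_million_io
    return {
        "instance": instance,
        "storage": storage,
        "io": io,
        "total": instance + storage + io,
    }


# Placeholder rates: NOT real AWS prices.
estimate = monthly_cost(
    instance_hours=2 * 730,       # two instances running all month
    hourly_rate=1.0,              # placeholder $/instance-hour
    storage_gb=500,
    gb_month_rate=0.5,            # placeholder $/GB-month
    io_requests=200_000_000,
    rate_per_million_io=0.25,     # placeholder $/million I/Os
)

print(estimate["total"])  # 1760.0
```

With these placeholder inputs the instance term dominates, which is why reserving production instances and right-sizing read replicas are usually the first levers to consider.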

🎯

Interview Focus Points

1. What is a graph database and when would you choose Neptune over a relational database?
2. Explain the difference between the Property Graph model (Gremlin/openCypher) and RDF (SPARQL).
3. Describe a fraud detection use case using Neptune. What does the graph model look like?
4. How does Neptune's storage architecture compare to a standard relational database?
5. What are the limitations of Neptune? What types of workloads is it NOT suited for?
6. How do you bulk load data into Neptune efficiently?
7. What is Neptune Streams and what problem does it solve?
8. How does Neptune handle graph traversals at scale, and what are the performance considerations?