AWS Database

Neptune

Fully managed graph database for building applications with highly connected datasets

Amazon Neptune is a fully managed graph database service that supports two graph models: the Property Graph model (queried with Apache TinkerPop Gremlin and openCypher) and the RDF model (queried with SPARQL). It is purpose-built for workloads where relationships between data are as important as the data itself - social networks, fraud detection, knowledge graphs, recommendation engines, and network topology analysis. Neptune stores billions of relationships and queries them with millisecond latency.

Graph Models and Query Languages

Neptune supports two fundamentally different graph paradigms. Most applications use the Property Graph model because it maps naturally to application data and Gremlin/openCypher are more developer-friendly than SPARQL.

Model | Query Language | Data Structure | Best For
Property Graph | Gremlin (TinkerPop) or openCypher | Vertices (nodes) and edges with properties | Social networks, fraud detection, recommendations
RDF (Resource Description Framework) | SPARQL | Triples: subject-predicate-object | Knowledge graphs, semantic web, linked data
gremlin
// Gremlin example: find all users who follow both "alice" and "bob"
g.V().has("User", "name", "alice")
  .in("FOLLOWS")
  .where(out("FOLLOWS").has("User", "name", "bob"))
  .values("name")

cypher
// openCypher example: same query
MATCH (alice:User {name: "alice"})<-[:FOLLOWS]-(u:User)
      -[:FOLLOWS]->(bob:User {name: "bob"})
RETURN u.name

sparql
# SPARQL example: find all papers citing paper X
# (the empty prefix below is a placeholder namespace)
PREFIX : <http://example.org/>
SELECT ?paper ?title
WHERE {
  ?paper :cites :PaperX .
  ?paper :title ?title .
}
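To make the difference between the two models concrete, here is a minimal in-memory sketch in Python. It is not how Neptune stores data; the vertex IDs, the `since` property, and the paper titles are invented sample data. It only shows that a property graph keeps vertices and edges as records with properties, while RDF flattens everything into triples.

```python
# Property graph: vertices and edges are records that carry properties.
vertices = {
    "u1": {"label": "User", "name": "alice"},
    "u2": {"label": "User", "name": "bob"},
    "u3": {"label": "User", "name": "carol"},
    "u4": {"label": "User", "name": "dave"},
}
edges = [
    # (from, label, to, properties); direction is follower -> followed
    ("u3", "FOLLOWS", "u1", {"since": 2021}),
    ("u3", "FOLLOWS", "u2", {"since": 2022}),
    ("u4", "FOLLOWS", "u1", {"since": 2023}),
]

def followers_of(name):
    """Names of users with a FOLLOWS edge pointing at the named user."""
    targets = {vid for vid, v in vertices.items() if v["name"] == name}
    return {
        vertices[src]["name"]
        for src, label, dst, _props in edges
        if label == "FOLLOWS" and dst in targets
    }

# Same question as the Gremlin and openCypher examples:
# who follows both alice and bob?
follows_both = followers_of("alice") & followers_of("bob")

# RDF: attributes and relationships alike are (subject, predicate, object).
triples = {
    (":Paper1", ":cites", ":PaperX"),
    (":Paper1", ":title", "Graphs at Scale"),
    (":Paper2", ":cites", ":PaperY"),
    (":Paper2", ":title", "Unrelated Work"),
}

def papers_citing(target):
    """Mimic the pattern: ?paper :cites target . ?paper :title ?title ."""
    citing = {s for s, p, o in triples if p == ":cites" and o == target}
    return sorted((s, o) for s, p, o in triples if p == ":title" and s in citing)

print(follows_both)
print(papers_citing(":PaperX"))
```

Note that the edge property (`since`) has a natural home in the property graph, whereas plain triples have no slot for it. That is one reason application data tends to map more directly onto the Property Graph model.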

Neptune Cluster Architecture and Storage

Neptune uses a distributed storage architecture derived from Amazon Aurora: six copies of data across three Availability Zones, automatic storage growth up to 128 TB, and a cluster volume shared between the primary instance and up to 15 read replicas.

Feature | Detail
Storage | Distributed cluster volume; 6 replicas across 3 AZs; auto-grows in 10 GB chunks
Max storage | 128 TB
Read replicas | Up to 15; share cluster volume (near-zero replica lag)
Failover | Automatic Multi-AZ failover in < 30 seconds
Instance types | r5, r6g, x2g families; memory-optimized for large graphs
Neptune Serverless | Auto-scales NCUs (Neptune Capacity Units); good for dev/variable workloads
💡

Graph databases are memory-intensive because traversals require holding graph structure in memory. Choose r5 or r6g (memory-optimized) instance types for production Neptune clusters and allocate enough memory to hold your working set.

Use Cases: When to Choose Neptune Over Relational or Document DBs

Graph databases excel when deep, complex relationship traversal is the core operation. Relational databases struggle with multi-hop queries, because every hop becomes another join (5+ joins is common), while the same traversal is a short expression in a graph query language.

Use Case | Why Graph Is Better | Example Query
Fraud detection | Find shared devices/IPs/cards across account networks in real time | g.V(account).repeat(out()).times(4).has("flagged", true)
Social network | Friends of friends, mutual connections, influence scoring | MATCH (u:User)-[:FRIENDS*2..3]->(recommendation)
Recommendation engine | Collaborative filtering via relationship traversal | Users who bought X also bought Y
Knowledge graph | Entity relationships across domains | SPARQL over RDF triples
Network topology | Shortest path, connected components, dependency analysis | g.V(source).repeat(out().simplePath()).until(hasId(target)).path().limit(1)
Identity graph | Link user identities across devices and channels | Traverse SAME_AS edges to canonical identity
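The multi-hop idea behind the fraud-detection row can be sketched as a bounded breadth-first search. This is an illustrative Python model, not Neptune code: the account IDs, the adjacency list, and the `flagged` set are hypothetical. It also collects everything within the hop limit, a slight variation on the table's `times(4)` traversal, which looks only at vertices exactly four hops out.

```python
from collections import deque

# Hypothetical account graph: an edge means "shares a device, IP, or card with".
graph = {
    "acct1": ["acct2"],
    "acct2": ["acct3"],
    "acct3": ["acct4", "acct5"],
    "acct4": [],
    "acct5": ["acct6"],
    "acct6": [],
}
flagged = {"acct5", "acct6"}

def within_hops(graph, start, max_hops):
    """Return {vertex: hop distance} for vertices reachable in <= max_hops."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        vertex = queue.popleft()
        if dist[vertex] == max_hops:
            continue  # do not expand past the hop limit
        for neighbor in graph.get(vertex, []):
            if neighbor not in dist:  # first visit is the shortest distance
                dist[neighbor] = dist[vertex] + 1
                queue.append(neighbor)
    del dist[start]  # the start vertex is not its own neighbor
    return dist

reachable = within_hops(graph, "acct1", 4)
suspicious = sorted(acct for acct in reachable if acct in flagged)
print(suspicious)
```

In a relational schema, each iteration of the `while` loop's expansion corresponds to one more self-join on an edges table, which is why the depth of the traversal, rather than the size of the data, tends to decide whether a graph database is worth it.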
⚠️

Do not use Neptune as a general-purpose database. It has no support for ad-hoc SQL, limited aggregation capabilities compared to OLAP databases, and is significantly more expensive than relational databases for data that is not relationship-centric. Use it only when traversal queries are the dominant access pattern.

Data Loading: Neptune Loader and Streams

Neptune provides the Loader API for bulk ingestion of files staged in S3 (CSV for property graphs) and Neptune Streams for change data capture. The Loader moves data into the graph; Streams exposes an ordered log of the changes made to it.

bash
# Bulk load vertices from S3 CSV
# vertices.csv format: ~id, ~label, name:String, age:Int
curl -X POST https://your-cluster.neptune.amazonaws.com:8182/loader \
  -H 'Content-Type: application/json' \
  -d '{
    "source": "s3://my-bucket/graph-data/vertices.csv",
    "format": "csv",
    "iamRoleArn": "arn:aws:iam::123456789012:role/NeptuneLoadRole",
    "region": "us-east-1",
    "failOnError": "FALSE",
    "parallelism": "MEDIUM"
  }'

# Check load status (replace <loadId> with the loadId returned by the POST above)
curl 'https://your-cluster.neptune.amazonaws.com:8182/loader/<loadId>'
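Preparing the files is usually the fiddly part, so here is a small Python sketch that writes the vertex header shown in the comment above, a matching edge file using the `~id, ~from, ~to, ~label` system columns, and the same JSON body as the curl call. The helper names and sample rows are hypothetical, and the sketch stops short of uploading to S3 or calling the endpoint.

```python
import csv
import io
import json

def vertices_csv(rows):
    """Render vertex rows with the header ~id, ~label, name:String, age:Int."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(["~id", "~label", "name:String", "age:Int"])
    for row in rows:
        writer.writerow([row["id"], row["label"], row["name"], row["age"]])
    return buf.getvalue()

def edges_csv(rows):
    """Render edge rows with the system columns ~id, ~from, ~to, ~label."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(["~id", "~from", "~to", "~label"])
    for row in rows:
        writer.writerow([row["id"], row["from"], row["to"], row["label"]])
    return buf.getvalue()

def loader_request(source, role_arn, region):
    """Build the same request body the curl example sends to /loader."""
    return {
        "source": source,
        "format": "csv",
        "iamRoleArn": role_arn,
        "region": region,
        "failOnError": "FALSE",
        "parallelism": "MEDIUM",
    }

# Sample data (invented for illustration).
print(vertices_csv([{"id": "u1", "label": "User", "name": "alice", "age": 34}]))
print(edges_csv([{"id": "e1", "from": "u2", "to": "u1", "label": "FOLLOWS"}]))
print(json.dumps(
    loader_request(
        "s3://my-bucket/graph-data/vertices.csv",
        "arn:aws:iam::123456789012:role/NeptuneLoadRole",
        "us-east-1",
    ),
    indent=2,
))
```

Using the `csv` module rather than string concatenation matters once property values contain commas or quotes, since the writer handles the quoting.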
💡

The Neptune Loader reads from S3 in parallel and is significantly faster than inserting vertices and edges one at a time via Gremlin. For initial data loads or large batch updates, use the Loader API with S3 as the staging area.
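On the change data capture side, a consumer typically polls the stream, applies each change record to a downstream store, and remembers how far it got. The Python sketch below shows that loop body over an in-memory sample. The record layout (an `eventId` with `commitNum` and `opNum`, a `data` object with `id`, `type`, `key`, `value`, and an `op` of `ADD` or `REMOVE`) reflects my understanding of the property-graph stream format and should be checked against the Neptune Streams documentation; the sample records themselves are invented, and set-valued properties are ignored for brevity.

```python
def apply_changes(response, properties, checkpoint=None):
    """Apply vertex-property changes to a local dict and return the new checkpoint.

    `checkpoint` is the (commitNum, opNum) of the last record already applied,
    so replaying the same batch is a no-op.
    """
    for record in response["records"]:
        event = (record["eventId"]["commitNum"], record["eventId"]["opNum"])
        if checkpoint is not None and event <= checkpoint:
            continue  # already processed
        data = record["data"]
        if data["type"] == "vp":  # assumed type code for a vertex property
            props = properties.setdefault(data["id"], {})
            if record["op"] == "ADD":
                props[data["key"]] = data["value"]["value"]
            elif record["op"] == "REMOVE":
                props.pop(data["key"], None)
        checkpoint = event
    return checkpoint

def _record(commit, op_num, kind, key, value, op):
    """Build one hypothetical change record for vertex u1."""
    return {
        "eventId": {"commitNum": commit, "opNum": op_num},
        "data": {"id": "u1", "type": kind, "key": key,
                 "value": {"value": value, "dataType": "String"}},
        "op": op,
    }

sample_response = {
    "records": [
        _record(1, 1, "vl", "label", "User", "ADD"),
        _record(1, 2, "vp", "name", "alice", "ADD"),
        _record(2, 1, "vp", "city", "Oslo", "ADD"),
        _record(3, 1, "vp", "city", "Oslo", "REMOVE"),
    ]
}

downstream = {}
last_seen = apply_changes(sample_response, downstream)
print(downstream, last_seen)
```

Persisting the returned checkpoint alongside the downstream store is what lets a consumer resume after a crash without missing or double-applying changes, which is the problem Streams is meant to solve for search indexes, caches, and audit trails.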

Pricing Model

Component | Pricing Basis | Tip
Instance hours | Per hour by instance class | Reserve production instances for savings
Storage | Per GB-month (auto-grows) | No pre-provisioning needed
I/O requests | Per million I/Os | Optimize traversal depth to reduce I/O
Neptune Serverless | Per NCU-hour, billed per second (configurable minimum capacity, as low as 1 NCU) | Good for dev and unpredictable workloads
Streams | No separate line item; change records consume cluster storage and I/O | Enable only if you need CDC
Backup storage | Free up to cluster size | Standard Aurora-style backup pricing
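The first three components combine into a simple monthly estimate. The Python sketch below shows the arithmetic only: every rate in it is a PLACEHOLDER chosen to make the sums easy to follow, not an AWS price, and the workload figures are invented. Look up current rates for your region before using anything like this.

```python
from dataclasses import dataclass

@dataclass
class Rates:
    """PLACEHOLDER rates for illustration only; these are not real AWS prices."""
    instance_hour: float      # per instance, per hour
    storage_gb_month: float   # per GB-month
    io_per_million: float     # per million I/O requests

def monthly_cost(rates, instances, hours, storage_gb, io_requests):
    """Sum the instance, storage, and I/O components for one month."""
    instance = instances * hours * rates.instance_hour
    storage = storage_gb * rates.storage_gb_month
    io = (io_requests / 1_000_000) * rates.io_per_million
    return {
        "instance": instance,
        "storage": storage,
        "io": io,
        "total": instance + storage + io,
    }

# Hypothetical workload: one primary plus one replica, 500 GB, 50M I/Os.
placeholder_rates = Rates(instance_hour=1.0, storage_gb_month=0.1, io_per_million=0.2)
estimate = monthly_cost(placeholder_rates, instances=2, hours=730,
                        storage_gb=500, io_requests=50_000_000)
for component, amount in estimate.items():
    print(f"{component}: {amount:.2f}")
```

Splitting the estimate by component makes it easy to see which lever matters: instance hours usually dominate for always-on clusters, while I/O grows with traversal depth and fan-out.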
🎯

Interview Focus Points

  1. What is a graph database and when would you choose Neptune over a relational database?
  2. Explain the difference between the Property Graph model (Gremlin/openCypher) and RDF (SPARQL).
  3. Describe a fraud detection use case using Neptune. What does the graph model look like?
  4. How does Neptune's storage architecture compare to a standard relational database?
  5. What are the limitations of Neptune? What types of workloads is it NOT suited for?
  6. How do you bulk load data into Neptune efficiently?
  7. What is Neptune Streams and what problem does it solve?
  8. How does Neptune handle graph traversals at scale - what are the performance considerations?