Comprehend
NLP service for entity recognition, sentiment analysis, and topic modeling
Amazon Comprehend is a fully managed NLP service that uses machine learning to find insights and relationships in text - including entities, key phrases, sentiment, language, PII, and topics. It works on any text without requiring you to train or maintain models. For cloud engineers, Comprehend is the go-to service for adding language intelligence to document processing pipelines, customer feedback analysis, and compliance workflows.
Built-in NLP Capabilities
Comprehend provides a rich set of pre-trained NLP capabilities accessible via API calls. No ML training required for any of these.
| Feature | API | What It Returns |
|---|---|---|
| Entity Recognition | DetectEntities | Named entities: PERSON, ORGANIZATION, LOCATION, DATE, QUANTITY, TITLE, COMMERCIAL_ITEM, EVENT, OTHER |
| Key Phrase Extraction | DetectKeyPhrases | Noun phrases that are most meaningful to the content |
| Sentiment Analysis | DetectSentiment | POSITIVE, NEGATIVE, NEUTRAL, MIXED with confidence scores |
| Language Detection | DetectDominantLanguage | RFC 5646 language code (for example `en`, `es`) and confidence score, covering 100+ languages |
| PII Detection | DetectPiiEntities | PII types: NAME, ADDRESS, SSN, CREDIT_DEBIT_NUMBER, EMAIL, PHONE, etc. |
| PII Redaction | StartPiiEntitiesDetectionJob with Mode=ONLY_REDACTION (async) | Returns document with PII replaced by entity type label |
| Targeted Sentiment | DetectTargetedSentiment | Sentiment per mentioned entity in the text (not just overall) |
| Syntax Analysis | DetectSyntax | Part-of-speech tags for each token |
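DetectPiiEntities returns entity types with character offsets rather than redacted text, so synchronous redaction is left to the caller. The following is a minimal sketch of that step, assuming the response entities carry `Type`, `Score`, `BeginOffset`, and `EndOffset`. The helper name `redact_pii`, the `min_score` threshold, and the sample entities are illustrative and are not real service output.

```python
def redact_pii(text, pii_entities, min_score=0.8):
    """Replace each detected PII span with its entity type label, e.g. [EMAIL]."""
    redacted = text
    # Work from the end of the string so earlier offsets stay valid.
    for entity in sorted(pii_entities, key=lambda e: e["BeginOffset"], reverse=True):
        if entity["Score"] < min_score:
            continue  # skip low-confidence detections (threshold is an assumption)
        label = f"[{entity['Type']}]"
        redacted = redacted[:entity["BeginOffset"]] + label + redacted[entity["EndOffset"]:]
    return redacted


# Hand-written entities in the shape of response["Entities"] (illustrative only).
text = "Contact Jane Doe at jane@example.com."
sample_entities = [
    {"Type": "NAME", "Score": 0.99, "BeginOffset": 8, "EndOffset": 16},
    {"Type": "EMAIL", "Score": 0.99, "BeginOffset": 20, "EndOffset": 36},
]

print(redact_pii(text, sample_entities))  # Contact [NAME] at [EMAIL].
```

With a real client, the entity list would come from `comprehend.detect_pii_entities(Text=text, LanguageCode='en')['Entities']`.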
Targeted sentiment is particularly valuable for product reviews - you can learn that a customer feels POSITIVE about delivery speed but NEGATIVE about product quality in the same review.
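To make the review example concrete, here is a hedged sketch that collapses a DetectTargetedSentiment response into a per-target summary. It assumes the response groups mentions under `Entities[].Mentions[]`, each with `Text`, `Score`, and `MentionSentiment.Sentiment`. The function name `sentiment_by_target`, the `min_score` cutoff, and the sample response are hypothetical.

```python
from collections import defaultdict


def sentiment_by_target(response, min_score=0.5):
    """Map each mentioned target (lowercased text) to the sentiments expressed about it."""
    targets = defaultdict(set)
    for entity in response.get("Entities", []):
        for mention in entity.get("Mentions", []):
            if mention.get("Score", 0.0) < min_score:
                continue  # ignore low-confidence mentions (cutoff is an assumption)
            targets[mention["Text"].lower()].add(mention["MentionSentiment"]["Sentiment"])
    return dict(targets)


# Hand-written sample in the shape of a DetectTargetedSentiment response
# (illustrative values, not real service output).
sample_response = {
    "Entities": [
        {"DescriptiveMentionIndex": [0], "Mentions": [
            {"Text": "delivery", "Type": "OTHER", "Score": 0.99, "GroupScore": 1.0,
             "MentionSentiment": {"Sentiment": "POSITIVE"}}]},
        {"DescriptiveMentionIndex": [0], "Mentions": [
            {"Text": "quality", "Type": "ATTRIBUTE", "Score": 0.98, "GroupScore": 1.0,
             "MentionSentiment": {"Sentiment": "NEGATIVE"}}]},
    ]
}

print(sentiment_by_target(sample_response))
```

A live response would come from `comprehend.detect_targeted_sentiment(Text=review, LanguageCode='en')`.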
Comprehend Custom - Training Your Own Classifiers and NER Models
Comprehend Custom lets you train text classification and named entity recognition models using your own labeled data. It uses transfer learning on top of Comprehend's base language models, so you need relatively little training data.
| Custom Feature | Minimum Training Data | Use Case Example |
|---|---|---|
| Custom Classifier (multi-class) | 10 examples per class, 5 classes minimum | Route support tickets to the right team (billing, technical, returns) |
| Custom Classifier (multi-label) | 50 labeled documents per label | Tag documents with multiple categories simultaneously |
| Custom Entity Recognizer | 100 annotations per entity type OR entity list | Recognize product SKUs, internal codes, medical terminology |
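For a multi-class classifier, the `COMPREHEND_CSV` training file is a two-column CSV with the label first and the document text second, without a header row. The sketch below shows one way to produce it from labeled tickets. The helper `to_comprehend_csv` and the sample tickets are assumptions for illustration; uploading the result to S3 is left out.

```python
import csv
import io


def to_comprehend_csv(labeled_docs):
    """Serialize (label, text) pairs as two-column CSV without a header row."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    for label, text in labeled_docs:
        # Collapse newlines and repeated whitespace so each document stays on one row.
        writer.writerow([label, " ".join(text.split())])
    return buffer.getvalue()


# Hypothetical labeled examples; real training sets need many more per class.
tickets = [
    ("billing", "I was charged twice for my order."),
    ("technical", "The app crashes\nwhen I open settings."),
    ("returns", 'How do I return a "damaged" item?'),
]

print(to_comprehend_csv(tickets))
```

The `csv` module handles quoting of embedded commas and quotation marks, which is easy to get wrong when building rows by string concatenation.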
```python
# Start a custom classifier training job
import boto3

comprehend = boto3.client('comprehend')

response = comprehend.create_document_classifier(
    DocumentClassifierName='support-ticket-router',
    # IAM role that lets Comprehend read the training data and write output
    DataAccessRoleArn='arn:aws:iam::123456789012:role/ComprehendRole',
    InputDataConfig={
        'DataFormat': 'COMPREHEND_CSV',
        'S3Uri': 's3://my-bucket/training-data/tickets.csv'
    },
    OutputDataConfig={'S3Uri': 's3://my-bucket/output/'},
    LanguageCode='en',
    Mode='MULTI_CLASS'  # exactly one label per document
)

# Training runs asynchronously; the ARN identifies the model once it is trained
print(response['DocumentClassifierArn'])
```
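Real-time inference with custom classifiers and entity recognizers requires a deployed endpoint, which is billed for as long as it is running, whether or not it receives traffic. For batch processing, use StartDocumentClassificationJob (or StartEntitiesDetectionJob with a custom recognizer ARN) instead: you pay only for the text processed, which is usually much cheaper at high volume.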
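A batch classification job can be submitted along these lines. This is a sketch under stated assumptions: the helper names, the default job name, and the classifier ARN's region are hypothetical, and `client` is expected to be a boto3 Comprehend client created by the caller.

```python
def classification_job_request(classifier_arn, input_s3_uri, output_s3_uri, role_arn,
                               job_name="ticket-routing-batch"):
    """Build keyword arguments for StartDocumentClassificationJob."""
    return {
        "JobName": job_name,
        "DocumentClassifierArn": classifier_arn,
        "DataAccessRoleArn": role_arn,
        # ONE_DOC_PER_LINE treats every line of each input file as a document.
        "InputDataConfig": {"S3Uri": input_s3_uri, "InputFormat": "ONE_DOC_PER_LINE"},
        "OutputDataConfig": {"S3Uri": output_s3_uri},
    }


def start_batch_classification(client, **request):
    """Submit the job and return its id; `client` is a boto3 Comprehend client."""
    response = client.start_document_classification_job(**request)
    return response["JobId"]


# Hypothetical ARNs and bucket names, matching the training example's placeholders.
request = classification_job_request(
    classifier_arn="arn:aws:comprehend:us-east-1:123456789012:document-classifier/support-ticket-router",
    input_s3_uri="s3://my-bucket/incoming-tickets/",
    output_s3_uri="s3://my-bucket/classified/",
    role_arn="arn:aws:iam::123456789012:role/ComprehendRole",
)
```

Calling `start_batch_classification(boto3.client('comprehend'), **request)` would submit the job. Progress can then be polled with `describe_document_classification_job(JobId=...)`.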
Batch Processing with Async Jobs
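Most Comprehend operations have both synchronous (single document) and asynchronous (batch) variants; topic modeling is available only as an async job. For large document sets, prefer the async batch jobs, which read input from S3 and write results back to S3 in a single job instead of requiring one API call per document.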
| Async Job Type | API Call | Input Format | Output |
|---|---|---|---|
| Entities | StartEntitiesDetectionJob | One document per line in S3 | JSON per document in S3 |
| Sentiment | StartSentimentDetectionJob | One document per line in S3 | JSON per document in S3 |
| Key phrases | StartKeyPhrasesDetectionJob | One document per line in S3 | JSON per document in S3 |
| Topic modeling | StartTopicsDetectionJob | Documents in S3 (TXT or CSV) | Topic-term matrix and document-topic mapping |
| PII Redaction | StartPiiEntitiesDetectionJob (Mode=ONLY_REDACTION) | One document per line in S3 | Redacted documents in S3 |
Topic modeling is unique - it is an unsupervised LDA-based algorithm that discovers hidden topics across a corpus of documents. You specify the number of topics (typically 10-100) and Comprehend returns the top terms per topic.
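The topic-term output is easiest to read once it is grouped by topic. The sketch below assumes the job's `topic-terms.csv` output has the columns `topic`, `term`, and `weight`. The function name `top_terms_per_topic` and the sample rows are made up for illustration.

```python
import csv
import io
from collections import defaultdict


def top_terms_per_topic(topic_terms_csv, limit=5):
    """Group terms by topic id and keep the highest-weighted terms for each."""
    topics = defaultdict(list)
    for row in csv.DictReader(io.StringIO(topic_terms_csv)):
        topics[int(row["topic"])].append((float(row["weight"]), row["term"]))
    return {
        topic: [term for _, term in sorted(pairs, reverse=True)[:limit]]
        for topic, pairs in sorted(topics.items())
    }


# Made-up rows in the assumed topic-terms.csv layout.
sample_csv = """topic,term,weight
0,refund,0.21
0,invoice,0.34
1,crash,0.40
1,login,0.15
"""

print(top_terms_per_topic(sample_csv))
# {0: ['invoice', 'refund'], 1: ['crash', 'login']}
```

Scanning the top terms this way is a quick check on whether the chosen number of topics is too low (unrelated terms mixed together) or too high (near-duplicate topics).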
Comprehend Pricing
| Feature | Pricing |
|---|---|
| Synchronous NLP (entities, sentiment, etc.) | $0.0001 per unit (100 characters = 1 unit, 3 unit minimum) |
| Async NLP jobs | $0.0001 per unit (same rate, better for batches) |
| Custom classifier training | $3.00 per hour (billed per second) |
| Custom classifier real-time endpoint | $0.0005 per inference unit per second while the endpoint is running |
| Custom entity recognizer training | $3.00 per hour |
| Topic modeling | $1.00 per job for the first 100 MB of documents, then $0.004 per MB |
The 3-unit minimum means a very short string (under 300 characters) is charged as if it were 300 characters. For high-volume processing of short texts, group documents with the batch APIs (for example BatchDetectSentiment), which accept up to 25 documents per call.
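The effect of the minimum is easy to check with a little arithmetic. This sketch uses the $0.0001 per-unit rate from the table above and assumes partial units round up to the next whole unit. The function names are illustrative.

```python
import math
from decimal import Decimal

UNIT_CHARS = 100                    # 100 characters = 1 unit
MIN_UNITS = 3                       # minimum charge per request
PRICE_PER_UNIT = Decimal("0.0001")  # rate quoted in the pricing table


def billed_units(char_count):
    """Units charged for one request, assuming partial units round up."""
    return max(MIN_UNITS, math.ceil(char_count / UNIT_CHARS))


def request_cost(char_count):
    """Cost in USD for one request of the given length."""
    return billed_units(char_count) * PRICE_PER_UNIT


print(billed_units(120))                # 3  (2 units of text, raised to the minimum)
print(billed_units(1050))               # 11
print(1_000_000 * request_cost(120))    # 300.0000
```

One million 120-character reviews therefore cost about $300 rather than the $200 that the raw character count alone would suggest.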
Interview Focus Points
1. What is the difference between standard sentiment analysis and targeted sentiment in Comprehend?
2. How would you use Comprehend to build an automated document routing system for a support center?
3. When would you use Comprehend Custom Classifier vs Comprehend topic modeling?
4. What are the data format requirements for training a custom entity recognizer?
5. How does Comprehend PII detection compare to Macie for compliance use cases?
6. How would you design a pipeline to process 10 million customer reviews using Comprehend?
7. What is Comprehend topic modeling based on and what do you configure to get useful results?