Comprehend

NLP service for entity recognition, sentiment analysis, and topic modeling

Amazon Comprehend is a fully managed NLP service that uses machine learning to find insights and relationships in text - including entities, key phrases, sentiment, language, PII, and topics. It works on any text without requiring you to train or maintain models. For cloud engineers, Comprehend is the go-to service for adding language intelligence to document processing pipelines, customer feedback analysis, and compliance workflows.

Built-in NLP Capabilities

Comprehend provides a rich set of pre-trained NLP capabilities accessible via API calls. No ML training required for any of these.

Feature	API	What It Returns
Entity Recognition	DetectEntities	Named entities: PERSON, ORGANIZATION, LOCATION, DATE, QUANTITY, TITLE, COMMERCIAL_ITEM, EVENT, OTHER
Key Phrase Extraction	DetectKeyPhrases	Noun phrases that are most meaningful to the content
Sentiment Analysis	DetectSentiment	POSITIVE, NEGATIVE, NEUTRAL, MIXED with confidence scores
Language Detection	DetectDominantLanguage	RFC 5646 language code (for example "en") and confidence score, covering 100+ languages
PII Detection	DetectPiiEntities	PII types: NAME, ADDRESS, SSN, CREDIT_DEBIT_NUMBER, EMAIL, PHONE, etc.
PII Redaction	StartPiiEntitiesDetectionJob (async, Mode=ONLY_REDACTION)	Returns document with PII replaced by entity type label or a mask character
Targeted Sentiment	DetectTargetedSentiment	Sentiment per mentioned entity in the text (not just overall)
Syntax Analysis	DetectSyntax	Part-of-speech tags for each token
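As a rough illustration of the synchronous APIs in the table, the sketch below calls DetectSentiment and DetectEntities through a boto3 Comprehend client and condenses the two responses. The helper name `summarize_text` and the 0.8 score threshold are assumptions made for this example; the client is passed in as a parameter so the function can be exercised with a stub.

```python
def summarize_text(client, text, language_code="en", min_score=0.8):
    """Return the overall sentiment and the confident entities for one document.

    `client` is expected to be a boto3 Comprehend client, for example
    boto3.client("comprehend"). `min_score` is an illustrative threshold,
    not a value recommended by the service.
    """
    sentiment = client.detect_sentiment(Text=text, LanguageCode=language_code)
    entities = client.detect_entities(Text=text, LanguageCode=language_code)

    return {
        "sentiment": sentiment["Sentiment"],
        "sentiment_scores": sentiment["SentimentScore"],
        # Keep only entities the model is reasonably sure about.
        "entities": [
            (entity["Type"], entity["Text"])
            for entity in entities["Entities"]
            if entity["Score"] >= min_score
        ],
    }
```

Typical use would be `summarize_text(boto3.client("comprehend"), "Amazon opened an office in Seattle.")`, which needs AWS credentials and incurs the per-unit charge for each of the two calls.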

💡

Targeted sentiment is particularly valuable for product reviews - you can learn that a customer feels POSITIVE about delivery speed but NEGATIVE about product quality in the same review.
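To make the review example concrete, here is a minimal sketch that groups a DetectTargetedSentiment response by the text of each mention. The function name `sentiment_by_target` is hypothetical, and lower-casing the mention text to merge duplicates is a simplification chosen for this example.

```python
from collections import defaultdict


def sentiment_by_target(response):
    """Map each mentioned target to the sentiments expressed about it.

    `response` is the dict returned by detect_targeted_sentiment: a list of
    entities, each holding a list of mentions with a MentionSentiment.
    """
    targets = defaultdict(set)
    for entity in response.get("Entities", []):
        for mention in entity.get("Mentions", []):
            label = mention["MentionSentiment"]["Sentiment"]
            # Lower-case the text so "Delivery" and "delivery" are merged.
            targets[mention["Text"].lower()].add(label)
    return {text: sorted(labels) for text, labels in targets.items()}
```

The response would come from a call such as `boto3.client("comprehend").detect_targeted_sentiment(Text=review, LanguageCode="en")`. A target that maps to more than one label is one the customer described in conflicting ways.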

Comprehend Custom - Training Your Own Classifiers and NER Models

Comprehend Custom lets you train text classification and named entity recognition models using your own labeled data. It uses transfer learning on top of Comprehend's base language models, so you need relatively little training data.

Custom Feature	Minimum Training Data	Use Case Example
Custom Classifier (multi-class)	10 examples per class, 5 classes minimum	Route support tickets to the right team (billing, technical, returns)
Custom Classifier (multi-label)	50 labeled documents per label	Tag documents with multiple categories simultaneously
Custom Entity Recognizer	100 annotations per entity type OR entity list	Recognize product SKUs, internal codes, medical terminology
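Before training a multi-class classifier, the labeled examples have to be written out as a headerless two-column CSV (label, then text). The sketch below builds that file in memory and refuses to proceed when a class falls below the per-class minimum quoted in the table. The helper name `build_training_csv` is an assumption for this example, and collapsing whitespace inside each document is a simplification.

```python
import csv
import io
from collections import Counter

MIN_EXAMPLES_PER_CLASS = 10  # per-class minimum quoted in the table above


def build_training_csv(examples, min_per_class=MIN_EXAMPLES_PER_CLASS):
    """Serialize (label, text) pairs as a headerless two-column CSV string.

    Raises ValueError if any class has fewer than `min_per_class` examples.
    """
    counts = Counter(label for label, _ in examples)
    too_small = {label: n for label, n in counts.items() if n < min_per_class}
    if too_small:
        raise ValueError(f"classes below minimum of {min_per_class}: {too_small}")

    buffer = io.StringIO()
    writer = csv.writer(buffer)
    for label, text in examples:
        # Collapse newlines and repeated spaces so each document is one row.
        writer.writerow([label, " ".join(text.split())])
    return buffer.getvalue()
```

The resulting string can be uploaded to the S3 location that the training job reads from.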

python

# Start a custom classifier training job
import boto3

comprehend = boto3.client('comprehend')

response = comprehend.create_document_classifier(
    DocumentClassifierName='support-ticket-router',
    DataAccessRoleArn='arn:aws:iam::123456789012:role/ComprehendRole',
    InputDataConfig={
        'DataFormat': 'COMPREHEND_CSV',
        'S3Uri': 's3://my-bucket/training-data/tickets.csv'
    },
    OutputDataConfig={'S3Uri': 's3://my-bucket/output/'},
    LanguageCode='en',
    Mode='MULTI_CLASS'
)
print(response['DocumentClassifierArn'])
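Training runs asynchronously, so the ARN returned above is not usable for inference right away. A minimal polling sketch, assuming a one-minute interval and a cap on the number of polls (both values are illustrative), could look like this. The `sleep` parameter exists only so the wait can be replaced in tests.

```python
import time

# Statuses after which describe_document_classifier will not change further.
TERMINAL_STATUSES = {"TRAINED", "TRAINED_WITH_WARNING", "IN_ERROR", "STOPPED"}


def wait_for_classifier(client, classifier_arn, poll_seconds=60, max_polls=240,
                        sleep=time.sleep):
    """Poll describe_document_classifier until training reaches a terminal status."""
    for _ in range(max_polls):
        props = client.describe_document_classifier(
            DocumentClassifierArn=classifier_arn
        )["DocumentClassifierProperties"]
        if props["Status"] in TERMINAL_STATUSES:
            return props["Status"]
        sleep(poll_seconds)
    raise TimeoutError(f"classifier still training after {max_polls} polls")
```

A result of IN_ERROR usually means the training data or the IAM role needs attention; the `Message` field of the same response carries the reason.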

⚠️

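Real-time inference with a custom classifier or entity recognizer requires a deployed endpoint, which is billed for as long as it runs whether or not it receives traffic. For batch processing, use StartDocumentClassificationJob instead (or StartEntitiesDetectionJob with your recognizer's ARN) - much cheaper for high volumes.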
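The batch route mentioned above can be sketched as follows. The wrapper name `start_batch_classification` and the choice of one document per line are assumptions for this example; the S3 locations, role and classifier ARN are supplied by the caller.

```python
def start_batch_classification(client, classifier_arn, input_s3_uri,
                               output_s3_uri, role_arn, job_name):
    """Submit an asynchronous classification job and return its JobId."""
    response = client.start_document_classification_job(
        JobName=job_name,
        DocumentClassifierArn=classifier_arn,
        InputDataConfig={
            "S3Uri": input_s3_uri,
            # Assumes the input files hold one document per line.
            "InputFormat": "ONE_DOC_PER_LINE",
        },
        OutputDataConfig={"S3Uri": output_s3_uri},
        DataAccessRoleArn=role_arn,
    )
    return response["JobId"]
```

No endpoint is involved here: the job reads from S3, writes its predictions back to S3, and stops billing when it finishes.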

Batch Processing with Async Jobs

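Most Comprehend operations come in both a synchronous form (one document per call, or up to 25 with the BatchDetect* APIs) and an asynchronous form that reads a whole document set from S3. Topic modeling is asynchronous only. For large document sets, prefer the async jobs: the per-unit price is the same, but you avoid per-request throttling and client-side orchestration.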

Async Job Type	API Call	Input Format	Output
Entities	StartEntitiesDetectionJob	One document per line in S3	JSON per document in S3
Sentiment	StartSentimentDetectionJob	One document per line in S3	JSON per document in S3
Key phrases	StartKeyPhrasesDetectionJob	One document per line in S3	JSON per document in S3
Topic modeling	StartTopicsDetectionJob	Documents in S3 (TXT or CSV)	Topic-term matrix and document-topic mapping
PII Redaction	StartPiiEntitiesDetectionJob	One document per line	Redacted documents in S3
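As one example of the jobs in this table, the sketch below submits a PII redaction job that replaces each detected entity with its type label. The wrapper name `start_pii_redaction`, the default of redacting all PII types, and the English language code are assumptions chosen for illustration.

```python
def start_pii_redaction(client, input_s3_uri, output_s3_uri, role_arn,
                        job_name, pii_types=("ALL",)):
    """Submit an async PII job in redaction mode and return its JobId."""
    response = client.start_pii_entities_detection_job(
        JobName=job_name,
        InputDataConfig={"S3Uri": input_s3_uri, "InputFormat": "ONE_DOC_PER_LINE"},
        OutputDataConfig={"S3Uri": output_s3_uri},
        # ONLY_REDACTION writes redacted documents instead of entity offsets.
        Mode="ONLY_REDACTION",
        RedactionConfig={
            "PiiEntityTypes": list(pii_types),
            "MaskMode": "REPLACE_WITH_PII_ENTITY_TYPE",
        },
        DataAccessRoleArn=role_arn,
        LanguageCode="en",
    )
    return response["JobId"]
```

Passing a narrower tuple such as `("SSN", "CREDIT_DEBIT_NUMBER")` limits redaction to those types and leaves other text untouched.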

Topic modeling is unique - it is an unsupervised LDA-based algorithm that discovers hidden topics across a corpus of documents. You specify the number of topics (typically 10-100) and Comprehend returns the top terms per topic.
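Once a topic modeling job finishes, the top terms per topic can be read from the topic-term output. The sketch below assumes that file is a CSV with a header row followed by topic, term and weight columns in that order; the function name `top_terms_per_topic` and the default of five terms are choices made for this example.

```python
import csv
import io
from collections import defaultdict


def top_terms_per_topic(topic_terms_csv, limit=5):
    """Group topic-term rows into {topic_id: [terms, heaviest first]}."""
    topics = defaultdict(list)
    reader = csv.reader(io.StringIO(topic_terms_csv))
    next(reader, None)  # skip the header row
    for row in reader:
        if len(row) < 3:
            continue  # ignore blank or malformed lines
        topic, term, weight = row[:3]
        topics[int(topic)].append((float(weight), term.strip()))
    return {
        topic: [term for _, term in sorted(pairs, reverse=True)[:limit]]
        for topic, pairs in sorted(topics.items())
    }
```

Reading the top terms side by side is a quick way to judge whether the chosen number of topics is too low (unrelated terms lumped together) or too high (near-duplicate topics).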

Comprehend Pricing

Feature	Pricing
Synchronous NLP (entities, sentiment, etc.)	$0.0001 per unit (100 characters = 1 unit, 3 unit minimum)
Async NLP jobs	$0.0001 per unit (same rate, better for batches)
Custom classifier training	$3.00 per hour (billed per second)
Custom classifier real-time endpoint	$0.0005 per inference unit per second (about $1.80 per hour), billed while the endpoint is running
Custom entity recognizer training	$3.00 per hour
Topic modeling	$1.00 per job up to 1000 documents, then $0.001 per document

💡

The 3-unit minimum per API call means very short strings (under 300 characters) are charged for 300 characters. For high-volume classification of short texts, batch as many documents as possible per API call (up to 25 per batch).
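The effect of the minimum is easy to check with a little arithmetic. This sketch uses the $0.0001 rate and 3-unit minimum from the pricing table, assumes partial units round up, and ignores any volume discounts; the function names are hypothetical.

```python
import math

UNIT_CHARS = 100          # characters per billing unit
MIN_UNITS = 3             # minimum units charged per request
PRICE_PER_UNIT = 0.0001   # USD, the rate quoted in the pricing table


def billable_units(text_length):
    """Units charged for one document of `text_length` characters."""
    return max(MIN_UNITS, math.ceil(text_length / UNIT_CHARS))


def estimate_cost(text_lengths, price_per_unit=PRICE_PER_UNIT):
    """Estimated USD cost of running one feature over all documents."""
    return sum(billable_units(n) for n in text_lengths) * price_per_unit
```

Under these assumptions a 40-character tweet and a 300-character review both cost 3 units, and 10 million reviews of 250 characters each come to 30 million units, or roughly $3,000 per feature applied.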

🎯

Interview Focus Points

1. What is the difference between standard sentiment analysis and targeted sentiment in Comprehend?
2. How would you use Comprehend to build an automated document routing system for a support center?
3. When would you use Comprehend Custom Classifier vs Comprehend topic modeling?
4. What are the data format requirements for training a custom entity recognizer?
5. How does Comprehend PII detection compare to Macie for compliance use cases?
6. How would you design a pipeline to process 10 million customer reviews using Comprehend?
7. What is Comprehend topic modeling based on and what do you configure to get useful results?