Ace Cloud Interviews

AWS AI & Machine Learning

Comprehend

NLP service for entity recognition, sentiment analysis, and topic modeling

Amazon Comprehend is a fully managed NLP service that uses machine learning to find insights and relationships in text - including entities, key phrases, sentiment, language, PII, and topics. It works on any text without requiring you to train or maintain models. For cloud engineers, Comprehend is the go-to service for adding language intelligence to document processing pipelines, customer feedback analysis, and compliance workflows.

Built-in NLP Capabilities

Comprehend provides a rich set of pre-trained NLP capabilities accessible via API calls. No ML training required for any of these.
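As a rough illustration of what "accessible via API calls" looks like in practice, the sketch below chains two synchronous boto3 calls: it detects the dominant language, then runs sentiment analysis in that language. The helper name `analyze_text` is hypothetical, and the client is passed in as an argument so the function can be exercised without AWS access. It assumes the detected language is one that DetectSentiment supports.

```python
def analyze_text(client, text):
    """Detect the dominant language, then run sentiment analysis in that language."""
    languages = client.detect_dominant_language(Text=text)["Languages"]
    # Pick the language Comprehend is most confident about.
    top = max(languages, key=lambda lang: lang["Score"])
    sentiment = client.detect_sentiment(Text=text, LanguageCode=top["LanguageCode"])
    return {
        "language": top["LanguageCode"],
        "sentiment": sentiment["Sentiment"],     # POSITIVE, NEGATIVE, NEUTRAL or MIXED
        "scores": sentiment["SentimentScore"],   # per-label confidence scores
    }


# Usage (requires AWS credentials and a configured region):
#   import boto3
#   comprehend = boto3.client("comprehend")
#   print(analyze_text(comprehend, "The delivery was fast but the box arrived damaged."))
```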

Feature | API | What It Returns
Entity Recognition | DetectEntities | Named entities: PERSON, ORGANIZATION, LOCATION, DATE, QUANTITY, TITLE, COMMERCIAL_ITEM, EVENT, OTHER
Key Phrase Extraction | DetectKeyPhrases | Noun phrases that are most meaningful to the content
Sentiment Analysis | DetectSentiment | POSITIVE, NEGATIVE, NEUTRAL, MIXED with confidence scores
Language Detection | DetectDominantLanguage | RFC 5646 language code (for example "en") and confidence score, from 100+ languages
PII Detection | DetectPiiEntities | PII types: NAME, ADDRESS, SSN, CREDIT_DEBIT_NUMBER, EMAIL, PHONE, etc., with character offsets
PII Redaction | StartPiiEntitiesDetectionJob with Mode=ONLY_REDACTION (async) | Document with PII replaced by the entity type label or a mask character
Targeted Sentiment | DetectTargetedSentiment | Sentiment per mentioned entity in the text (not just overall)
Syntax Analysis | DetectSyntax | Part-of-speech tags for each token
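The synchronous DetectPiiEntities call returns entity types and character offsets rather than redacted text, so redacting a single document in real time has to happen client-side. A minimal sketch of that step follows; the helper name `redact_pii` and the `min_score` threshold are illustrative assumptions, not part of the Comprehend API.

```python
def redact_pii(text, entities, min_score=0.5):
    """Replace each detected PII span with its type label, e.g. [EMAIL]."""
    redacted = text
    # Work right to left so earlier offsets stay valid after each replacement.
    for ent in sorted(entities, key=lambda e: e["BeginOffset"], reverse=True):
        if ent["Score"] < min_score:
            continue  # skip low-confidence detections
        redacted = (
            redacted[: ent["BeginOffset"]]
            + f"[{ent['Type']}]"
            + redacted[ent["EndOffset"]:]
        )
    return redacted


# Usage (requires AWS credentials):
#   import boto3
#   client = boto3.client("comprehend")
#   text = "Email jane@example.com now"
#   entities = client.detect_pii_entities(Text=text, LanguageCode="en")["Entities"]
#   print(redact_pii(text, entities))
```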
💡

Targeted sentiment is particularly valuable for product reviews - you can learn that a customer feels POSITIVE about delivery speed but NEGATIVE about product quality in the same review.
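To make the delivery-versus-quality example concrete, here is a sketch that collapses a DetectTargetedSentiment response into a per-target summary. The function name `sentiment_by_target` is hypothetical; the response fields it reads (`Entities`, `Mentions`, `DescriptiveMentionIndex`, `MentionSentiment`) follow the documented response shape.

```python
def sentiment_by_target(response):
    """Map each target (e.g. 'delivery') to the sentiments expressed about it."""
    summary = {}
    for entity in response["Entities"]:
        mentions = entity["Mentions"]
        # DescriptiveMentionIndex points at the mention(s) that best name the
        # entity, so pronouns such as "it" are grouped under the real name.
        name_index = (entity.get("DescriptiveMentionIndex") or [0])[0]
        name = mentions[name_index]["Text"].lower()
        sentiments = [m["MentionSentiment"]["Sentiment"] for m in mentions]
        summary.setdefault(name, []).extend(sentiments)
    return summary


# Usage (requires AWS credentials):
#   import boto3
#   client = boto3.client("comprehend")
#   resp = client.detect_targeted_sentiment(
#       Text="Delivery was quick, but the blender broke after a week.",
#       LanguageCode="en",
#   )
#   print(sentiment_by_target(resp))
```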

Comprehend Custom - Training Your Own Classifiers and NER Models

Comprehend Custom lets you train text classification and named entity recognition models using your own labeled data. It uses transfer learning on top of Comprehend's base language models, so you need relatively little training data.

Custom Feature | Minimum Training Data | Use Case Example
Custom Classifier (multi-class) | 10 examples per class, at least 2 classes | Route support tickets to the right team (billing, technical, returns)
Custom Classifier (multi-label) | 50 labeled documents per label | Tag documents with multiple categories simultaneously
Custom Entity Recognizer | 100 annotations per entity type, or an entity list | Recognize product SKUs, internal codes, medical terminology
python
# Start a custom classifier training job
import boto3

comprehend = boto3.client('comprehend')

response = comprehend.create_document_classifier(
    DocumentClassifierName='support-ticket-router',
    # IAM role that lets Comprehend read the training data and write output
    DataAccessRoleArn='arn:aws:iam::123456789012:role/ComprehendRole',
    InputDataConfig={
        'DataFormat': 'COMPREHEND_CSV',
        'S3Uri': 's3://my-bucket/training-data/tickets.csv'
    },
    OutputDataConfig={'S3Uri': 's3://my-bucket/output/'},
    LanguageCode='en',
    Mode='MULTI_CLASS'  # one label per document; 'MULTI_LABEL' allows several
)
# Training runs asynchronously; the ARN identifies the classifier being trained
print(response['DocumentClassifierArn'])
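The `COMPREHEND_CSV` format used above expects, in multi-class mode, a headerless two-column file with the class label first and the document text second, one document per row. The sketch below builds such a file in memory; the helper name `build_training_csv` and the sample tickets are illustrative.

```python
import csv
import io


def build_training_csv(examples):
    """Serialize (label, text) pairs as a headerless two-column CSV string."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    for label, text in examples:
        # One document per row: collapse internal newlines and extra whitespace.
        writer.writerow([label, " ".join(text.split())])
    return buf.getvalue()


tickets = [
    ("BILLING", "I was charged\ntwice this month"),
    ("RETURNS", "Refund, please"),
]
print(build_training_csv(tickets))
```

The resulting string can be uploaded to the S3 location referenced by `InputDataConfig`.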
⚠️

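Real-time inference with a custom classifier or entity recognizer requires a deployed endpoint, which is billed for as long as it is running, whether or not it receives requests. For batch processing, use StartDocumentClassificationJob (or StartEntitiesDetectionJob with a custom recognizer ARN) instead; it is much cheaper for high volumes because you pay only for the text processed.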
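A minimal sketch of the batch route with a trained classifier, using StartDocumentClassificationJob so that no endpoint is needed. The function name, job name, and S3 locations are placeholders chosen for illustration.

```python
def start_batch_classification(client, classifier_arn, input_uri, output_uri, role_arn):
    """Submit an asynchronous classification job and return its job ID."""
    response = client.start_document_classification_job(
        JobName="ticket-routing-batch",  # hypothetical job name
        DocumentClassifierArn=classifier_arn,
        InputDataConfig={"S3Uri": input_uri, "InputFormat": "ONE_DOC_PER_LINE"},
        OutputDataConfig={"S3Uri": output_uri},
        DataAccessRoleArn=role_arn,
    )
    return response["JobId"]


# Usage (requires AWS credentials and a trained classifier):
#   import boto3
#   job_id = start_batch_classification(
#       boto3.client("comprehend"),
#       classifier_arn="arn:aws:comprehend:...",   # ARN returned by training
#       input_uri="s3://my-bucket/tickets-to-route/",
#       output_uri="s3://my-bucket/routing-output/",
#       role_arn="arn:aws:iam::123456789012:role/ComprehendRole",
#   )
```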

Batch Processing with Async Jobs

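Most Comprehend operations have both a synchronous variant (one document per call, or up to 25 with the Batch* operations) and an asynchronous variant that reads a whole document set from S3; topic modeling is asynchronous only. For large document sets, prefer the asynchronous jobs: the per-unit price is the same, but you avoid per-request throttling and client-side orchestration.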

Async Job Type | API Call | Input Format | Output
Entities | StartEntitiesDetectionJob | One document per line in S3 | JSON per document in S3
Sentiment | StartSentimentDetectionJob | One document per line in S3 | JSON per document in S3
Key phrases | StartKeyPhrasesDetectionJob | One document per line in S3 | JSON per document in S3
Topic modeling | StartTopicsDetectionJob | Documents in S3 (TXT or CSV) | Topic-term matrix and document-topic mapping
PII Redaction | StartPiiEntitiesDetectionJob | One document per line in S3 | Redacted documents in S3
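The async jobs in the table share the same lifecycle: start the job, then poll its Describe call until it reaches a terminal state. A sketch for the sentiment job follows; the function name, job name, and polling interval are assumptions, and `sleep` is injectable only so the loop can be tested without waiting.

```python
import time

TERMINAL_STATES = {"COMPLETED", "FAILED", "STOPPED"}


def run_sentiment_job(client, input_uri, output_uri, role_arn,
                      poll_seconds=30, sleep=time.sleep):
    """Start an async sentiment job and block until it reaches a terminal state."""
    job_id = client.start_sentiment_detection_job(
        JobName="reviews-sentiment",  # hypothetical job name
        InputDataConfig={"S3Uri": input_uri, "InputFormat": "ONE_DOC_PER_LINE"},
        OutputDataConfig={"S3Uri": output_uri},
        DataAccessRoleArn=role_arn,
        LanguageCode="en",
    )["JobId"]
    while True:
        props = client.describe_sentiment_detection_job(JobId=job_id)[
            "SentimentDetectionJobProperties"
        ]
        if props["JobStatus"] in TERMINAL_STATES:
            return props  # includes the output location when the job completed
        sleep(poll_seconds)


# Usage (requires AWS credentials):
#   import boto3
#   props = run_sentiment_job(
#       boto3.client("comprehend"),
#       "s3://my-bucket/reviews/", "s3://my-bucket/sentiment-output/",
#       "arn:aws:iam::123456789012:role/ComprehendRole",
#   )
```

In a production pipeline the polling loop would more likely be replaced by a scheduled check or a workflow service, but the start-then-describe pattern stays the same.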

Topic modeling is unique - it is an unsupervised LDA-based algorithm that discovers hidden topics across a corpus of documents. You specify the number of topics (typically 10-100) and Comprehend returns the top terms per topic.
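Once a topics job finishes, the topic-term output can be summarized with a few lines of Python. The sketch below assumes the output file is a CSV with a header row followed by topic, term, and weight columns in that order; the function name `top_terms_per_topic` and the sample rows are illustrative.

```python
import csv
import io
from collections import defaultdict


def top_terms_per_topic(topic_terms_csv, limit=5):
    """Group terms by topic ID, highest weight first."""
    topics = defaultdict(list)
    rows = csv.reader(io.StringIO(topic_terms_csv))
    next(rows, None)  # skip the header row (assumed: topic, term, weight)
    for row in rows:
        if len(row) != 3:
            continue  # ignore blank or malformed lines
        topic, term, weight = row
        topics[topic.strip()].append((float(weight), term.strip()))
    return {
        topic: [term for _, term in sorted(pairs, reverse=True)[:limit]]
        for topic, pairs in topics.items()
    }


sample = "topic,term,weight\n000,refund,0.12\n000,invoice,0.30\n001,login,0.25\n"
print(top_terms_per_topic(sample))

# The job itself is started with, for example:
#   client.start_topics_detection_job(
#       InputDataConfig={"S3Uri": "s3://my-bucket/docs/", "InputFormat": "ONE_DOC_PER_FILE"},
#       OutputDataConfig={"S3Uri": "s3://my-bucket/topics-output/"},
#       DataAccessRoleArn="arn:aws:iam::123456789012:role/ComprehendRole",
#       NumberOfTopics=20,
#   )
```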

Comprehend Pricing

Feature | Pricing
Synchronous NLP (entities, sentiment, etc.) | $0.0001 per unit (100 characters = 1 unit, 3-unit minimum per request)
Async NLP jobs | $0.0001 per unit (same rate, better suited to batches)
Custom classifier training | $3.00 per hour (billed per second)
Custom classifier real-time endpoint | $0.0005 per inference unit per second while the endpoint is running, plus $0.50 per month per custom model for model management
Custom entity recognizer training | $3.00 per hour
Topic modeling | $1.00 flat per job for the first 100 MB, then $0.004 per MB
💡

The 3-unit minimum per request means very short strings (under 300 characters) are billed as 300 characters. For high volumes of short texts, send as many documents as possible per API call through the Batch* operations such as BatchDetectSentiment (up to 25 documents per call).
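The effect of the minimum is easy to check with arithmetic. The sketch below uses the $0.0001 per-unit rate and the 100-character unit size from the pricing table; the function names are illustrative.

```python
import math

UNIT_CHARS = 100         # 100 characters = 1 unit
MIN_UNITS = 3            # minimum charge per request
PRICE_PER_UNIT = 0.0001  # USD, rate quoted in the pricing table


def billed_units(text_length):
    """Units charged for one request of the given character length."""
    return max(MIN_UNITS, math.ceil(text_length / UNIT_CHARS))


def estimate_cost(lengths):
    """Estimated USD cost for one request per document length."""
    return sum(billed_units(n) for n in lengths) * PRICE_PER_UNIT


# A 40-character tweet and a 300-character review cost the same.
print(billed_units(40), billed_units(300), billed_units(550))
```

Under these rates, one million 40-character texts are billed as 3 million units, or about $300, the same as one million 300-character texts.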

🎯

Interview Focus Points

  1. What is the difference between standard sentiment analysis and targeted sentiment in Comprehend?
  2. How would you use Comprehend to build an automated document routing system for a support center?
  3. When would you use Comprehend Custom Classifier vs Comprehend topic modeling?
  4. What are the data format requirements for training a custom entity recognizer?
  5. How does Comprehend PII detection compare to Macie for compliance use cases?
  6. How would you design a pipeline to process 10 million customer reviews using Comprehend?
  7. What is Comprehend topic modeling based on and what do you configure to get useful results?