Textract
Extract text, tables, and form fields from scanned documents automatically
Amazon Textract is a fully managed ML service that automatically extracts text, handwriting, tables, forms, and structured data from scanned documents, PDFs, and images - going far beyond simple OCR. Unlike generic OCR, Textract understands document structure and can return form field key-value pairs and table cells with their positional relationships intact. For cloud engineers, Textract is the foundation of intelligent document processing pipelines that automate workflows involving invoices, contracts, medical records, and government forms.
Textract Feature Types - Choosing the Right Extraction Mode
Textract has several analysis modes, each targeting a different document structure type. Using the wrong feature type wastes API calls and misses structure.
| Feature Type | What It Extracts | Extra Charge | Best For |
|---|---|---|---|
| TABLES | Table cells with row/column position | Yes | Invoices, spreadsheets, financial statements |
| FORMS | Key-value pairs (form field labels + values) | Yes | Application forms, questionnaires, tax documents |
| SIGNATURES | Signature presence detection | Yes | Contract validation, consent forms |
| LAYOUT | Document layout elements - titles, headers, sections | Yes | Complex reports, books, structured documents |
| QUERIES | Natural language questions answered from the document | Yes | Targeted extraction of specific fields without parsing all output |
Basic text extraction (PAGE, LINE, and WORD blocks) is always included in the response. You pay extra only for the feature types you request (TABLES, FORMS, SIGNATURES, LAYOUT, QUERIES), so request only the ones you need.
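When FORMS is requested, the key-value pairs come back as KEY_VALUE_SET blocks linked by relationships, not as a ready-made dictionary. The following is a minimal sketch of joining them, assuming the block shape Textract documents for FORMS output. The helper names `block_text` and `extract_key_values` and the sample blocks are illustrative and not part of boto3.

```python
def block_text(block, blocks_by_id):
    """Join the text of a block's CHILD words and selection marks."""
    parts = []
    for rel in block.get("Relationships", []):
        if rel["Type"] != "CHILD":
            continue
        for child_id in rel["Ids"]:
            child = blocks_by_id[child_id]
            if child["BlockType"] == "WORD":
                parts.append(child["Text"])
            elif child["BlockType"] == "SELECTION_ELEMENT":
                # Checkboxes report SELECTED or NOT_SELECTED
                parts.append(child["SelectionStatus"])
    return " ".join(parts)


def extract_key_values(blocks):
    """Map each form label to its value using KEY -> VALUE relationships."""
    blocks_by_id = {b["Id"]: b for b in blocks}
    pairs = {}
    for block in blocks:
        is_key = (block["BlockType"] == "KEY_VALUE_SET"
                  and "KEY" in block.get("EntityTypes", []))
        if not is_key:
            continue
        key = block_text(block, blocks_by_id)
        values = []
        for rel in block.get("Relationships", []):
            if rel["Type"] == "VALUE":
                for value_id in rel["Ids"]:
                    values.append(block_text(blocks_by_id[value_id], blocks_by_id))
        pairs[key] = " ".join(v for v in values if v)
    return pairs


# Hand-written sample in the shape of a FORMS response (not real output)
sample_blocks = [
    {"Id": "k1", "BlockType": "KEY_VALUE_SET", "EntityTypes": ["KEY"],
     "Relationships": [{"Type": "VALUE", "Ids": ["v1"]},
                       {"Type": "CHILD", "Ids": ["w1", "w2"]}]},
    {"Id": "v1", "BlockType": "KEY_VALUE_SET", "EntityTypes": ["VALUE"],
     "Relationships": [{"Type": "CHILD", "Ids": ["w3", "w4"]}]},
    {"Id": "w1", "BlockType": "WORD", "Text": "Full"},
    {"Id": "w2", "BlockType": "WORD", "Text": "Name:"},
    {"Id": "w3", "BlockType": "WORD", "Text": "Jane"},
    {"Id": "w4", "BlockType": "WORD", "Text": "Doe"},
]

print(extract_key_values(sample_blocks))  # {'Full Name:': 'Jane Doe'}
```

In a real pipeline, `sample_blocks` would be `response['Blocks']` from a call that included `FORMS` in `FeatureTypes`.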
Synchronous vs Asynchronous APIs
Textract provides two sets of APIs - synchronous for small single-page documents and asynchronous for multi-page PDFs.
| API | Max Pages | Max Size | Response | When to Use |
|---|---|---|---|---|
| DetectDocumentText | 1 page | 10 MB | Synchronous JSON | Single-page images, real-time OCR |
| AnalyzeDocument | 1 page | 10 MB | Synchronous JSON | Single-page forms, tables, queries |
| StartDocumentTextDetection | Up to 3,000 pages | 500 MB | JobId + SNS/poll | Multi-page PDFs, bulk processing |
| StartDocumentAnalysis | Up to 3,000 pages | 500 MB | JobId + SNS/poll | Multi-page forms, tables in PDFs |
| StartExpenseAnalysis | Up to 3,000 pages | 500 MB | JobId + SNS/poll | Invoices, receipts - specialized expense fields |
| StartLendingAnalysis | Up to 3,000 pages | 500 MB | JobId + SNS/poll | Mortgage, lending documents (1003, W-2, paystubs) |
# Async multi-page PDF analysis with SNS notification
import boto3

textract = boto3.client('textract', region_name='us-east-1')

response = textract.start_document_analysis(
    DocumentLocation={
        'S3Object': {'Bucket': 'my-documents', 'Name': 'contract.pdf'}
    },
    FeatureTypes=['TABLES', 'FORMS'],
    NotificationChannel={
        'SNSTopicArn': 'arn:aws:sns:us-east-1:123456789012:textract-complete',
        'RoleArn': 'arn:aws:iam::123456789012:role/TextractSNSRole'
    },
    OutputConfig={
        'S3Bucket': 'my-output-bucket',
        'S3Prefix': 'textract-results/'
    }
)
job_id = response['JobId']
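Starting the job is only half of the async flow: the JobId has to be picked up again when the SNS notification arrives, and the results are paginated. The sketch below shows one way to do both, assuming a Lambda function subscribed to the SNS topic. The function names and the `FakeTextract` stand-in are hypothetical and exist only so the example runs without AWS credentials; in practice you would pass `boto3.client('textract')`.

```python
import json


def parse_textract_notification(sns_event):
    """Return (job_id, status) from a Lambda event delivered by SNS."""
    # SNS delivers the Textract completion message as a JSON string
    message = json.loads(sns_event["Records"][0]["Sns"]["Message"])
    return message["JobId"], message["Status"]


def fetch_all_blocks(textract_client, job_id):
    """Collect Blocks from every page of GetDocumentAnalysis results."""
    blocks = []
    kwargs = {"JobId": job_id}
    while True:
        page = textract_client.get_document_analysis(**kwargs)
        if page["JobStatus"] != "SUCCEEDED":
            raise RuntimeError(f"Job {job_id} is {page['JobStatus']}")
        blocks.extend(page["Blocks"])
        next_token = page.get("NextToken")
        if not next_token:
            return blocks
        kwargs["NextToken"] = next_token  # request the next page of results


class FakeTextract:
    """Offline stand-in for the Textract client (illustration only)."""

    def __init__(self):
        self.pages = {
            None: {"JobStatus": "SUCCEEDED", "Blocks": [{"Id": "a"}],
                   "NextToken": "t1"},
            "t1": {"JobStatus": "SUCCEEDED", "Blocks": [{"Id": "b"}]},
        }

    def get_document_analysis(self, JobId, NextToken=None):
        return self.pages[NextToken]


# Hand-built event in the shape Lambda receives from SNS
event = {"Records": [{"Sns": {"Message": json.dumps(
    {"JobId": "job-123", "Status": "SUCCEEDED"})}}]}

job_id, status = parse_textract_notification(event)
all_blocks = fetch_all_blocks(FakeTextract(), job_id) if status == "SUCCEEDED" else []
print(job_id, status, len(all_blocks))
```

This sketch treats any status other than SUCCEEDED as an error. A production handler would likely distinguish FAILED from PARTIAL_SUCCESS and route failures to a dead-letter queue.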
Queries - Natural Language Document Extraction
The Queries feature (AnalyzeDocument with QUERIES type) lets you ask natural language questions about a document and get the specific value extracted - without having to parse all FORM or TABLE blocks.
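# Use Queries to extract specific fields without parsing all blocks
import boto3

textract = boto3.client('textract')

# The synchronous API accepts single-page documents only
response = textract.analyze_document(
    Document={'S3Object': {'Bucket': 'invoices', 'Name': 'invoice-001.pdf'}},
    FeatureTypes=['QUERIES'],
    QueriesConfig={
        'Queries': [
            {'Text': 'What is the invoice number?', 'Alias': 'INVOICE_NUMBER'},
            {'Text': 'What is the total amount due?', 'Alias': 'TOTAL_AMOUNT'},
            {'Text': 'What is the invoice date?', 'Alias': 'INVOICE_DATE'},
            {'Text': 'What is the vendor name?', 'Alias': 'VENDOR_NAME'}
        ]
    }
)

# Index blocks by Id so each QUERY can be joined to its QUERY_RESULT
blocks_by_id = {block['Id']: block for block in response['Blocks']}

for block in response['Blocks']:
    if block['BlockType'] != 'QUERY':
        continue
    # The alias lives on the QUERY block, not on the QUERY_RESULT block
    alias = block['Query'].get('Alias', block['Query']['Text'])
    for rel in block.get('Relationships', []):
        if rel['Type'] != 'ANSWER':
            continue
        for answer_id in rel['Ids']:
            answer = blocks_by_id[answer_id]
            print(f"{alias}: {answer.get('Text', '')} "
                  f"(confidence: {answer['Confidence']:.1f}%)")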
Queries can be more accurate than FORMS for structured invoices and contracts because they use contextual document understanding rather than relying on the spatial proximity of label and value.
Textract Pricing
| Feature | Price Per Page | Free Tier |
|---|---|---|
| Text detection (DetectDocumentText) | $0.0015 | 1,000 pages/month for 3 months |
| Document analysis - Tables | $0.015 | 100 pages/month for 3 months |
| Document analysis - Forms | $0.05 | 100 pages/month for 3 months |
| Document analysis - Forms + Tables | $0.065 | 100 pages/month for 3 months |
| Document analysis - Signatures | $0.0035 | 1,000 pages/month for 3 months |
| Document analysis - Layout | $0.004 | No free tier |
| Document analysis - Queries | $0.015 per page (not per query) | No free tier |
| Expense analysis | $0.01 | No free tier |
| Lending document analysis | $0.07 | No free tier |
Prices are list prices for the first 1 million pages per month; rates vary by region and drop at higher volumes. Multi-page PDFs are billed per page, not per document. A 200-page contract analyzed with FORMS + TABLES costs 200 × $0.065 = $13.00. For high-volume document processing, measure the average page count of your documents and build it into your cost model.
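Per-page billing makes cost estimation simple multiplication. Here is a small sketch, assuming a flat per-page rate and ignoring the free tier and volume discounts. The function name `monthly_cost` and the workload figures are made up for illustration; the rate is the $0.01 per page listed for expense analysis in the table above.

```python
from decimal import Decimal


def monthly_cost(documents_per_month, avg_pages_per_document, price_per_page):
    """Estimate monthly spend in dollars for a flat per-page rate.

    price_per_page is passed as a string (e.g. "0.01") so the arithmetic
    stays exact instead of accumulating float rounding error.
    """
    pages = Decimal(documents_per_month) * Decimal(str(avg_pages_per_document))
    return pages * Decimal(price_per_page)


# Hypothetical workload: 5,000 receipts a month, 3 pages each on average,
# processed with expense analysis at $0.01 per page.
estimate = monthly_cost(5_000, 3, "0.01")
print(f"${estimate:,.2f}")  # $150.00
```

Because the average page count multiplies directly into the total, an estimate based on document count alone can be off by the same factor.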
Interview Focus Points
1. What is the difference between Textract and a standard OCR service?
2. When would you use AnalyzeDocument QUERIES vs FORMS for extracting fields from an invoice?
3. How does the async Textract API work? How do you get notified when a job completes?
4. How would you build a serverless invoice processing pipeline using Textract, Lambda, and DynamoDB?
5. What is the difference between StartExpenseAnalysis and StartDocumentAnalysis for receipts?
6. How does Textract handle handwritten text vs printed text?
7. What IAM permissions does Textract need to access S3 documents and publish SNS notifications?