Textract

Extract text, tables, and form fields from scanned documents automatically

Amazon Textract is a fully managed ML service that automatically extracts text, handwriting, tables, forms, and structured data from scanned documents, PDFs, and images. Unlike generic OCR, which returns only raw text, Textract understands document structure and can return form field key-value pairs and table cells with their positional relationships intact. For cloud engineers, Textract is the foundation of intelligent document processing pipelines that automate workflows involving invoices, contracts, medical records, and government forms.

Textract Feature Types - Choosing the Right Extraction Mode

Textract has several analysis modes, each targeting a different document structure type. Using the wrong feature type wastes API calls and misses structure.

Feature Type	What It Extracts	Extra Charge	Best For
TABLES	Table cells with row/column position	Yes	Invoices, spreadsheets, financial statements
FORMS	Key-value pairs (form field labels + values)	Yes	Application forms, questionnaires, tax documents
SIGNATURES	Signature presence detection	Yes	Contract validation, consent forms
LAYOUT	Document layout elements - titles, headers, sections	Yes	Complex reports, books, structured documents
QUERIES	Natural language questions answered from the document	Yes	Targeted extraction of specific fields without parsing all output

💡

Basic text extraction (Block objects of type PAGE, LINE, and WORD) is always included in the response. You pay extra only for TABLES, FORMS, SIGNATURES, LAYOUT, and QUERIES, so request only the feature types you need.
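
To make the FORMS row above concrete, here is a minimal sketch of how key-value pairs might be rebuilt from the flat list of blocks in a response. It assumes the documented block shape (KEY_VALUE_SET blocks whose EntityTypes is KEY or VALUE, linked through VALUE and CHILD relationships). The helper names `text_of` and `extract_key_values` and the sample data are hypothetical, for illustration only.

```python
def text_of(block, blocks_by_id):
    """Join the text of a block's CHILD blocks (words or checkboxes)."""
    parts = []
    for rel in block.get("Relationships", []):
        if rel["Type"] != "CHILD":
            continue
        for child_id in rel["Ids"]:
            child = blocks_by_id[child_id]
            if child["BlockType"] == "WORD":
                parts.append(child["Text"])
            elif child["BlockType"] == "SELECTION_ELEMENT":
                parts.append(child["SelectionStatus"])
    return " ".join(parts)


def extract_key_values(blocks):
    """Return {key text: value text} for every KEY block in the list."""
    blocks_by_id = {b["Id"]: b for b in blocks}
    pairs = {}
    for block in blocks:
        is_key = (block["BlockType"] == "KEY_VALUE_SET"
                  and "KEY" in block.get("EntityTypes", []))
        if not is_key:
            continue
        value_parts = []
        for rel in block.get("Relationships", []):
            if rel["Type"] == "VALUE":
                # Each id points at the VALUE half of the key-value set
                for value_id in rel["Ids"]:
                    value_parts.append(text_of(blocks_by_id[value_id], blocks_by_id))
        pairs[text_of(block, blocks_by_id)] = " ".join(value_parts)
    return pairs


# Hand-written sample in the shape of response['Blocks'] (not real output)
sample_blocks = [
    {"Id": "k1", "BlockType": "KEY_VALUE_SET", "EntityTypes": ["KEY"],
     "Relationships": [{"Type": "VALUE", "Ids": ["v1"]},
                       {"Type": "CHILD", "Ids": ["w1"]}]},
    {"Id": "v1", "BlockType": "KEY_VALUE_SET", "EntityTypes": ["VALUE"],
     "Relationships": [{"Type": "CHILD", "Ids": ["w2", "w3"]}]},
    {"Id": "w1", "BlockType": "WORD", "Text": "Name:"},
    {"Id": "w2", "BlockType": "WORD", "Text": "Jane"},
    {"Id": "w3", "BlockType": "WORD", "Text": "Doe"},
]

print(extract_key_values(sample_blocks))  # {'Name:': 'Jane Doe'}
```

With a real response, the same function would be called as `extract_key_values(response['Blocks'])` after requesting `FeatureTypes=['FORMS']`.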

Synchronous vs Asynchronous APIs

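Textract provides two sets of APIs: synchronous operations for single-page documents and asynchronous operations for multi-page PDF or TIFF files. Asynchronous operations read their input from S3 and return a JobId instead of the results.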

API	Max Pages	Max Size	Response	When to Use
DetectDocumentText	1 page	10 MB	Synchronous JSON	Single-page images, real-time OCR
AnalyzeDocument	1 page	10 MB	Synchronous JSON	Single-page forms, tables, queries
StartDocumentTextDetection	Up to 3,000 pages	500 MB	JobId + SNS/poll	Multi-page PDFs, bulk processing
StartDocumentAnalysis	Up to 3,000 pages	500 MB	JobId + SNS/poll	Multi-page forms, tables in PDFs
StartExpenseAnalysis	Up to 3,000 pages	500 MB	JobId + SNS/poll	Invoices, receipts - specialized expense fields
StartLendingAnalysis	Up to 3,000 pages	500 MB	JobId + SNS/poll	Mortgage, lending documents (1003, W-2, paystubs)
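
The table above amounts to a small decision rule. The sketch below encodes it for the general-purpose operations only, using the page and size limits exactly as listed in the table (treat them as assumptions and confirm them against the current service quotas). The function name `choose_textract_api` is hypothetical.

```python
# Limits copied from the table above; verify against current Textract quotas
SYNC_MAX_PAGES, SYNC_MAX_MB = 1, 10
ASYNC_MAX_PAGES, ASYNC_MAX_MB = 3000, 500


def choose_textract_api(page_count, size_mb, feature_types=()):
    """Pick an operation name from document size and requested features."""
    wants_analysis = bool(feature_types)  # e.g. ['TABLES', 'FORMS']
    if page_count <= SYNC_MAX_PAGES and size_mb <= SYNC_MAX_MB:
        return "AnalyzeDocument" if wants_analysis else "DetectDocumentText"
    if page_count <= ASYNC_MAX_PAGES and size_mb <= ASYNC_MAX_MB:
        if wants_analysis:
            return "StartDocumentAnalysis"
        return "StartDocumentTextDetection"
    raise ValueError("Document exceeds the async limits; split it first")


print(choose_textract_api(1, 2))                 # DetectDocumentText
print(choose_textract_api(1, 2, ["FORMS"]))      # AnalyzeDocument
print(choose_textract_api(200, 40, ["TABLES"]))  # StartDocumentAnalysis
```

Invoices and lending packages would bypass this rule and go to StartExpenseAnalysis or StartLendingAnalysis, as the last two rows of the table indicate.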

python

# Async multi-page PDF analysis with SNS notification
import boto3

textract = boto3.client('textract', region_name='us-east-1')

response = textract.start_document_analysis(
    DocumentLocation={
        'S3Object': {'Bucket': 'my-documents', 'Name': 'contract.pdf'}
    },
    FeatureTypes=['TABLES', 'FORMS'],
    NotificationChannel={
        'SNSTopicArn': 'arn:aws:sns:us-east-1:123456789012:textract-complete',
        'RoleArn': 'arn:aws:iam::123456789012:role/TextractSNSRole'
    },
    OutputConfig={
        'S3Bucket': 'my-output-bucket',
        'S3Prefix': 'textract-results/'
    }
)
job_id = response['JobId']
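
Starting the job returns only a JobId, so the results still have to be fetched. One possible way to do that, sketched below, is to poll `get_document_analysis` until the job leaves IN_PROGRESS and then follow NextToken to collect every page of blocks. The function name `collect_analysis_blocks` and its timing defaults are assumptions for illustration. In production the SNS notification would normally trigger the fetch instead of a polling loop.

```python
import time


def collect_analysis_blocks(client, job_id, poll_seconds=5.0, max_polls=120):
    """Wait for an async analysis job, then return all of its blocks."""
    for _ in range(max_polls):
        page = client.get_document_analysis(JobId=job_id)
        if page["JobStatus"] != "IN_PROGRESS":
            break
        time.sleep(poll_seconds)
    else:
        raise TimeoutError(f"Job {job_id} still running after {max_polls} polls")

    if page["JobStatus"] != "SUCCEEDED":
        detail = page.get("StatusMessage", "no status message")
        raise RuntimeError(f"Job {job_id} ended as {page['JobStatus']}: {detail}")

    # Results are paginated: keep requesting until no NextToken is returned
    blocks = list(page.get("Blocks", []))
    while page.get("NextToken"):
        page = client.get_document_analysis(
            JobId=job_id, NextToken=page["NextToken"]
        )
        blocks.extend(page.get("Blocks", []))
    return blocks
```

With the client and job from the example above, this would be called as `blocks = collect_analysis_blocks(textract, job_id)`. Taking the client as a parameter keeps the function easy to exercise with a stub.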

Queries - Natural Language Document Extraction

The Queries feature (AnalyzeDocument with QUERIES type) lets you ask natural language questions about a document and get the specific value extracted - without having to parse all FORM or TABLE blocks.

python

# Use Queries to extract specific fields without parsing all blocks
import boto3

textract = boto3.client('textract')

response = textract.analyze_document(
    Document={'S3Object': {'Bucket': 'invoices', 'Name': 'invoice-001.pdf'}},
    FeatureTypes=['QUERIES'],
    QueriesConfig={
        'Queries': [
            {'Text': 'What is the invoice number?', 'Alias': 'INVOICE_NUMBER'},
            {'Text': 'What is the total amount due?', 'Alias': 'TOTAL_AMOUNT'},
            {'Text': 'What is the invoice date?', 'Alias': 'INVOICE_DATE'},
            {'Text': 'What is the vendor name?', 'Alias': 'VENDOR_NAME'}
        ]
    }
)

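# Index blocks by Id so each QUERY block can be joined to its answer
blocks_by_id = {block['Id']: block for block in response['Blocks']}

for block in response['Blocks']:
    if block['BlockType'] != 'QUERY':
        continue
    # The alias lives on the QUERY block; fall back to the question text
    alias = block['Query'].get('Alias', block['Query']['Text'])
    # A QUERY block points at its QUERY_RESULT block(s) via an ANSWER relationship
    for rel in block.get('Relationships', []):
        if rel['Type'] != 'ANSWER':
            continue
        for answer_id in rel['Ids']:
            answer = blocks_by_id[answer_id]
            print(f"{alias}: {answer['Text']} (confidence {answer['Confidence']:.1f}%)")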

💡

Queries are often more accurate than FORMS for structured invoices and contracts because they use contextual document understanding rather than relying on spatial proximity of label and value. Check the Confidence score of each answer before trusting it downstream.

Textract Pricing

Feature	Price Per Page	Free Tier
Text detection (DetectDocumentText)	$0.0015	1,000 pages/month for 3 months
Document analysis - Forms + Tables	$0.015	1,000 pages/month for 3 months
Document analysis - Signatures	$0.015	1,000 pages/month for 3 months
Document analysis - Layout	$0.004	No free tier
Queries per page	$0.01 per query (max $0.015/page)	No free tier
Expense analysis	$0.01	No free tier
Lending document analysis	$0.03	No free tier

⚠️

Multi-page PDFs are billed per page, not per document. At the rate listed above, a 200-page contract analyzed with FORMS + TABLES costs 200 × $0.015 = $3.00. Rates vary by region and change over time, so confirm them on the AWS pricing page. For high-volume document processing, include the average page count per document in your cost model.
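
The per-page arithmetic above can be folded into a small estimator. This is a rough sketch: the rates are copied from the table in this section and are not authoritative, the free tier and any volume discounts are ignored, and the names `PRICE_PER_PAGE` and `estimate_cost` are hypothetical.

```python
from decimal import Decimal

# Per-page rates copied from the pricing table above (assumed, not authoritative)
PRICE_PER_PAGE = {
    "text_detection": Decimal("0.0015"),
    "forms_tables": Decimal("0.015"),
    "signatures": Decimal("0.015"),
    "layout": Decimal("0.004"),
    "expense": Decimal("0.01"),
    "lending": Decimal("0.03"),
}


def estimate_cost(pages, feature):
    """Return the cost in USD of processing `pages` pages with one feature."""
    if pages < 0:
        raise ValueError("pages must be non-negative")
    return PRICE_PER_PAGE[feature] * pages


def estimate_monthly_cost(documents, avg_pages_per_document, feature):
    """Monthly cost from document volume and average page count."""
    total_pages = documents * avg_pages_per_document
    return estimate_cost(total_pages, feature)


# The 200-page contract from the warning above
print(f"${estimate_cost(200, 'forms_tables'):.2f}")  # $3.00
```

Decimal is used instead of float so that sums over many small per-page charges do not pick up binary rounding error.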

🎯

Interview Focus Points

1. What is the difference between Textract and a standard OCR service?
2. When would you use AnalyzeDocument QUERIES vs FORMS for extracting fields from an invoice?
3. How does the async Textract API work? How do you get notified when a job completes?
4. How would you build a serverless invoice processing pipeline using Textract, Lambda, and DynamoDB?
5. What is the difference between StartExpenseAnalysis and StartDocumentAnalysis for receipts?
6. How does Textract handle handwritten text vs printed text?
7. What IAM permissions does Textract need to access S3 documents and publish SNS notifications?