Ace Cloud Interviews
🤖

AWS AI & Machine Learning

Textract

Extract text, tables, and form fields from scanned documents automatically

Amazon Textract is a fully managed ML service that extracts text, handwriting, tables, forms, and structured data from scanned documents, PDFs, and images. Unlike generic OCR, Textract understands document structure and can return form field key-value pairs and table cells with their positional relationships intact. For cloud engineers, Textract is the foundation of intelligent document processing pipelines that automate workflows involving invoices, contracts, medical records, and government forms.

Textract Feature Types - Choosing the Right Extraction Mode

Textract has several analysis modes, each targeting a different document structure type. Using the wrong feature type wastes API calls and misses structure.

| Feature Type | What It Extracts | Extra Charge | Best For |
|---|---|---|---|
| TABLES | Table cells with row/column position | Yes | Invoices, spreadsheets, financial statements |
| FORMS | Key-value pairs (form field labels + values) | Yes | Application forms, questionnaires, tax documents |
| SIGNATURES | Signature presence detection | Yes | Contract validation, consent forms |
| LAYOUT | Document layout elements - titles, headers, sections | Yes | Complex reports, books, structured documents |
| QUERIES | Natural language questions answered from the document | Yes | Targeted extraction of specific fields without parsing all output |
💡

Basic text extraction (Block objects of type LINE and WORD) is always included in the response. You pay extra only for TABLES, FORMS, SIGNATURES, LAYOUT, and QUERIES, so request only the feature types you need.
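To make the FORMS output concrete, here is a minimal sketch of joining KEY and VALUE blocks into label/value pairs. The helper names `text_of` and `form_fields` are hypothetical, and `SAMPLE_BLOCKS` is a hand-built fixture trimmed to the fields the code reads, not real Textract output. The sketch assumes each value is made of WORD children and ignores checkboxes (SELECTION_ELEMENT blocks).

```python
def text_of(block, blocks_by_id):
    """Join the text of a block's CHILD blocks that are WORDs."""
    words = []
    for rel in block.get("Relationships", []):
        if rel["Type"] == "CHILD":
            for child_id in rel["Ids"]:
                child = blocks_by_id[child_id]
                if child["BlockType"] == "WORD":
                    words.append(child["Text"])
    return " ".join(words)


def form_fields(blocks):
    """Return {label: value} for every KEY block in a FORMS response."""
    blocks_by_id = {b["Id"]: b for b in blocks}
    fields = {}
    for block in blocks:
        is_key = (block["BlockType"] == "KEY_VALUE_SET"
                  and "KEY" in block.get("EntityTypes", []))
        if not is_key:
            continue
        value_text = ""
        # A KEY block points at its VALUE block through a VALUE relationship
        for rel in block.get("Relationships", []):
            if rel["Type"] == "VALUE":
                value_text = " ".join(
                    text_of(blocks_by_id[i], blocks_by_id) for i in rel["Ids"]
                )
        fields[text_of(block, blocks_by_id)] = value_text
    return fields


# Hand-built sample for illustration only (not real Textract output)
SAMPLE_BLOCKS = [
    {"Id": "k1", "BlockType": "KEY_VALUE_SET", "EntityTypes": ["KEY"],
     "Relationships": [{"Type": "VALUE", "Ids": ["v1"]},
                       {"Type": "CHILD", "Ids": ["w1", "w2"]}]},
    {"Id": "v1", "BlockType": "KEY_VALUE_SET", "EntityTypes": ["VALUE"],
     "Relationships": [{"Type": "CHILD", "Ids": ["w3"]}]},
    {"Id": "w1", "BlockType": "WORD", "Text": "Invoice"},
    {"Id": "w2", "BlockType": "WORD", "Text": "Number:"},
    {"Id": "w3", "BlockType": "WORD", "Text": "INV-1042"},
]

print(form_fields(SAMPLE_BLOCKS))
```

In a real pipeline the `blocks` argument would be `response['Blocks']` from an AnalyzeDocument call made with `FeatureTypes=['FORMS']`.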

Synchronous vs Asynchronous APIs

Textract provides two sets of APIs - synchronous for small single-page documents and asynchronous for multi-page PDFs.

| API | Max Pages | Max Size | Response | When to Use |
|---|---|---|---|---|
| DetectDocumentText | 1 page | 10 MB | Synchronous JSON | Single-page images, real-time OCR |
| AnalyzeDocument | 1 page | 10 MB | Synchronous JSON | Single-page forms, tables, queries |
| StartDocumentTextDetection | Up to 3,000 pages | 500 MB | JobId + SNS/poll | Multi-page PDFs, bulk processing |
| StartDocumentAnalysis | Up to 3,000 pages | 500 MB | JobId + SNS/poll | Multi-page forms, tables in PDFs |
| StartExpenseAnalysis | Up to 3,000 pages | 500 MB | JobId + SNS/poll | Invoices, receipts - specialized expense fields |
| StartLendingAnalysis | Up to 3,000 pages | 500 MB | JobId + SNS/poll | Mortgage, lending documents (1003, W-2, paystubs) |
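The limits in the table can be folded into a small routing helper. This is only a sketch: `choose_api` and its parameters are hypothetical names, the thresholds are copied from the table, and the specialized expense and lending APIs are left out for brevity.

```python
# Limits copied from the table above
SYNC_MAX_PAGES = 1
SYNC_MAX_MB = 10
ASYNC_MAX_PAGES = 3000
ASYNC_MAX_MB = 500


def choose_api(pages, size_mb, needs_structure):
    """Pick a Textract operation name from page count, file size, and output needs.

    needs_structure is True when tables, forms, or queries are required,
    False when plain text lines and words are enough.
    """
    if pages > ASYNC_MAX_PAGES or size_mb > ASYNC_MAX_MB:
        raise ValueError("document exceeds the limits listed in the table")
    if pages <= SYNC_MAX_PAGES and size_mb <= SYNC_MAX_MB:
        return "AnalyzeDocument" if needs_structure else "DetectDocumentText"
    return "StartDocumentAnalysis" if needs_structure else "StartDocumentTextDetection"


print(choose_api(pages=1, size_mb=2, needs_structure=True))
print(choose_api(pages=40, size_mb=12, needs_structure=False))
```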
python
# Async multi-page PDF analysis with SNS notification
import boto3

textract = boto3.client('textract', region_name='us-east-1')

response = textract.start_document_analysis(
    DocumentLocation={
        'S3Object': {'Bucket': 'my-documents', 'Name': 'contract.pdf'}
    },
    FeatureTypes=['TABLES', 'FORMS'],
    NotificationChannel={
        'SNSTopicArn': 'arn:aws:sns:us-east-1:123456789012:textract-complete',
        'RoleArn': 'arn:aws:iam::123456789012:role/TextractSNSRole'
    },
    OutputConfig={
        'S3Bucket': 'my-output-bucket',
        'S3Prefix': 'textract-results/'
    }
)
job_id = response['JobId']
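Once the job finishes, Textract publishes a completion message to the SNS topic, and the results are read back page by page with GetDocumentAnalysis. The following is a minimal sketch of both steps, assuming a Lambda function subscribed to the topic. The function names `parse_completion` and `collect_blocks` are hypothetical, and the Textract client is passed in so the logic can be exercised without AWS credentials.

```python
import json


def parse_completion(sns_event):
    """Extract (job_id, status) from an SNS-triggered Lambda event."""
    # The Textract notification is a JSON string inside the SNS record
    message = json.loads(sns_event["Records"][0]["Sns"]["Message"])
    return message["JobId"], message["Status"]


def collect_blocks(client, job_id):
    """Page through GetDocumentAnalysis and return every block of a finished job."""
    blocks, token = [], None
    while True:
        kwargs = {"JobId": job_id}
        if token:
            kwargs["NextToken"] = token
        page = client.get_document_analysis(**kwargs)
        if page["JobStatus"] != "SUCCEEDED":
            raise RuntimeError(f"job {job_id} is {page['JobStatus']}")
        blocks.extend(page.get("Blocks", []))
        # Large results are split across pages linked by NextToken
        token = page.get("NextToken")
        if not token:
            return blocks
```

In a handler this would be used roughly as `job_id, status = parse_completion(event)` followed by `collect_blocks(boto3.client('textract'), job_id)` when `status` is `SUCCEEDED`.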

Queries - Natural Language Document Extraction

The Queries feature (AnalyzeDocument with QUERIES type) lets you ask natural language questions about a document and get the specific value extracted - without having to parse all FORM or TABLE blocks.

python
# Use Queries to extract specific fields without parsing all blocks
import boto3

textract = boto3.client('textract')

response = textract.analyze_document(
    Document={'S3Object': {'Bucket': 'invoices', 'Name': 'invoice-001.pdf'}},
    FeatureTypes=['QUERIES'],
    QueriesConfig={
        'Queries': [
            {'Text': 'What is the invoice number?', 'Alias': 'INVOICE_NUMBER'},
            {'Text': 'What is the total amount due?', 'Alias': 'TOTAL_AMOUNT'},
            {'Text': 'What is the invoice date?', 'Alias': 'INVOICE_DATE'},
            {'Text': 'What is the vendor name?', 'Alias': 'VENDOR_NAME'}
        ]
    }
)

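# Index blocks by Id so each QUERY block can be joined to its QUERY_RESULT
blocks_by_id = {block['Id']: block for block in response['Blocks']}

for block in response['Blocks']:
    if block['BlockType'] != 'QUERY':
        continue
    # The alias lives on the QUERY block, not on the QUERY_RESULT block
    alias = block['Query'].get('Alias', block['Query']['Text'])
    # A QUERY block points at its answer(s) through an ANSWER relationship
    for rel in block.get('Relationships', []):
        if rel['Type'] != 'ANSWER':
            continue
        for answer_id in rel['Ids']:
            answer = blocks_by_id[answer_id]
            print(f"{alias}: {answer['Text']} (confidence {answer['Confidence']:.1f}%)")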
💡

Queries are often more accurate than FORMS for structured invoices and contracts because they use contextual document understanding rather than relying on the spatial proximity of label and value.

Textract Pricing

| Feature | Price Per Page | Free Tier |
|---|---|---|
| Text detection (DetectDocumentText) | $0.0015 | 1,000 pages/month for 3 months |
| Document analysis - Forms + Tables | $0.015 | 1,000 pages/month for 3 months |
| Document analysis - Signatures | $0.015 | 1,000 pages/month for 3 months |
| Document analysis - Layout | $0.004 | No free tier |
| Queries per page | $0.01 per query (max $0.015/page) | No free tier |
| Expense analysis | $0.01 | No free tier |
| Lending document analysis | $0.03 | No free tier |
⚠️

Multi-page PDFs are billed per page, not per document. At the $0.015 per-page rate listed above, a 200-page contract analyzed with FORMS + TABLES costs 200 × $0.015 = $3.00. For high-volume document processing, measure the average page count of your documents and build it into your cost model.
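The per-page arithmetic is easy to capture in a small estimator. This is a sketch only: the function name, the dictionary keys, and the way the rates are grouped are assumptions, and the rates themselves are copied from the table above rather than fetched from AWS.

```python
# Per-page rates in USD, copied from the pricing table above
RATE_PER_PAGE = {
    "text_detection": 0.0015,
    "forms_tables": 0.015,
    "signatures": 0.015,
    "layout": 0.004,
    "expense": 0.01,
    "lending": 0.03,
}


def estimate_cost(pages_per_doc, docs, feature):
    """Estimate the cost of processing `docs` documents of `pages_per_doc` pages each."""
    return round(pages_per_doc * docs * RATE_PER_PAGE[feature], 2)


# The 200-page contract from the text, analyzed with FORMS + TABLES
print(estimate_cost(200, 1, "forms_tables"))  # 3.0
```

Because billing is per page, doubling the average page count doubles the estimate even when the number of documents stays the same.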

🎯

Interview Focus Points

  1. What is the difference between Textract and a standard OCR service?
  2. When would you use AnalyzeDocument QUERIES vs FORMS for extracting fields from an invoice?
  3. How does the async Textract API work? How do you get notified when a job completes?
  4. How would you build a serverless invoice processing pipeline using Textract, Lambda, and DynamoDB?
  5. What is the difference between StartExpenseAnalysis and StartDocumentAnalysis for receipts?
  6. How does Textract handle handwritten text vs printed text?
  7. What IAM permissions does Textract need to access S3 documents and publish SNS notifications?