Transcribe

Automatic speech recognition to convert audio and video to text

Amazon Transcribe is a fully managed automatic speech recognition (ASR) service that converts audio and video files, or real-time audio streams, to text using deep learning models trained on a wide variety of speech patterns. It supports 100+ languages, speaker diarization, custom vocabulary, content redaction, and call analytics. For cloud engineers, Transcribe is the foundation of voice-driven data pipelines, accessibility solutions, meeting transcription systems, and contact center analytics platforms.

Batch Transcription vs Streaming Transcription

Transcribe operates in two modes: batch (for recorded audio files) and streaming (for live audio). Each has its own API surface and use cases.

Aspect	Batch Transcription	Streaming Transcription
API	StartTranscriptionJob (async)	StartStreamTranscription over HTTP/2 or WebSocket
Input	S3 object (MP3, MP4, WAV, FLAC, OGG, AMR, WebM)	Real-time PCM or FLAC audio stream
Output	JSON file written to S3	Partial and final results delivered as events on the open stream
Latency	Minutes to hours depending on file length	Hundreds of milliseconds (partial results faster)
Max duration	4 hours per job	Up to 4 hours per stream (session-based)
Speaker diarization	Yes - detect up to 10 speakers	Yes (single channel, up to 2 speakers)
Use case	Meeting recordings, podcast transcription, media archives	Live captions, real-time call centers, voice commands

python

# Start a batch transcription job
import boto3

transcribe = boto3.client('transcribe', region_name='us-east-1')

response = transcribe.start_transcription_job(
    TranscriptionJobName='meeting-2024-01-15',
    LanguageCode='en-US',
    MediaFormat='mp4',
    Media={'MediaFileUri': 's3://my-recordings/meeting-2024-01-15.mp4'},
    OutputBucketName='my-transcripts',
    OutputKey='transcripts/meeting-2024-01-15.json',
    Settings={
        'ShowSpeakerLabels': True,
        'MaxSpeakerLabels': 4,
        'ShowAlternatives': False
    }
)
print(response['TranscriptionJob']['TranscriptionJobStatus'])
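
Because StartTranscriptionJob is asynchronous, the status printed above is usually QUEUED or IN_PROGRESS. The following is a minimal polling sketch, not the only way to wait for a job (an S3 or EventBridge notification is often a better fit in production). The helper name `wait_for_job` and its `poll_seconds`, `timeout_seconds`, and `sleep` parameters are hypothetical; the only AWS call used is `get_transcription_job`.

```python
import time


def wait_for_job(client, job_name, poll_seconds=10, timeout_seconds=3600, sleep=time.sleep):
    """Poll get_transcription_job until the job is COMPLETED or FAILED.

    `client` is a boto3 Transcribe client (or any object exposing the same
    get_transcription_job method). `sleep` is injectable so the loop can be
    exercised without real waiting.
    """
    waited = 0
    while True:
        job = client.get_transcription_job(TranscriptionJobName=job_name)["TranscriptionJob"]
        status = job["TranscriptionJobStatus"]
        if status == "COMPLETED":
            return job
        if status == "FAILED":
            reason = job.get("FailureReason", "unknown reason")
            raise RuntimeError(f"Transcription job {job_name} failed: {reason}")
        if waited >= timeout_seconds:
            raise TimeoutError(f"Transcription job {job_name} still {status} after {waited}s")
        sleep(poll_seconds)
        waited += poll_seconds


# Usage with the client from the example above (requires AWS credentials):
# job = wait_for_job(transcribe, 'meeting-2024-01-15')
# print(job['Transcript']['TranscriptFileUri'])
```

Passing the client in as an argument keeps the helper independent of how the boto3 session is configured.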

Key Features for Production Deployments

Feature	How to Enable	Use Case
Custom Vocabulary	Settings.VocabularyName in StartTranscriptionJob	Improve accuracy for domain-specific terms, brand names, acronyms
Custom Language Model (CLM)	ModelSettings.LanguageModelName	Fine-tuned model on domain-specific text data
Speaker Diarization	Settings.ShowSpeakerLabels = true	Attribute text to different speakers in multi-party calls
PII Redaction	ContentRedaction.RedactionType = PII	Replace PII (SSN, DOB, address) with [PII] in transcript
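Vocabulary Filtering	Settings.VocabularyFilterName with Settings.VocabularyFilterMethod (remove, mask, or tag)	Remove, mask, or tag profanity or sensitive words in output
Custom Pronunciation	Custom vocabulary entries (referenced through Settings.VocabularyName)	Correct misheard product names or technical terms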
Automatic Language Identification	IdentifyLanguage = true (IdentifyMultipleLanguages = true when several languages occur in one file)	Transcribe calls without knowing the spoken language in advance
Toxicity Detection	ToxicityDetection (top-level request parameter)	Flag toxic speech in call center transcripts
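
Several of the features in the table are switched on through the same StartTranscriptionJob request. The sketch below assembles the request arguments as a plain dictionary so the combination can be inspected before it is sent. The function name `build_job_request` and its arguments are hypothetical, and the rule that vocabulary and redaction options require an explicit language code is a conservative assumption made for this example, not a statement of every combination the service accepts.

```python
def build_job_request(job_name, media_uri, language_code=None, vocabulary_name=None,
                      vocabulary_filter_name=None, redact_pii=False, max_speakers=None):
    """Build keyword arguments for boto3's start_transcription_job."""
    request = {
        "TranscriptionJobName": job_name,
        "Media": {"MediaFileUri": media_uri},
    }

    if language_code:
        request["LanguageCode"] = language_code
    else:
        # No language given: ask Transcribe to identify it.
        if vocabulary_name or vocabulary_filter_name or redact_pii:
            # Assumption for this sketch: keep language-specific options
            # tied to an explicit language code.
            raise ValueError("vocabulary and redaction options need an explicit language_code")
        request["IdentifyLanguage"] = True

    settings = {}
    if max_speakers:
        # Speaker diarization needs both flags set together.
        settings["ShowSpeakerLabels"] = True
        settings["MaxSpeakerLabels"] = max_speakers
    if vocabulary_name:
        settings["VocabularyName"] = vocabulary_name
    if vocabulary_filter_name:
        settings["VocabularyFilterName"] = vocabulary_filter_name
        settings["VocabularyFilterMethod"] = "mask"  # alternatives: "remove", "tag"
    if settings:
        request["Settings"] = settings

    if redact_pii:
        request["ContentRedaction"] = {
            "RedactionType": "PII",
            "RedactionOutput": "redacted",
        }
    return request
```

The result can be passed straight to the client, for example `transcribe.start_transcription_job(**build_job_request('call-001', 's3://my-recordings/call-001.wav', language_code='en-US', redact_pii=True))`, where the job name and S3 URI are placeholders.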

💡

Custom Language Models (CLM) provide significantly better accuracy than custom vocabulary alone for specialized domains (medical, legal, technical). You provide a text corpus of domain content, and Transcribe trains the adaptation model. CLM training takes 6-10 hours.

Transcribe Call Analytics - Contact Center Intelligence

Transcribe Call Analytics is a variant of Transcribe built for contact center recordings. It adds call-specific intelligence on top of raw transcription.

Call Analytics Feature	What It Provides
Turn-by-turn sentiment	Sentiment score (positive/negative/neutral) per speaker utterance, not just overall
Issue detection	Automatically detect what the call was about (billing issue, technical problem, etc.)
Action items	Extract commitments made during the call ("I will email you by Friday")
Call characteristics	Non-talk time, interruptions, loudness, talk speed per speaker
Agent/Customer labels	Correct speaker label assignment for two-party calls
Call categories	Rule-based event tagging (e.g., flag if agent did not say required compliance disclosure)
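
To make the turn-by-turn sentiment row concrete, the sketch below tallies sentiment labels per participant from a parsed Call Analytics result. The input shape is an assumption for illustration: a top-level `Transcript` list whose entries carry `ParticipantRole`, `Sentiment`, and `Content` keys. Check the JSON your own jobs produce before relying on these field names. The function name and the sample data are hypothetical.

```python
from collections import Counter, defaultdict


def sentiment_by_role(analytics_output):
    """Count sentiment labels per participant role across all turns."""
    counts = defaultdict(Counter)
    for turn in analytics_output.get("Transcript", []):
        role = turn.get("ParticipantRole", "UNKNOWN")
        sentiment = turn.get("Sentiment", "UNKNOWN")
        counts[role][sentiment] += 1
    return {role: dict(counter) for role, counter in counts.items()}


# Hypothetical, hand-written sample in the assumed shape (not real service output).
sample_output = {
    "Transcript": [
        {"ParticipantRole": "AGENT", "Sentiment": "NEUTRAL", "Content": "How can I help?"},
        {"ParticipantRole": "CUSTOMER", "Sentiment": "NEGATIVE", "Content": "My bill is wrong."},
        {"ParticipantRole": "AGENT", "Sentiment": "POSITIVE", "Content": "I can fix that today."},
        {"ParticipantRole": "CUSTOMER", "Sentiment": "POSITIVE", "Content": "Great, thank you."},
    ]
}

print(sentiment_by_role(sample_output))
```

A per-role tally like this is the kind of aggregate that typically gets loaded into a warehouse table for agent coaching or trend dashboards.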

⚠️

Call Analytics is priced separately from standard Transcribe, at a higher per-minute rate. Use it only for calls where you need the additional intelligence, and use standard Transcribe for simple transcription-only requirements.

Transcribe Pricing

Feature	Price	Notes
Standard batch transcription	$0.0004/second ($0.024/minute)	Free tier: 60 minutes/month for 12 months
Medical transcription	$0.00075/second ($0.045/minute)	Optimized for clinical vocabulary, HIPAA eligible
Streaming transcription	$0.0004/second	Per second of audio streamed
Call Analytics batch	$0.0285/minute	Premium - includes all analytics features
Call Analytics streaming	$0.014/minute	Real-time analytics during live calls
Custom Language Model	$0.038/minute when CLM is used	Additional charge on top of base transcription

💡

Audio duration is rounded up to the nearest second. Transcribe bills on the actual audio duration, not the processing time. A 30-minute meeting recording is billed as 1,800 seconds x $0.0004 = $0.72.
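
The billing rule above (round the audio duration up to a whole second, then multiply by the per-second rate) can be sketched as a small helper. The function name is hypothetical, and the rate in the example call is a deliberately round placeholder rather than an actual price; substitute the per-second rate that applies to your feature and region.

```python
import math
from decimal import Decimal


def estimate_cost(duration_seconds, rate_per_second):
    """Round the audio duration up to whole seconds and apply a per-second rate."""
    billable_seconds = math.ceil(duration_seconds)
    # Decimal avoids binary floating-point drift on small currency amounts.
    return billable_seconds * Decimal(str(rate_per_second))


# Placeholder rate for illustration only, not a published price.
print(estimate_cost(1799.2, "0.001"))
```

Here 1799.2 seconds of audio is billed as 1,800 seconds, which is the rounding behaviour described in the note.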

🎯

Interview Focus Points

1. What is the difference between batch and streaming transcription in Amazon Transcribe? When would you use each?
2. How does speaker diarization work in Transcribe and what are its limitations?
3. What is the difference between a Custom Vocabulary and a Custom Language Model in Transcribe?
4. How would you build a contact center intelligence pipeline using Transcribe Call Analytics, Comprehend, and a data warehouse?
5. How does Transcribe PII redaction work and what types of PII does it detect?
6. How would you handle multilingual audio where the speaker language is unknown at recording time?
7. What makes Transcribe Medical different from standard Transcribe?
8. How would you use Transcribe streaming transcription to add live captions to a video conferencing application?