AWS AI & Machine Learning
Transcribe
Automatic speech recognition to convert audio and video to text
Amazon Transcribe is a fully managed automatic speech recognition (ASR) service that converts audio and video files - or real-time audio streams - to text using deep learning models trained on a wide variety of speech patterns. It supports 100+ languages, speaker diarization, custom vocabulary, content redaction, and call analytics. For cloud engineers, Transcribe is the foundation of voice-driven data pipelines, accessibility solutions, meeting transcription systems, and contact center analytics platforms.
Batch Transcription vs Streaming Transcription
Transcribe operates in two fundamental modes - batch (for recorded audio files) and streaming (for live audio). Each has distinct API surfaces and use cases.
| Aspect | Batch Transcription | Streaming Transcription |
|---|---|---|
| API | StartTranscriptionJob (async) | WebSocket or HTTP/2 streaming |
| Input | S3 object (MP3, MP4, WAV, FLAC, OGG, AMR, WebM) | Real-time PCM or FLAC audio stream |
| Output | JSON file written to S3 | Partial and final results in WebSocket messages |
| Latency | Minutes to hours depending on file length | Hundreds of milliseconds (partial results faster) |
| Max duration | 4 hours per job | 4 hours per stream (open a new stream to continue) |
| Speaker diarization | Yes - detect up to 10 speakers | Yes (single channel, up to 2 speakers) |
| Use case | Meeting recordings, podcast transcription, media archives | Live captions, real-time call centers, voice commands |
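For the streaming column, audio is sent as a sequence of small chunks rather than as one file. The sketch below shows one way to slice raw PCM into fixed-duration chunks before handing them to a streaming client. The function name `pcm_chunks` and the 100 ms default are assumptions made for illustration, not values taken from the service documentation.

```python
from typing import Iterator


def pcm_chunks(audio: bytes, sample_rate: int = 16000,
               chunk_ms: int = 100, bytes_per_sample: int = 2) -> Iterator[bytes]:
    """Yield fixed-duration chunks of mono PCM audio (hypothetical helper)."""
    # Bytes per chunk = samples per second * bytes per sample * chunk duration.
    chunk_size = sample_rate * bytes_per_sample * chunk_ms // 1000
    if chunk_size <= 0:
        raise ValueError("chunk_ms is too small for this sample rate")
    for offset in range(0, len(audio), chunk_size):
        # The final chunk may be shorter than chunk_size.
        yield audio[offset:offset + chunk_size]


# One second of silence at 16 kHz, 16-bit mono is 32,000 bytes.
one_second = bytes(32000)
chunks = list(pcm_chunks(one_second))
print(len(chunks), len(chunks[0]))
```

Each yielded chunk would then be written to the WebSocket or HTTP/2 stream by whichever streaming client you use.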
```python
# Start a batch transcription job
import boto3

transcribe = boto3.client('transcribe', region_name='us-east-1')

response = transcribe.start_transcription_job(
    TranscriptionJobName='meeting-2024-01-15',  # must be unique within the account
    LanguageCode='en-US',
    MediaFormat='mp4',
    Media={'MediaFileUri': 's3://my-recordings/meeting-2024-01-15.mp4'},
    OutputBucketName='my-transcripts',
    OutputKey='transcripts/meeting-2024-01-15.json',
    Settings={
        'ShowSpeakerLabels': True,  # enable speaker diarization
        'MaxSpeakerLabels': 4,      # required when ShowSpeakerLabels is True
        'ShowAlternatives': False
    }
)

# The job runs asynchronously, so the initial status is normally IN_PROGRESS
print(response['TranscriptionJob']['TranscriptionJobStatus'])
```
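Once the job completes, the JSON written to S3 can be turned into readable speaker turns. The following is a minimal sketch that assumes the output file has already been downloaded and parsed into a dict, and that diarization was enabled so `results.speaker_labels` is present. The helper name `speaker_turns` and the sample data are hypothetical.

```python
def speaker_turns(transcript: dict) -> list[tuple[str, str]]:
    """Group transcript words into consecutive (speaker, text) turns."""
    results = transcript["results"]
    # Map each word's start time to the speaker assigned by diarization.
    speaker_at = {
        item["start_time"]: item["speaker_label"]
        for segment in results["speaker_labels"]["segments"]
        for item in segment["items"]
    }
    turns: list[tuple[str, str]] = []
    speaker = "unknown"
    for item in results["items"]:
        content = item["alternatives"][0]["content"]
        if item["type"] == "punctuation":
            # Punctuation carries no timestamp; attach it to the current turn.
            if turns:
                turns[-1] = (turns[-1][0], turns[-1][1] + content)
            continue
        speaker = speaker_at.get(item["start_time"], speaker)
        if turns and turns[-1][0] == speaker:
            turns[-1] = (speaker, turns[-1][1] + " " + content)
        else:
            turns.append((speaker, content))
    return turns


# Tiny hand-written sample in the shape of a diarized batch transcript.
sample = {"results": {
    "speaker_labels": {"segments": [
        {"speaker_label": "spk_0", "items": [
            {"start_time": "0.0", "speaker_label": "spk_0"},
            {"start_time": "0.4", "speaker_label": "spk_0"}]},
        {"speaker_label": "spk_1", "items": [
            {"start_time": "1.2", "speaker_label": "spk_1"}]}]},
    "items": [
        {"type": "pronunciation", "start_time": "0.0", "alternatives": [{"content": "Hello"}]},
        {"type": "pronunciation", "start_time": "0.4", "alternatives": [{"content": "there"}]},
        {"type": "punctuation", "alternatives": [{"content": "."}]},
        {"type": "pronunciation", "start_time": "1.2", "alternatives": [{"content": "Hi"}]},
        {"type": "punctuation", "alternatives": [{"content": "."}]}]}}

for who, text in speaker_turns(sample):
    print(f"{who}: {text}")
```

Matching on `start_time` keeps the sketch short. A production version would likely also handle words that fall outside every diarization segment.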
Key Features for Production Deployments
| Feature | How to Enable | Use Case |
|---|---|---|
| Custom Vocabulary | Settings.VocabularyName in StartTranscriptionJob | Improve accuracy for domain-specific terms, brand names, acronyms |
| Custom Language Model (CLM) | ModelSettings.LanguageModelName | Fine-tuned model on domain-specific text data |
| Speaker Diarization | Settings.ShowSpeakerLabels = true | Attribute text to different speakers in multi-party calls |
| PII Redaction | ContentRedaction.RedactionType = PII | Replace PII (SSN, DOB, address) with [PII] in transcript |
| Vocabulary Filtering | Settings.VocabularyFilterName with Settings.VocabularyFilterMethod (remove, mask, or tag) | Remove or tag profanity or sensitive words in output |
| Custom Pronunciation | Custom vocabulary table entries (Phrase and DisplayAs columns) | Correct misheard product names or technical terms |
| Automatic Language Identification | IdentifyLanguage = true (IdentifyMultipleLanguages = true when more than one language is spoken) | Transcribe calls without knowing the language in advance |
| Toxicity Detection | ToxicityDetection (top-level request parameter, ToxicityCategories = ALL) | Flag toxic speech in call center transcripts |
Custom Language Models (CLM) can provide significantly better accuracy than custom vocabulary alone for specialized domains (medical, legal, technical). You provide a text corpus of domain content, and Transcribe trains the adapted model on it. CLM training typically takes 6-10 hours.
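Several of these features are just extra keys on the same `StartTranscriptionJob` request. The sketch below assembles the keyword arguments as a plain dict, which can then be passed to `transcribe.start_transcription_job(**request)`. The function `build_job_request` and its parameters are hypothetical, and the validation rules reflect one conservative reading of the API (a CLM or custom vocabulary is paired with an explicit language code here).

```python
from typing import Optional


def build_job_request(job_name: str, media_uri: str, *,
                      language_code: Optional[str] = "en-US",
                      vocabulary_name: Optional[str] = None,
                      language_model_name: Optional[str] = None,
                      redact_pii: bool = False,
                      max_speakers: Optional[int] = None) -> dict:
    """Assemble StartTranscriptionJob keyword arguments (illustrative helper)."""
    request: dict = {
        "TranscriptionJobName": job_name,
        "Media": {"MediaFileUri": media_uri},
    }
    settings: dict = {}

    if language_code is None:
        # No language given: ask Transcribe to identify it instead.
        if vocabulary_name or language_model_name:
            raise ValueError("this sketch requires language_code with a vocabulary or CLM")
        request["IdentifyLanguage"] = True
    else:
        request["LanguageCode"] = language_code

    if vocabulary_name:
        settings["VocabularyName"] = vocabulary_name
    if language_model_name:
        request["ModelSettings"] = {"LanguageModelName": language_model_name}
    if max_speakers is not None:
        settings["ShowSpeakerLabels"] = True
        settings["MaxSpeakerLabels"] = max_speakers
    if redact_pii:
        request["ContentRedaction"] = {
            "RedactionType": "PII",
            "RedactionOutput": "redacted",
        }
    if settings:
        request["Settings"] = settings
    return request


# Placeholder names; none of these resources exist.
request = build_job_request(
    "support-call-001", "s3://example-bucket/call.wav",
    vocabulary_name="product-names", redact_pii=True, max_speakers=2)
print(request)
```

Keeping request construction separate from the boto3 call makes the feature combinations easy to unit test without AWS credentials.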
Transcribe Call Analytics - Contact Center Intelligence
Transcribe Call Analytics is a specialized variant of Transcribe designed specifically for contact center recordings. It adds call-specific intelligence on top of raw transcription.
| Call Analytics Feature | What It Provides |
|---|---|
| Turn-by-turn sentiment | Sentiment score (positive/negative/neutral) per speaker utterance, not just overall |
| Issue detection | Automatically detect what the call was about (billing issue, technical problem, etc.) |
| Action items | Extract commitments made during the call ("I will email you by Friday") |
| Call characteristics | Non-talk time, interruptions, loudness, talk speed per speaker |
| Agent/Customer labels | Correct speaker label assignment for two-party calls |
| Call categories | Rule-based event tagging (e.g., flag if agent did not say required compliance disclosure) |
Call Analytics is priced separately from standard Transcribe and costs more per minute. Use it only for calls where you need the additional intelligence, and use standard Transcribe for transcription-only requirements.
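As an illustration of working with turn-by-turn sentiment, the sketch below tallies sentiment labels per participant from a parsed Call Analytics output file. It assumes the output contains a top-level `Transcript` list whose entries carry `ParticipantRole` and `Sentiment` fields. Verify those field names against one of your own output files before relying on them. The helper name and the sample data are hypothetical.

```python
from collections import Counter, defaultdict


def sentiment_by_role(call_analytics_output: dict) -> dict[str, Counter]:
    """Count sentiment labels per participant role (illustrative helper)."""
    tally: defaultdict[str, Counter] = defaultdict(Counter)
    # Assumed shape: one entry per speaker turn.
    for turn in call_analytics_output.get("Transcript", []):
        tally[turn["ParticipantRole"]][turn["Sentiment"]] += 1
    return dict(tally)


# Hand-written sample, not real service output.
sample_output = {"Transcript": [
    {"ParticipantRole": "AGENT", "Sentiment": "NEUTRAL", "Content": "How can I help?"},
    {"ParticipantRole": "CUSTOMER", "Sentiment": "NEGATIVE", "Content": "My bill is wrong."},
    {"ParticipantRole": "AGENT", "Sentiment": "POSITIVE", "Content": "I can fix that."},
    {"ParticipantRole": "CUSTOMER", "Sentiment": "POSITIVE", "Content": "Thank you."},
]}

summary = sentiment_by_role(sample_output)
for role, counts in summary.items():
    print(role, dict(counts))
```

A per-role tally like this is the kind of row you might load into a data warehouse alongside the call ID.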
Transcribe Pricing
| Feature | Price | Notes |
|---|---|---|
| Standard batch transcription | $0.0004/second ($0.024/minute) | Free tier: 60 minutes/month for 12 months |
| Medical transcription | $0.00075/second ($0.045/minute) | Optimized for clinical vocabulary, HIPAA eligible |
| Streaming transcription | $0.0004/second ($0.024/minute) | Per second of audio streamed |
| Call Analytics batch | $0.0285/minute | Premium - includes all analytics features |
| Call Analytics streaming | $0.014/minute | Real-time analytics during live calls |
| Custom Language Model | $0.038/minute when CLM is used | Additional charge on top of base transcription |
Audio duration is rounded up to the nearest second, with a 15-second minimum charge per request. Transcribe bills on the actual audio duration, not the processing time. For example, a 30-minute meeting recording is billed as 1,800 seconds x $0.0004 = $0.72.
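The billing rule can be sketched as a small calculation: round the audio duration up to whole seconds, then multiply by a per-second rate. The function `estimate_cost` is hypothetical, the rate is deliberately a parameter so it can be read from the current pricing table, and the value used in the example is a round placeholder rather than a real price.

```python
import math
from decimal import Decimal


def estimate_cost(duration_seconds: float, rate_per_second: Decimal,
                  min_billable_seconds: int = 0) -> Decimal:
    """Estimate the charge for one request (illustrative helper)."""
    if duration_seconds < 0:
        raise ValueError("duration_seconds must be non-negative")
    # Partial seconds are rounded up to the next whole second.
    billable = math.ceil(duration_seconds)
    # Optional floor, for pricing terms that include a minimum per request.
    billable = max(billable, min_billable_seconds)
    return billable * rate_per_second


# Placeholder rate for illustration only; substitute the rate that applies to you.
placeholder_rate = Decimal("0.001")
cost = estimate_cost(1799.2, placeholder_rate)
print(f"Estimated cost: ${cost}")
```

`Decimal` is used instead of `float` so that small per-second rates do not pick up binary rounding noise when multiplied by large second counts.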
Interview Focus Points
1. What is the difference between batch and streaming transcription in Amazon Transcribe? When would you use each?
2. How does speaker diarization work in Transcribe and what are its limitations?
3. What is the difference between a Custom Vocabulary and a Custom Language Model in Transcribe?
4. How would you build a contact center intelligence pipeline using Transcribe Call Analytics, Comprehend, and a data warehouse?
5. How does Transcribe PII redaction work and what types of PII does it detect?
6. How would you handle multilingual audio where the speaker language is unknown at recording time?
7. What makes Transcribe Medical different from standard Transcribe?
8. How would you use Transcribe streaming transcription to add live captions to a video conferencing application?