AWS AI & Machine Learning

Transcribe

Automatic speech recognition to convert audio and video to text

Amazon Transcribe is a fully managed automatic speech recognition (ASR) service that converts audio and video files, or real-time audio streams, to text using deep learning models trained on a wide variety of speech patterns. It supports 100+ languages, speaker diarization, custom vocabulary, content redaction, and call analytics. For cloud engineers, Transcribe is the foundation of voice-driven data pipelines, accessibility solutions, meeting transcription systems, and contact center analytics platforms.

Batch Transcription vs Streaming Transcription

Transcribe operates in two fundamental modes: batch (for recorded audio files) and streaming (for live audio). Each has its own API surface and use cases.

| Aspect | Batch Transcription | Streaming Transcription |
|---|---|---|
| API | StartTranscriptionJob (async) | WebSocket or HTTP/2 streaming |
| Input | S3 object (MP3, MP4, WAV, FLAC, OGG, AMR, WebM) | Real-time PCM or FLAC audio stream |
| Output | JSON file written to S3 | Partial and final results in WebSocket messages |
| Latency | Minutes to hours depending on file length | Hundreds of milliseconds (partial results arrive first) |
| Max duration | 4 hours per job | 4 hours per stream |
| Speaker diarization | Yes (MaxSpeakerLabels accepts 2 to 30 speakers) | Yes (single channel, up to 2 speakers) |
| Use case | Meeting recordings, podcast transcription, media archives | Live captions, real-time call centers, voice commands |
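Streaming sessions return each utterance several times: first as partial results, then once as a final result. The following is a minimal sketch of keeping only the final text, assuming result dictionaries shaped like the streaming API's `Results` entries (`ResultId`, `IsPartial`, `Alternatives[].Transcript`). The function name and the sample data are illustrative and not part of any SDK.

```python
from typing import Iterable


def collect_final_transcript(results: Iterable[dict]) -> str:
    """Join the text of final (non-partial) results in arrival order."""
    final_segments = []
    for result in results:
        # Partial results are superseded by a later final result, so skip them.
        if result.get("IsPartial", True):
            continue
        alternatives = result.get("Alternatives") or []
        if alternatives:
            # The first alternative is taken as the preferred transcription.
            final_segments.append(alternatives[0]["Transcript"])
    return " ".join(final_segments)


# Hand-written sample modeled on streaming results (not real service output).
sample_results = [
    {"ResultId": "r1", "IsPartial": True, "Alternatives": [{"Transcript": "Hello"}]},
    {"ResultId": "r1", "IsPartial": True, "Alternatives": [{"Transcript": "Hello every"}]},
    {"ResultId": "r1", "IsPartial": False, "Alternatives": [{"Transcript": "Hello everyone."}]},
    {"ResultId": "r2", "IsPartial": False, "Alternatives": [{"Transcript": "Let's begin."}]},
]

print(collect_final_transcript(sample_results))  # Hello everyone. Let's begin.
```

A live-caption UI would typically display partial results immediately and overwrite them when the final result with the same `ResultId` arrives; an archive only needs the final text, as here.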
```python
# Start a batch transcription job
import boto3

transcribe = boto3.client('transcribe', region_name='us-east-1')

response = transcribe.start_transcription_job(
    TranscriptionJobName='meeting-2024-01-15',  # must be unique within the account and region
    LanguageCode='en-US',
    MediaFormat='mp4',
    Media={'MediaFileUri': 's3://my-recordings/meeting-2024-01-15.mp4'},
    OutputBucketName='my-transcripts',
    OutputKey='transcripts/meeting-2024-01-15.json',
    Settings={
        'ShowSpeakerLabels': True,   # enable speaker diarization
        'MaxSpeakerLabels': 4,       # required when ShowSpeakerLabels is True
        'ShowAlternatives': False
    }
)

# The call returns immediately; the job runs asynchronously (initial status is IN_PROGRESS).
print(response['TranscriptionJob']['TranscriptionJobStatus'])
```
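Once a job with `ShowSpeakerLabels` finishes, the JSON written to S3 carries both the recognized words and the diarization segments. The sketch below groups words into speaker turns. It assumes the batch output layout of `results.items` plus `results.speaker_labels.segments`; the sample document is hand-written and heavily trimmed, and `transcript_by_speaker` is a hypothetical helper name.

```python
def transcript_by_speaker(output: dict) -> list[tuple[str, str]]:
    """Return (speaker_label, text) turns from a batch transcript document."""
    results = output["results"]

    # Map each word's start_time to the speaker assigned by diarization.
    speaker_at = {}
    for segment in results["speaker_labels"]["segments"]:
        for item in segment["items"]:
            speaker_at[item["start_time"]] = item["speaker_label"]

    turns = []
    for item in results["items"]:
        content = item["alternatives"][0]["content"]
        if item["type"] == "punctuation":
            # Punctuation has no timestamp; attach it to the current turn.
            if turns:
                speaker, text = turns[-1]
                turns[-1] = (speaker, text + content)
            continue
        speaker = speaker_at.get(item["start_time"], "unknown")
        if turns and turns[-1][0] == speaker:
            turns[-1] = (speaker, turns[-1][1] + " " + content)
        else:
            turns.append((speaker, content))
    return turns


# Hand-written sample in the assumed output shape (not real service output).
sample_output = {
    "results": {
        "speaker_labels": {
            "segments": [
                {"speaker_label": "spk_0", "items": [
                    {"start_time": "0.0", "speaker_label": "spk_0"},
                    {"start_time": "0.5", "speaker_label": "spk_0"},
                ]},
                {"speaker_label": "spk_1", "items": [
                    {"start_time": "1.2", "speaker_label": "spk_1"},
                ]},
            ]
        },
        "items": [
            {"type": "pronunciation", "start_time": "0.0", "alternatives": [{"content": "Good"}]},
            {"type": "pronunciation", "start_time": "0.5", "alternatives": [{"content": "morning"}]},
            {"type": "punctuation", "alternatives": [{"content": "."}]},
            {"type": "pronunciation", "start_time": "1.2", "alternatives": [{"content": "Hi"}]},
            {"type": "punctuation", "alternatives": [{"content": "."}]},
        ],
    }
}

for speaker, text in transcript_by_speaker(sample_output):
    print(f"{speaker}: {text}")
```

In a real pipeline the document would be loaded with `json.loads` from the S3 object named by `OutputKey`.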

Key Features for Production Deployments

| Feature | How to Enable | Use Case |
|---|---|---|
| Custom Vocabulary | Settings.VocabularyName in StartTranscriptionJob | Improve accuracy for domain-specific terms, brand names, acronyms |
| Custom Language Model (CLM) | ModelSettings.LanguageModelName | Fine-tuned model on domain-specific text data |
| Speaker Diarization | Settings.ShowSpeakerLabels = true | Attribute text to different speakers in multi-party calls |
| PII Redaction | ContentRedaction.RedactionType = PII | Replace PII (SSN, DOB, address) with [PII] in transcript |
| Vocabulary Filtering | Settings.VocabularyFilterName with Settings.VocabularyFilterMethod (remove, mask, or tag) | Remove or tag profanity or sensitive words in output |
| Custom Pronunciation / Display | Custom vocabulary table entries (for example the DisplayAs column), referenced through Settings.VocabularyName | Correct misheard product names or technical terms |
| Automatic Language Identification | IdentifyLanguage = true (IdentifyMultipleLanguages = true for audio with several languages) | Transcribe calls without knowing the language in advance |
| Toxicity Detection | ToxicityDetection (ToxicityCategories = ALL) | Flag toxic speech in call center transcripts |
💡 Custom Language Models (CLM) provide significantly better accuracy than custom vocabulary alone for specialized domains (medical, legal, technical). You provide a text corpus of domain content, and Transcribe trains the adapted model from it. CLM training takes 6-10 hours.
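Several of these features are switched on through different parts of the same `StartTranscriptionJob` request. The sketch below assembles such a request as a plain dictionary so the nesting is visible. The parameter names (`Settings`, `ContentRedaction`, `ModelSettings`) are boto3 request fields; the helper `build_job_request` and every resource name in the example are hypothetical.

```python
from typing import Optional


def build_job_request(
    job_name: str,
    media_uri: str,
    language_code: str,
    *,
    vocabulary_name: Optional[str] = None,
    vocabulary_filter_name: Optional[str] = None,
    redact_pii: bool = False,
    language_model_name: Optional[str] = None,
) -> dict:
    """Build keyword arguments for transcribe.start_transcription_job()."""
    request = {
        "TranscriptionJobName": job_name,
        "LanguageCode": language_code,
        "Media": {"MediaFileUri": media_uri},
    }

    settings = {}
    if vocabulary_name:
        settings["VocabularyName"] = vocabulary_name
    if vocabulary_filter_name:
        settings["VocabularyFilterName"] = vocabulary_filter_name
        # "mask" replaces filtered words with ***; "remove" and "tag" also exist.
        settings["VocabularyFilterMethod"] = "mask"
    if settings:
        request["Settings"] = settings

    if redact_pii:
        # PII redaction is configured at the top level, not inside Settings.
        request["ContentRedaction"] = {
            "RedactionType": "PII",
            "RedactionOutput": "redacted",
        }

    if language_model_name:
        # Custom language models are selected through ModelSettings.
        request["ModelSettings"] = {"LanguageModelName": language_model_name}

    return request


# Hypothetical resource names, for illustration only.
request = build_job_request(
    "support-call-001",
    "s3://my-recordings/support-call-001.wav",
    "en-US",
    vocabulary_name="product-terms",
    vocabulary_filter_name="profanity-list",
    redact_pii=True,
)
print(sorted(request))
```

The result can be passed to boto3 as `transcribe.start_transcription_job(**request)`. The language code is kept mandatory here because, with automatic language identification, per-language vocabularies and models are supplied differently (through `LanguageIdSettings`), which this sketch does not cover.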

Transcribe Call Analytics - Contact Center Intelligence

Transcribe Call Analytics is a variant of Transcribe built for contact center recordings. It adds call-specific intelligence on top of raw transcription.

| Call Analytics Feature | What It Provides |
|---|---|
| Turn-by-turn sentiment | Sentiment score (positive/negative/neutral) per speaker utterance, not just overall |
| Issue detection | Automatically detect what the call was about (billing issue, technical problem, etc.) |
| Action items | Extract commitments made during the call ("I will email you by Friday") |
| Call characteristics | Non-talk time, interruptions, loudness, talk speed per speaker |
| Agent/Customer labels | Correct speaker label assignment for two-party calls |
| Call categories | Rule-based event tagging (e.g., flag if agent did not say required compliance disclosure) |
⚠️ Call Analytics is priced separately from standard Transcribe, at a higher per-minute rate. Use it only for calls where you need the additional intelligence, and use standard Transcribe for transcription-only requirements.
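Turn-by-turn sentiment becomes useful once it is aggregated per participant. The sketch below tallies sentiment labels by role. It assumes the Call Analytics output contains a top-level `Transcript` list whose turns carry `ParticipantRole` and `Sentiment` fields; treat those field names as an assumption to verify against a real output file. The sample data is invented.

```python
from collections import Counter, defaultdict


def sentiment_by_role(analytics_output: dict) -> dict:
    """Count sentiment labels per participant role across all turns."""
    counts = defaultdict(Counter)
    for turn in analytics_output.get("Transcript", []):
        counts[turn["ParticipantRole"]][turn["Sentiment"]] += 1
    # Convert to plain dicts so the result is easy to serialize or compare.
    return {role: dict(counter) for role, counter in counts.items()}


# Invented sample in the assumed output shape (not real service output).
sample_call = {
    "Transcript": [
        {"ParticipantRole": "AGENT", "Sentiment": "NEUTRAL", "Content": "How can I help?"},
        {"ParticipantRole": "CUSTOMER", "Sentiment": "NEGATIVE", "Content": "I was billed twice."},
        {"ParticipantRole": "AGENT", "Sentiment": "POSITIVE", "Content": "I can fix that today."},
        {"ParticipantRole": "CUSTOMER", "Sentiment": "NEGATIVE", "Content": "This keeps happening."},
    ]
}

print(sentiment_by_role(sample_call))
```

A per-role tally like this is the kind of record that could be loaded into a data warehouse alongside the call metadata.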

Transcribe Pricing

| Feature | Price | Notes |
|---|---|---|
| Standard batch transcription | $0.0004/second ($0.024/minute) | Free tier: 60 minutes/month for 12 months |
| Medical transcription | $0.00075/second ($0.045/minute) | Optimized for clinical vocabulary, HIPAA eligible |
| Streaming transcription | $0.0004/second ($0.024/minute) | Per second of audio streamed |
| Call Analytics batch | $0.0285/minute | Premium: includes all analytics features |
| Call Analytics streaming | $0.014/minute | Real-time analytics during live calls |
| Custom Language Model | $0.038/minute when CLM is used | Additional charge on top of base transcription |

💡 Audio duration is rounded up to the nearest second. Transcribe bills on the actual audio duration, not the processing time. A 30-minute meeting recording is billed as 1,800 seconds x $0.0004 = $0.72.
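The per-second billing arithmetic can be sketched as a small helper. The rate is passed in explicitly; the $0.024 per minute used in the example is an assumed first-tier rate for standard transcription, so substitute the current figure for your region. `estimate_cost` is an illustrative name, not an AWS API.

```python
import math
from decimal import Decimal


def estimate_cost(duration_seconds: float, rate_per_minute: Decimal) -> Decimal:
    """Estimate the charge for one audio file billed per second of audio."""
    # Audio duration is rounded up to the next whole second before billing.
    billable_seconds = math.ceil(duration_seconds)
    cost = Decimal(billable_seconds) * rate_per_minute / Decimal(60)
    # Round to four decimal places for display.
    return cost.quantize(Decimal("0.0001"))


ASSUMED_RATE_PER_MINUTE = Decimal("0.024")  # assumption: check the pricing page

# A 30-minute recording: 1800 s x ($0.024 / 60) per second.
print(estimate_cost(1800, ASSUMED_RATE_PER_MINUTE))  # 0.7200

# A 90.2-second clip is billed as 91 seconds.
print(estimate_cost(90.2, ASSUMED_RATE_PER_MINUTE))  # 0.0364
```

`Decimal` is used instead of floats so that small per-second rates do not accumulate binary rounding noise when costs are summed over many files.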

🎯 Interview Focus Points

  1. What is the difference between batch and streaming transcription in Amazon Transcribe? When would you use each?
  2. How does speaker diarization work in Transcribe and what are its limitations?
  3. What is the difference between a Custom Vocabulary and a Custom Language Model in Transcribe?
  4. How would you build a contact center intelligence pipeline using Transcribe Call Analytics, Comprehend, and a data warehouse?
  5. How does Transcribe PII redaction work and what types of PII does it detect?
  6. How would you handle multilingual audio where the speaker language is unknown at recording time?
  7. What makes Transcribe Medical different from standard Transcribe?
  8. How would you use Transcribe streaming transcription to add live captions to a video conferencing application?