Ace Cloud Interviews

AWS AI & Machine Learning

Polly

Convert text to natural-sounding speech with neural TTS voices in 30+ languages

Amazon Polly is a text-to-speech (TTS) service that uses deep learning to convert text into natural-sounding audio. It offers 90+ voices across 30+ languages, spread over several voice engines that trade quality against cost. Polly is commonly used for voice interfaces, accessibility features, e-learning content, and IVR systems. For cloud engineers, understanding Polly means knowing how to integrate speech synthesis into serverless pipelines and how to manage the tradeoff between voice quality and cost.

Standard vs Neural vs Long-Form vs Generative Voices

Polly offers four voice engine types with different quality levels and pricing.

| Engine | Technology | Quality | Price (per 1M characters) | Best For |
|---|---|---|---|---|
| Standard | Concatenative synthesis | Good - slight robotic quality | $4.00 | High-volume, cost-sensitive workloads |
| Neural | Neural TTS (NTTS) | Very good - natural prosody | $16.00 | Customer-facing applications |
| Long-form | Neural optimized for long content | Excellent - podcast/narration quality | $100.00 | Audiobooks, e-learning, long narrations |
| Generative | Generative AI voice synthesis | Human-like | $30.00 | Premium interactive experiences |
Tip: Not all voices are available in all engines. Always check voice availability for your target language and engine before committing to an architecture. The aws polly describe-voices CLI command lists all available combinations.
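The same availability check can be scripted. The sketch below assumes a boto3 Polly client is passed in and relies on the `describe_voices` call with its `Engine` and `LanguageCode` filters, following `NextToken` for pagination. The helper name `list_voice_ids` is hypothetical, not part of boto3.

```python
def list_voice_ids(polly_client, engine, language_code):
    """Return sorted voice IDs that support `engine` for `language_code`.

    `polly_client` is assumed to be boto3.client("polly") or any object
    exposing a compatible describe_voices method.
    """
    voice_ids = []
    kwargs = {"Engine": engine, "LanguageCode": language_code}
    while True:
        page = polly_client.describe_voices(**kwargs)
        voice_ids.extend(voice["Id"] for voice in page.get("Voices", []))
        token = page.get("NextToken")
        if not token:
            break  # no further pages
        kwargs["NextToken"] = token
    return sorted(voice_ids)
```

With AWS credentials configured, calling `list_voice_ids(boto3.client("polly"), "neural", "en-US")` would return the neural-capable US English voices; an empty list means that language and engine combination is not offered.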

SSML - Fine-Grained Speech Control

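Speech Synthesis Markup Language (SSML) is an XML-based markup language that controls pronunciation, speaking rate, pitch, volume, pauses, and emphasis in Polly output. To use it, set TextType to ssml and wrap the input in a <speak> element. Tag support varies by engine: neural voices, for example, do not support <emphasis> or the pitch attribute of <prosody>, so test your markup against the engine you plan to use. SSML is the main tool for tuning production-quality TTS.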

xml
<!-- SSML example controlling speech characteristics -->
<speak>
    Welcome to <emphasis level="strong">Ace Cloud Interviews</emphasis>.
    <break time="500ms"/>
    Today we will cover 
    <prosody rate="slow" pitch="+10%">Amazon SageMaker</prosody>.
    <break time="300ms"/>
    The first question is:
    <prosody volume="loud" rate="90%">
        How does SageMaker managed spot training handle interruptions?
    </prosody>
    <break time="1s"/>
    Before we answer, note that <sub alias="Amazon Simple Storage Service">S3</sub>
    is used to store training checkpoints.
</speak>
| SSML Tag | Purpose | Example |
|---|---|---|
| <break> | Add silence/pause | <break time="500ms"/> or <break strength="strong"/> |
| <emphasis> | Stress words | <emphasis level="strong">critical</emphasis> |
| <prosody> | Control rate, pitch, volume | <prosody rate="slow" pitch="+5%">text</prosody> |
| <sub> | Pronunciation substitution | <sub alias="Amazon S3">S3</sub> |
| <phoneme> | Phonetic pronunciation | <phoneme alphabet="ipa" ph="ˈti.bi.es">TBS</phoneme> |
| <lang> | Switch language mid-speech | <lang xml:lang="fr-FR">Bonjour</lang> |
| <speak> | Root element wrapper | Always wraps SSML content |
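When SSML is assembled from user-supplied or database text, unescaped characters such as & or < produce invalid XML and the request is rejected. The following is a minimal sketch of building SSML programmatically with escaping, using only the Python standard library. The helper names (`plain`, `ssml_break`, `ssml_sub`, `build_ssml`) are hypothetical and cover only a few of the tags above.

```python
from xml.sax.saxutils import escape, quoteattr


def plain(text):
    """Escape plain text so &, < and > cannot break the SSML document."""
    return escape(text)


def ssml_break(milliseconds):
    """Return a <break> tag for a pause of the given length."""
    return f'<break time="{int(milliseconds)}ms"/>'


def ssml_sub(text, alias):
    """Return a <sub> tag that speaks `alias` in place of `text`."""
    return f"<sub alias={quoteattr(alias)}>{escape(text)}</sub>"


def build_ssml(*fragments):
    """Wrap already-built SSML fragments in the <speak> root element."""
    return "<speak>" + "".join(fragments) + "</speak>"


ssml = build_ssml(
    plain("Q&A about "),
    ssml_sub("S3", "Amazon Simple Storage Service"),
    ssml_break(500),
    plain("Next question."),
)
```

The resulting string would be passed to Polly as the Text parameter with TextType set to ssml.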

Common Architecture Patterns with Polly

Polly fits into several common serverless pipeline patterns.

| Pattern | Architecture | Use Case |
|---|---|---|
| Synchronous TTS | API call -> Polly SynthesizeSpeech -> stream audio bytes to client | Short texts (<3000 chars), real-time voice response |
| Async S3 pipeline | Lambda -> StartSpeechSynthesisTask -> SNS notify -> S3 MP3 | Long texts, background audio generation for articles |
| Content pipeline | S3 text upload -> Lambda -> Polly -> S3 audio -> CloudFront CDN | E-learning audio, podcast generation at scale |
| Lex + Polly voice bot | Amazon Lex handles NLU -> Polly renders bot responses as speech | IVR systems, voice assistants |
| Accessibility layer | Page text -> API Gateway -> Lambda -> Polly -> browser audio player | Read-aloud features for websites |
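The synchronous pattern in the first row might look like the sketch below. It assumes a boto3 Polly client is passed in and uses the `synthesize_speech` call, whose response carries the audio in an `AudioStream` body. The helper name `synthesize_bytes`, the default voice, and the length guard are illustrative choices rather than anything Polly requires.

```python
SYNC_CHAR_LIMIT = 3000  # synchronous limit referenced in this guide


def synthesize_bytes(polly_client, text, voice_id="Joanna", engine="neural"):
    """Synthesize short plain text synchronously and return MP3 bytes.

    `polly_client` is assumed to be boto3.client("polly") or a compatible object.
    """
    if len(text) > SYNC_CHAR_LIMIT:
        # Fail fast locally instead of waiting for the service to reject it.
        raise ValueError("text too long for synchronous synthesis")
    response = polly_client.synthesize_speech(
        Text=text,
        TextType="text",
        OutputFormat="mp3",
        VoiceId=voice_id,
        Engine=engine,
    )
    # AudioStream is a streaming body; read() drains it into memory.
    return response["AudioStream"].read()
```

In a Lambda behind API Gateway, the returned bytes would typically be base64-encoded in the response or written to S3 and served through a URL.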
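python
# Asynchronous synthesis task (for texts > 3000 characters)
import boto3

polly = boto3.client('polly')

# Placeholder input; in practice, load the article body from your content store
long_article_text = "Replace this with the full article text."

response = polly.start_speech_synthesis_task(
    Engine='neural',
    LanguageCode='en-US',
    OutputFormat='mp3',
    OutputS3BucketName='my-audio-bucket',   # bucket that receives the audio file
    OutputS3KeyPrefix='articles/',          # key prefix for the generated object
    Text=long_article_text,
    TextType='text',
    VoiceId='Joanna',
    SnsTopicArn='arn:aws:sns:us-east-1:123456789012:polly-complete'  # optional completion notification
)

# Keep the task ID to look up the status and S3 output location later
task_id = response['SynthesisTask']['TaskId']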
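Where no SNS topic is configured, the task can be polled instead. The sketch below assumes a boto3 Polly client and uses `get_speech_synthesis_task`, which reports a `TaskStatus` and, once finished, an `OutputUri`. The helper name `wait_for_task` and its polling defaults are hypothetical.

```python
import time


def wait_for_task(polly_client, task_id, poll_seconds=5.0, max_polls=60):
    """Poll a synthesis task until it completes and return its S3 OutputUri."""
    for _ in range(max_polls):
        task = polly_client.get_speech_synthesis_task(TaskId=task_id)["SynthesisTask"]
        status = task["TaskStatus"]
        if status == "completed":
            return task["OutputUri"]
        if status == "failed":
            raise RuntimeError(task.get("TaskStatusReason", "synthesis task failed"))
        time.sleep(poll_seconds)  # task is still scheduled or in progress
    raise TimeoutError(f"task {task_id} did not finish after {max_polls} polls")
```

Polling is simple to reason about but keeps a worker busy; in a serverless pipeline the SNS notification is usually the better fit.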
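Warning: SynthesizeSpeech (synchronous) accepts up to 3,000 billed characters per request (6,000 characters in total once SSML tags are included). For longer content, use StartSpeechSynthesisTask, which accepts up to 100,000 billed characters (200,000 in total). The asynchronous task writes the audio file directly to S3 and, if an SNS topic is supplied, publishes a notification when it completes.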
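If the synchronous API is still preferred, for example to start playback quickly, longer text has to be split into pieces that fit the limit. The following is a rough sketch that splits at sentence boundaries; it conservatively treats every character as billed, and the name `split_for_sync` is hypothetical.

```python
import re

SYNC_LIMIT = 3000  # synchronous character limit discussed above


def split_for_sync(text, limit=SYNC_LIMIT):
    """Split text into chunks of at most `limit` characters, preferring sentence ends."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        # Hard-split any single sentence that is itself longer than the limit.
        while len(sentence) > limit:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(sentence[:limit])
            sentence = sentence[limit:]
        candidate = f"{current} {sentence}".strip()
        if len(candidate) <= limit:
            current = candidate  # sentence still fits in the current chunk
        else:
            chunks.append(current)
            current = sentence
    if current:
        chunks.append(current)
    return chunks
```

Each chunk can then be sent to SynthesizeSpeech in turn. For most long-form content the asynchronous task remains simpler, because it avoids stitching audio segments together.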

Polly Pricing and Cost Optimization

| Engine | Price | Free Tier |
|---|---|---|
| Standard voices | $4.00 per 1M characters | 5M characters/month free for 12 months |
| Neural voices | $16.00 per 1M characters | 1M characters/month free for 12 months |
| Long-form voices | $100.00 per 1M characters | No free tier |
| Generative voices | $30.00 per 1M characters | No free tier |
Tip: Polly bills for the characters of text it synthesizes. SSML tags are not counted as billed characters, although they do count toward the total request size limit, so heavily tagged documents can hit that limit sooner than their visible text suggests. Monitor billed character counts in production.
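A quick way to compare engines for a given workload is to multiply the expected character volume by the per-million prices in the table. The sketch below hard-codes those table prices, ignores any free tier, and uses a hypothetical helper name `estimate_monthly_cost`; actual pricing should be checked against the current AWS price list.

```python
# USD per 1M characters, copied from the pricing table in this section.
PRICE_PER_MILLION = {
    "standard": 4.00,
    "neural": 16.00,
    "long-form": 100.00,
    "generative": 30.00,
}


def estimate_monthly_cost(characters, engine):
    """Estimate the monthly cost in USD for `characters` on `engine`, ignoring free tier."""
    if characters < 0:
        raise ValueError("characters must be non-negative")
    return round(characters / 1_000_000 * PRICE_PER_MILLION[engine], 2)


# Example: compare all engines for 2 million characters per month.
comparison = {
    engine: estimate_monthly_cost(2_000_000, engine) for engine in PRICE_PER_MILLION
}
# comparison == {"standard": 8.0, "neural": 32.0, "long-form": 200.0, "generative": 60.0}
```

The 25x spread between standard and long-form is why high-volume read-aloud features often reserve the premium engines for content where quality visibly matters.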

Interview Focus Points

  1. What is the difference between Polly standard, neural, long-form, and generative voice engines?
  2. What is SSML and why would you use it instead of plain text for Polly synthesis?
  3. How would you build an automatic podcast generation pipeline using Polly for a content platform?
  4. What is the character limit for synchronous Polly synthesis and how do you handle longer texts?
  5. How would you integrate Polly with Amazon Lex to build a voice-based IVR system?
  6. How would you optimize Polly costs for a high-volume article read-aloud feature?