Polly

Convert text to natural-sounding speech with neural TTS voices in 30+ languages

Amazon Polly is a text-to-speech (TTS) service that uses deep learning to convert text into natural-sounding audio. It supports 30+ languages and 90+ voices across its standard and neural-based voice engines. Polly is commonly used for voice interfaces, accessibility features, e-learning content, and IVR systems. For cloud engineers, understanding Polly means knowing how to integrate speech synthesis into serverless pipelines and how to manage the tradeoff between voice quality and cost.

Standard vs Neural vs Long-Form vs Generative Voices

Polly offers four voice engine types with different quality levels and pricing.

Engine	Technology	Quality	Price (per million chars)	Best For
Standard	Concatenative synthesis	Good - slight robotic quality	$4.00	High-volume, cost-sensitive workloads
Neural	Neural TTS (NTTS)	Very good - natural prosody	$16.00	Customer-facing applications
Long-form	Neural optimized for long content	Excellent - podcast/narration quality	$100.00	Audiobooks, e-learning, long narrations
Generative	Generative AI voice synthesis	Human-like	$30.00	Premium interactive experiences

💡

Not all voices are available in all engines. Always check voice availability for your target language and engine before committing to an architecture. The aws polly describe-voices CLI command lists all available combinations.
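As a rough sketch of that availability check in Python, the helper below filters DescribeVoices-style records for voices that support a given engine. The function names `voices_for_engine` and `fetch_voices` and the voice IDs in `sample` are illustrative assumptions; the `Id`, `LanguageCode`, and `SupportedEngines` fields are the ones returned by the Polly `DescribeVoices` API. `fetch_voices` assumes boto3 is installed and AWS credentials are configured.

```python
def voices_for_engine(voices, engine, language_code=None):
    """Return sorted voice IDs from DescribeVoices-style dicts that support `engine`."""
    matches = []
    for voice in voices:
        if engine not in voice.get("SupportedEngines", []):
            continue
        if language_code and voice.get("LanguageCode") != language_code:
            continue
        matches.append(voice["Id"])
    return sorted(matches)


def fetch_voices(region_name="us-east-1"):
    """Fetch every voice from Polly, following NextToken pagination."""
    import boto3  # imported lazily so the filter above works without AWS access

    polly = boto3.client("polly", region_name=region_name)
    voices, kwargs = [], {}
    while True:
        page = polly.describe_voices(**kwargs)
        voices.extend(page["Voices"])
        token = page.get("NextToken")
        if not token:
            return voices
        kwargs = {"NextToken": token}


# Hypothetical records shaped like the "Voices" list in a DescribeVoices response.
sample = [
    {"Id": "VoiceA", "LanguageCode": "en-US", "SupportedEngines": ["neural", "standard"]},
    {"Id": "VoiceB", "LanguageCode": "en-US", "SupportedEngines": ["standard"]},
    {"Id": "VoiceC", "LanguageCode": "fr-FR", "SupportedEngines": ["neural"]},
]
neural_en_us = voices_for_engine(sample, "neural", language_code="en-US")
```

Against a live account, `voices_for_engine(fetch_voices(), "neural", "en-US")` would return the real voice IDs for that combination.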

SSML - Fine-Grained Speech Control

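Speech Synthesis Markup Language (SSML) is an XML-based language that lets you control pronunciation, rate, pitch, volume, pauses, and emphasis in Polly output. It is essential for production-quality TTS. Tag support varies by engine: neural voices, for example, do not support <emphasis> or the pitch attribute of <prosody>, so check the supported-tags list for the engine you plan to use.

xml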

<!-- SSML example controlling speech characteristics -->
<speak>
    Welcome to <emphasis level="strong">Ace Cloud Interviews</emphasis>.
    <break time="500ms"/>
    Today we will cover 
    <prosody rate="slow" pitch="+2st">Amazon SageMaker</prosody>.
    <break time="300ms"/>
    The first question is:
    <prosody volume="loud" rate="90%">
        How does SageMaker managed spot training handle interruptions?
    </prosody>
    <break time="1s"/>
    Before we answer, note that <sub alias="Amazon Simple Storage Service">S3</sub>
    is used to store training checkpoints.
</speak>

SSML Tag	Purpose	Example
<break>	Add silence/pause	<break time="500ms"/> or <break strength="strong"/>
<emphasis>	Stress words	<emphasis level="strong">critical</emphasis>
<prosody>	Control rate, pitch, volume	<prosody rate="slow" pitch="+5st">text</prosody>
<sub>	Pronunciation substitution	<sub alias="Amazon S3">S3</sub>
<phoneme>	Phonetic pronunciation	<phoneme alphabet="ipa" ph="ˈti.bi.es">TBS</phoneme>
<lang>	Switch language mid-speech	<lang xml:lang="fr-FR">Bonjour</lang>
<speak>	Root element wrapper	Always wraps SSML content
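When SSML is generated from user-supplied or scraped text, characters such as `&` and `<` need escaping or the request will be rejected as invalid SSML. The sketch below shows one way to build a small SSML document in Python and pass it to `synthesize_speech` with `TextType="ssml"`. The helper names `build_ssml` and `synthesize_ssml` are assumptions for illustration, and the voice and engine defaults are just examples.

```python
from xml.sax.saxutils import escape, quoteattr


def build_ssml(sentences, pause_ms=500, rate="medium"):
    """Wrap plain sentences in <speak>/<prosody>, separated by <break> tags."""
    pause = f'<break time="{int(pause_ms)}ms"/>'
    # escape() converts &, <, and > so the text cannot break the XML structure.
    body = pause.join(escape(sentence) for sentence in sentences)
    return f"<speak><prosody rate={quoteattr(rate)}>{body}</prosody></speak>"


def synthesize_ssml(ssml, voice_id="Joanna", engine="neural"):
    """Send SSML to Polly and return the MP3 bytes (requires AWS credentials)."""
    import boto3  # imported lazily so build_ssml can be used without AWS access

    polly = boto3.client("polly")
    response = polly.synthesize_speech(
        Engine=engine,
        OutputFormat="mp3",
        Text=ssml,
        TextType="ssml",
        VoiceId=voice_id,
    )
    return response["AudioStream"].read()


ssml = build_ssml(["R&D update", "Q < 4"], pause_ms=300)
```

Only the rate attribute is used here because it is among the prosody controls that neural voices accept.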

Common Architecture Patterns with Polly

Polly fits into several common serverless pipeline patterns.

Pattern	Architecture	Use Case
Synchronous TTS	API call -> Polly SynthesizeSpeech -> stream audio bytes to client	Short texts (<3000 chars), real-time voice response
Async S3 pipeline	Lambda -> StartSpeechSynthesisTask -> S3 MP3 -> SNS notify	Long texts, background audio generation for articles
Content pipeline	S3 text upload -> Lambda -> Polly -> S3 audio -> CloudFront CDN	E-learning audio, podcast generation at scale
Lex + Polly voice bot	Amazon Lex handles NLU -> Polly renders bot responses as speech	IVR systems, voice assistants
Accessibility layer	Page text -> API Gateway -> Lambda -> Polly -> browser audio player	Read-aloud features for websites

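python

# Async synthesis task (for texts > 3000 characters)
import boto3

polly = boto3.client('polly')

# Load the text to synthesize (article.txt is a hypothetical input file)
with open('article.txt', encoding='utf-8') as f:
    long_article_text = f.read()

response = polly.start_speech_synthesis_task(
    Engine='neural',
    LanguageCode='en-US',
    OutputFormat='mp3',
    OutputS3BucketName='my-audio-bucket',   # bucket that receives the MP3
    OutputS3KeyPrefix='articles/',
    Text=long_article_text,
    TextType='text',
    VoiceId='Joanna',
    # Optional: topic notified when the task finishes
    SnsTopicArn='arn:aws:sns:us-east-1:123456789012:polly-complete'
)
# Keep the task ID to poll status later with get_speech_synthesis_task
task_id = response['SynthesisTask']['TaskId']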
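For the content pipeline pattern in the table above, the Lambda step could look roughly like the sketch below: it reads each uploaded text object from the S3 event and starts one synthesis task per object. The bucket name, key prefix, and the optional `s3`/`polly` parameters (added so the handler can be exercised with stand-in clients) are assumptions for illustration, not a prescribed layout.

```python
import urllib.parse

OUTPUT_BUCKET = "my-audio-bucket"  # hypothetical output bucket


def handler(event, context, s3=None, polly=None):
    """S3-triggered Lambda: start a Polly synthesis task for each uploaded text file."""
    if s3 is None or polly is None:
        import boto3  # available by default in the Lambda Python runtime

        s3 = s3 or boto3.client("s3")
        polly = polly or boto3.client("polly")

    task_ids = []
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        # Object keys arrive URL-encoded in S3 event notifications.
        key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])
        text = s3.get_object(Bucket=bucket, Key=key)["Body"].read().decode("utf-8")

        response = polly.start_speech_synthesis_task(
            Engine="neural",
            OutputFormat="mp3",
            OutputS3BucketName=OUTPUT_BUCKET,
            OutputS3KeyPrefix="articles/",
            Text=text,
            TextType="text",
            VoiceId="Joanna",
        )
        task_ids.append(response["SynthesisTask"]["TaskId"])
    return {"task_ids": task_ids}
```

Writing audio to a different bucket (or prefix) than the one that triggers the function avoids the function re-invoking itself on its own output.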

⚠️

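SynthesizeSpeech (synchronous) accepts up to 3,000 billed characters per request (6,000 characters in total once SSML tags are included). Use StartSpeechSynthesisTask for longer content, up to 100,000 billed characters per task. The async task writes the audio file directly to S3 and, if an SNS topic is configured, publishes a notification when the task completes.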
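When a low-latency response matters more than a single output file, another option is to stay on the synchronous API and split the text into pieces under the limit. The sketch below splits plain text at sentence boundaries; the function names and the 3,000-character default are assumptions based on the limit described above, and Python's `len()` is only an approximation of how Polly counts characters.

```python
import re


def split_for_polly(text, max_chars=3000):
    """Split text into chunks of at most max_chars, preferring sentence boundaries."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        # A single sentence longer than the limit is hard-split.
        while len(sentence) > max_chars:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(sentence[:max_chars])
            sentence = sentence[max_chars:]
        candidate = f"{current} {sentence}".strip()
        if len(candidate) <= max_chars:
            current = candidate
        else:
            chunks.append(current)
            current = sentence
    if current:
        chunks.append(current)
    return chunks


def synthesize_chunks(chunks, voice_id="Joanna", engine="neural"):
    """Synthesize each chunk synchronously and return a list of MP3 byte strings."""
    import boto3  # imported lazily so split_for_polly works without AWS access

    polly = boto3.client("polly")
    audio_parts = []
    for chunk in chunks:
        response = polly.synthesize_speech(
            Engine=engine, OutputFormat="mp3", Text=chunk, VoiceId=voice_id
        )
        audio_parts.append(response["AudioStream"].read())
    return audio_parts
```

Each chunk is billed the same as it would be in one request, so this approach trades a single async task for several smaller synchronous calls rather than changing the cost.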

Polly Pricing and Cost Optimization

Engine	Price	Free Tier
Standard voices	$4.00 per 1M characters	5M characters/month free for 12 months
Neural voices	$16.00 per 1M characters	1M characters/month free for 12 months
Long-form voices	$100.00 per 1M characters	500K characters/month free for 12 months
Generative voices	$30.00 per 1M characters	100K characters/month free for 12 months
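To compare engines for a given workload, the arithmetic can be sketched as below, using the per-million-character prices from the table. The function name, the `free_characters` parameter, and the example workload (2,000 articles of 6,000 characters each) are assumptions for illustration; check current AWS pricing before relying on the numbers.

```python
# Prices in USD per 1M characters, taken from the table above.
PRICE_PER_MILLION_USD = {
    "standard": 4.00,
    "neural": 16.00,
    "long-form": 100.00,
    "generative": 30.00,
}


def estimate_monthly_cost(characters, engine, free_characters=0):
    """Estimate the monthly cost in USD for `characters` synthesized with `engine`."""
    billable = max(0, characters - free_characters)
    return round(billable / 1_000_000 * PRICE_PER_MILLION_USD[engine], 2)


# Hypothetical workload: 2,000 articles per month, 6,000 characters each.
monthly_chars = 2_000 * 6_000
standard_cost = estimate_monthly_cost(monthly_chars, "standard")
neural_cost = estimate_monthly_cost(monthly_chars, "neural")
```

For this workload the neural engine costs four times as much as standard, which is why caching generated audio in S3 and re-synthesizing only changed articles is usually the first optimization to consider.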

💡

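Polly bills for the characters of text it synthesizes. SSML tags are not counted as billed characters, but they do count toward the total request size limit, so a heavily tagged SSML document can reach that limit well before its visible text does. The RequestCharacters value returned by SynthesizeSpeech reports the number of characters synthesized, which makes it a useful metric to monitor in production.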

🎯

Interview Focus Points

1. What is the difference between Polly standard, neural, long-form, and generative voice engines?
2. What is SSML and why would you use it instead of plain text for Polly synthesis?
3. How would you build an automatic podcast generation pipeline using Polly for a content platform?
4. What is the character limit for synchronous Polly synthesis and how do you handle longer texts?
5. How would you integrate Polly with Amazon Lex to build a voice-based IVR system?
6. How would you optimize Polly costs for a high-volume article read-aloud feature?