AWS AI & Machine Learning
Polly
Convert text to natural-sounding speech with neural TTS voices in 30+ languages
Amazon Polly is a text-to-speech service that uses deep learning to convert text into natural-sounding audio. It supports 30+ languages and 90+ voices across its standard, neural, long-form, and generative engines. Polly is commonly used for voice interfaces, accessibility features, e-learning content, and IVR systems. For cloud engineers, understanding Polly means knowing how to integrate speech synthesis into serverless pipelines and how to manage the tradeoff between voice quality and cost.
Standard vs Neural vs Long-Form vs Generative Voices
Polly offers four voice engine types with different quality levels and pricing.
| Engine | Technology | Quality | Price (per million chars) | Best For |
|---|---|---|---|---|
| Standard | Concatenative synthesis | Good - slight robotic quality | $4.00 | High-volume, cost-sensitive workloads |
| Neural | Neural TTS (NTTS) | Very good - natural prosody | $16.00 | Customer-facing applications |
| Long-form | Neural optimized for long content | Excellent - podcast/narration quality | $100.00 | Audiobooks, e-learning, long narrations |
| Generative | Generative AI voice synthesis | Human-like | $30.00 | Premium interactive experiences |
Not all voices are available in all engines. Always check voice availability for your target language and engine before committing to an architecture. The aws polly describe-voices CLI command lists all available combinations.
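The same availability check can be scripted. The sketch below is a minimal illustration that assumes boto3 is installed and AWS credentials are configured; the helper name `available_voice_ids` and the optional `client` parameter (added so the function can be exercised without AWS access) are hypothetical, while `describe_voices` and its `Engine`, `LanguageCode`, and `NextToken` parameters are part of the Polly API.

```python
def available_voice_ids(language_code, engine, client=None):
    """Return the IDs of voices that support `engine` for `language_code`."""
    if client is None:
        import boto3  # imported lazily so a stub client can be passed in tests
        client = boto3.client("polly")

    voice_ids = []
    next_token = None
    while True:
        # Build fresh arguments for each page; NextToken is only sent
        # when a previous page returned one.
        kwargs = {"Engine": engine, "LanguageCode": language_code}
        if next_token:
            kwargs["NextToken"] = next_token
        page = client.describe_voices(**kwargs)
        voice_ids.extend(voice["Id"] for voice in page["Voices"])
        next_token = page.get("NextToken")
        if not next_token:
            return voice_ids
```

Calling `available_voice_ids("en-US", "neural")` before fixing a voice in configuration is one way to catch an unsupported voice and engine pairing early.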
SSML - Fine-Grained Speech Control
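Speech Synthesis Markup Language (SSML) is an XML-based language that lets you control pronunciation, rate, pitch, volume, pauses, and emphasis in Polly output. It is essential for production-quality TTS. Tag support varies by engine: neural voices, for example, do not support <emphasis> or the pitch attribute of <prosody>, so test your SSML against the engine you plan to use.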
```xml
<!-- SSML example controlling speech characteristics -->
<speak>
  Welcome to <emphasis level="strong">Ace Cloud Interviews</emphasis>.
  <break time="500ms"/>
  Today we will cover
  <prosody rate="slow" pitch="+2st">Amazon SageMaker</prosody>.
  <break time="300ms"/>
  The first question is:
  <prosody volume="loud" rate="90%">
    How does SageMaker managed spot training handle interruptions?
  </prosody>
  <break time="1s"/>
  Before we answer, note that <sub alias="Amazon Simple Storage Service">S3</sub>
  is used to store training checkpoints.
</speak>
```
| SSML Tag | Purpose | Example |
|---|---|---|
| <break> | Add silence/pause | <break time="500ms"/> or <break strength="strong"/> |
| <emphasis> | Stress words | <emphasis level="strong">critical</emphasis> |
| <prosody> | Control rate, pitch, volume | <prosody rate="slow" pitch="+5st">text</prosody> |
| <sub> | Pronunciation substitution | <sub alias="Amazon S3">S3</sub> |
| <phoneme> | Phonetic pronunciation | <phoneme alphabet="ipa" ph="ˈti.bi.es">TBS</phoneme> |
| <lang> | Switch language mid-speech | <lang xml:lang="fr-FR">Bonjour</lang> |
| <speak> | Root element wrapper | Always wraps SSML content |
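When SSML is generated from arbitrary text, characters such as `&` and `<` have to be escaped or the document will not parse. The following is a minimal sketch of one way to assemble SSML from plain sentences using only the Python standard library. The helper name `build_ssml` and its parameters are hypothetical, and it covers only the `<speak>`, `<break>`, and `<sub>` tags from the table.

```python
import re
from xml.sax.saxutils import escape, quoteattr


def build_ssml(sentences, pause_ms=500, aliases=None):
    """Join sentences into one SSML document with a pause between them.

    `aliases` maps an abbreviation (for example "S3") to the phrase that
    should be spoken in its place, rendered with a <sub> tag.
    """
    aliases = aliases or {}
    pattern = None
    if aliases:
        # One pass over whole-word matches, so inserted markup is never re-scanned.
        words = "|".join(re.escape(word) for word in aliases)
        pattern = re.compile(r"\b(" + words + r")\b")

    def mark_up(sentence):
        if pattern is None:
            return escape(sentence)
        pieces = []
        # split() with a capture group keeps the matched abbreviations.
        for piece in pattern.split(sentence):
            if piece in aliases:
                alias = quoteattr(aliases[piece])
                pieces.append(f"<sub alias={alias}>{escape(piece)}</sub>")
            else:
                pieces.append(escape(piece))  # escapes &, <, >
        return "".join(pieces)

    pause = f'<break time="{int(pause_ms)}ms"/>'
    return "<speak>" + pause.join(mark_up(s) for s in sentences) + "</speak>"
```

The returned string would be passed to Polly as the `Text` argument together with `TextType='ssml'`.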
Common Architecture Patterns with Polly
Polly fits into several common serverless pipeline patterns.
| Pattern | Architecture | Use Case |
|---|---|---|
| Synchronous TTS | API call -> Polly SynthesizeSpeech -> stream audio bytes to client | Short texts (<3000 chars), real-time voice response |
| Async S3 pipeline | Lambda -> StartSpeechSynthesisTask -> MP3 written to S3 -> SNS completion notification | Long texts, background audio generation for articles |
| Content pipeline | S3 text upload -> Lambda -> Polly -> S3 audio -> CloudFront CDN | E-learning audio, podcast generation at scale |
| Lex + Polly voice bot | Amazon Lex handles NLU -> Polly renders bot responses as speech | IVR systems, voice assistants |
| Accessibility layer | Page text -> API Gateway -> Lambda -> Polly -> browser audio player | Read-aloud features for websites |
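For the synchronous pattern in the first row, a minimal sketch might look like the following. It assumes boto3 and configured credentials; the helper name `synthesize_to_file` and the optional `client` parameter are hypothetical, while `synthesize_speech`, the `AudioStream` body, and the `RequestCharacters` field come from the Polly API.

```python
def synthesize_to_file(text, out_path, voice_id="Joanna", engine="neural", client=None):
    """Synthesize a short text synchronously and write the MP3 to out_path.

    Returns the number of characters Polly reports as synthesized.
    """
    if client is None:
        import boto3  # imported lazily so a stub client can be passed in tests
        client = boto3.client("polly")

    response = client.synthesize_speech(
        Engine=engine,
        OutputFormat="mp3",
        Text=text,
        TextType="text",
        VoiceId=voice_id,
    )

    # AudioStream is a streaming body; read it once and persist the bytes.
    with open(out_path, "wb") as audio_file:
        audio_file.write(response["AudioStream"].read())

    return response["RequestCharacters"]
```

In an API-backed design, the same bytes would be returned to the caller instead of written to disk. Writing to a file simply keeps the sketch easy to try locally.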
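```python
# Async synthesis task (for texts > 3000 characters)
import boto3

polly = boto3.client('polly')

# Placeholder input: substitute the full article text to synthesize.
long_article_text = "Replace this with the article text."

response = polly.start_speech_synthesis_task(
    Engine='neural',
    LanguageCode='en-US',
    OutputFormat='mp3',
    OutputS3BucketName='my-audio-bucket',  # bucket that receives the MP3
    OutputS3KeyPrefix='articles/',
    Text=long_article_text,
    TextType='text',
    VoiceId='Joanna',
    # Topic that is notified when the task finishes
    SnsTopicArn='arn:aws:sns:us-east-1:123456789012:polly-complete'
)

# Keep the task ID to look up status or match the SNS notification later.
task_id = response['SynthesisTask']['TaskId']
```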
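SynthesizeSpeech (synchronous) accepts at most 3,000 billed characters per request (6,000 characters in total, including SSML tags). Use StartSpeechSynthesisTask for longer content; it accepts up to 100,000 billed characters (200,000 in total). The async task writes the MP3 directly to S3 and, if an SNS topic is configured, publishes a notification when the task completes.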
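Where no SNS subscriber is in place, for example in a quick script, polling the task is a simple alternative. The sketch below assumes boto3 and configured credentials; the helper name `wait_for_task`, its timing parameters, and the optional `client` parameter are hypothetical, while `get_speech_synthesis_task` and the `TaskStatus`, `TaskStatusReason`, and `OutputUri` fields are part of the Polly API.

```python
import time


def wait_for_task(task_id, client=None, poll_seconds=5.0, timeout_seconds=600.0):
    """Poll an async synthesis task and return its S3 output URI when done."""
    if client is None:
        import boto3  # imported lazily so a stub client can be passed in tests
        client = boto3.client("polly")

    deadline = time.monotonic() + timeout_seconds
    while True:
        task = client.get_speech_synthesis_task(TaskId=task_id)["SynthesisTask"]
        status = task["TaskStatus"]  # scheduled | inProgress | completed | failed
        if status == "completed":
            return task["OutputUri"]
        if status == "failed":
            raise RuntimeError(task.get("TaskStatusReason", "synthesis task failed"))
        if time.monotonic() >= deadline:
            raise TimeoutError(f"task {task_id} still {status} after {timeout_seconds}s")
        time.sleep(poll_seconds)
```

For production pipelines the SNS notification is usually the better fit, since it avoids keeping a Lambda function or worker alive just to wait.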
Polly Pricing and Cost Optimization
| Engine | Price | Free Tier |
|---|---|---|
| Standard voices | $4.00 per 1M characters | 5M characters/month free for 12 months |
| Neural voices | $16.00 per 1M characters | 1M characters/month free for 12 months |
| Long-form voices | $100.00 per 1M characters | 500K characters/month free for 12 months |
| Generative voices | $30.00 per 1M characters | 100K characters/month free for 12 months |
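Polly bills for the characters of text it synthesizes. SSML tags are not counted as billed characters, but they do count toward the total request size limit, so a heavily tagged document can reach that limit well before its visible text would. Each SynthesizeSpeech response reports the number of characters synthesized in RequestCharacters; track it in production to monitor spend.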
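A back-of-the-envelope estimate helps when comparing engines. The sketch below uses only the list prices from the table above. The function name `estimate_monthly_cost` and the `free_characters` parameter are hypothetical, and real bills depend on current AWS pricing and region.

```python
# List prices in USD per one million characters, taken from the table above.
PRICE_PER_MILLION_USD = {
    "standard": 4.00,
    "neural": 16.00,
    "long-form": 100.00,
    "generative": 30.00,
}


def estimate_monthly_cost(characters, engine, free_characters=0):
    """Estimate the monthly charge in USD for `characters` synthesized.

    `free_characters` is any free-tier allowance to subtract first.
    """
    billable = max(0, characters - free_characters)
    return billable / 1_000_000 * PRICE_PER_MILLION_USD[engine]
```

For example, 2 million characters a month on neural voices comes to 2 x $16.00 = $32.00, while the same volume on standard voices costs $8.00 before any free tier is applied.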
Interview Focus Points
1. What is the difference between Polly standard, neural, long-form, and generative voice engines?
2. What is SSML and why would you use it instead of plain text for Polly synthesis?
3. How would you build an automatic podcast generation pipeline using Polly for a content platform?
4. What is the character limit for synchronous Polly synthesis and how do you handle longer texts?
5. How would you integrate Polly with Amazon Lex to build a voice-based IVR system?
6. How would you optimize Polly costs for a high-volume article read-aloud feature?