How to Turn Text into Audio Content with Text-to-Speech Technology
From audiobooks to podcasts to Alexa briefings, audio content is everywhere. The next step? Turn your text into audio content with text-to-speech technology.

You may have already heard of Amazon Polly, the service behind the voice for Alexa. Its text-to-speech (TTS) technology takes your written text and turns it into streaming or downloadable audio content you can share with your audience. We took it for a test drive – and you can listen to our results below.

Text-to-speech technology will probably never be seamless. There are always quirks of rhythm and pronunciation that will trip up the AI. However, we want to tell you about this emerging technology now so you can think about how it might fit into your business. First, we’ll explain how it works and show you our samples. Then we’ll brainstorm ways you can use it. That’s the fun part, and we hope it inspires you to think about different ways to create, transform, and distribute information for your clients.

Jump to a section:

How TTS Works
Our Samples
Takeaways & Thoughts
7 Ideas for Using TTS Technology

How TTS Works

Text-to-speech technology like Amazon Polly takes written text and turns it into an audio file using a trained AI. The AI synthesizes a human voice and reads your text out loud. Until recently, most synthesized voices sounded like robots.

Enter Amazon Polly. The service is free to try for one year, up to 5 million characters per month. After your free year, pricing is pay-as-you-go: 1 million characters cost $4.00 for standard voices and $16.00 for Neural (more advanced) voices. According to Amazon’s estimates, a typical three-page news article is about 6,500 characters, which would cost a whopping $0.03 (6,500 ÷ 1,000,000 × $4.00 ≈ $0.026) for standard voicing.
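If you want to estimate costs for your own content, the arithmetic is easy to script. Here's a minimal sketch in Python that uses only the prices quoted above (they may have changed since this was written). The function name and the price table are our own illustration, not part of any Amazon tool.

```python
# Prices per one million characters, as quoted in this article.
# Assumption: these figures may be out of date; check current pricing.
PRICE_PER_MILLION = {"standard": 4.00, "neural": 16.00}


def estimate_cost(num_characters: int, engine: str = "standard") -> float:
    """Return the estimated pay-as-you-go cost in US dollars."""
    return num_characters / 1_000_000 * PRICE_PER_MILLION[engine]


# The three-page news article from the example above.
article_chars = 6_500
print(f"standard: ${estimate_cost(article_chars, 'standard'):.3f}")
print(f"neural:   ${estimate_cost(article_chars, 'neural'):.3f}")
```

To price your own post, swap in its character count, for example `len(my_post_text)`.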

All you need to access the service and give it a try is a free Amazon Web Services (AWS) account.

You can convert up to 3,000 characters at a time and download the results right away. If you want to convert more than 3,000 characters, you just have to create a “bucket” for storage in your AWS account so the service can process your request and deposit your audio file in your bucket. We started with fewer than 3,000 characters so everything was easy and instantly accessible.
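We did everything in the web console, but the same choice comes up if you script your conversions. The sketch below assumes you'd use boto3, the AWS SDK for Python, and that the 3,000-character threshold described above applies to your text. It only builds the request settings, so it runs without an AWS account. The `plan_request` helper and the bucket name are hypothetical.

```python
MAX_SYNC_CHARS = 3_000  # threshold quoted in this article


def plan_request(text, voice_id="Joanna", engine="standard", bucket=None):
    """Pick the Polly call that fits the text length and build its arguments.

    Returns (method_name, kwargs). Short text goes to the immediate
    conversion; longer text goes to a task that writes to an S3 bucket.
    """
    kwargs = {
        "Text": text,
        "OutputFormat": "mp3",
        "VoiceId": voice_id,
        "Engine": engine,
    }
    if len(text) <= MAX_SYNC_CHARS:
        return "synthesize_speech", kwargs
    if bucket is None:
        raise ValueError("Text over 3,000 characters needs a storage bucket.")
    kwargs["OutputS3BucketName"] = bucket
    return "start_speech_synthesis_task", kwargs


method, kwargs = plan_request("Hello from our blog!")
print(method)

# With boto3 installed and AWS credentials configured, the call would
# look roughly like this (not run here):
#   import boto3
#   polly = boto3.client("polly")
#   response = getattr(polly, method)(**kwargs)
```

For a longer post, you'd pass your own bucket name, for example `plan_request(long_text, bucket="my-tts-audio-bucket")`.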

Our Samples

Here’s what the interface looks like – it’s pretty simple. Paste your text, select your variables (quality, language, and voice), then listen or download.

Screenshot of the Amazon Polly interface for inputting text and selecting voice options

You can listen to a conversion as many times as you want, and you can switch voices, too. Amazon Polly offers male and female voices in a range of languages. For American English (as opposed to British English), you have five female and three male voices to choose from (Ivy and Justin are kids’ voices).

Screenshot of the Amazon Polly voice options for American English

First, we grabbed about 1,000 words from a previous blog post – we chose this one. Next, we pasted our text into the conversion window and hit “Listen to speech.” For our first pass, we used the standard quality rather than upgraded Neural quality as a baseline.

Then we listened to the output in every adult voice to see which was the smoothest. Surprisingly, there are human-like intonations in each one. Some had more inflection with questions, for example. The standard engine had a little trouble with tricky word combinations like “don’t want” – it stumbled a little on the “t” to “w” transition. The Neural engine handled it better.

Quick Fixes for Audio Text

Our last step was to smooth out the text we’d pasted to make it easier for the AI to read. For example, we noticed an unnatural emphasis in this sentence: “What kinds of things might they ask for?” You or I would put the emphasis on “ask,” but the AI put it on “for.” So we changed that sentence to “What kinds of things might they want?” Problem solved.

We also added transition sentences bookending the list of sample client responses in the source post. If you’re reading that post, the quotation marks and line spacing make it clear you’re seeing sample spoken answers from clients. But if you’re listening, that’s not so clear, hence the additional text.

Screenshot of the original post we used for our TTS sample, showing the visual cues audio listeners don't have, like quotation marks and line breaks.

Also, we noticed that the AI was reading our colloquial “Guess what?” as an actual question with a distinct emphasis on the second word. A real person would correctly interpret that as more of a transition statement than an actual question, so we changed the question mark to a period. It worked – the AI’s voice now read this with a slight emphasis on the first word instead of a strong emphasis on the second.
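We made these tweaks by hand, but if you convert posts regularly, you could keep a running list of fixes and apply them before each conversion. Here's a minimal sketch using the two changes described above. The `AUDIO_FIXES` list and the `prepare_for_audio` function are our own illustration.

```python
# Each pair is (text as written for readers, text that sounds better aloud).
AUDIO_FIXES = [
    ("What kinds of things might they ask for?",
     "What kinds of things might they want?"),
    ("Guess what?", "Guess what."),
]


def prepare_for_audio(text, fixes=AUDIO_FIXES):
    """Apply every listed substitution to the text before conversion."""
    for written, spoken in fixes:
        text = text.replace(written, spoken)
    return text


sample = "Guess what? Your clients have questions."
print(prepare_for_audio(sample))
```

You'd still want to listen to the result, since new posts will bring new trouble spots to add to the list.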

The Results

After we made those tweaks, we downloaded MP3s of our favorite female voice (Joanna) and our favorite male voice (Matthew) in both engines to see if Neural outperformed standard. Here are the two standard samples:

Listen to the "Joanna" female voice
Listen to the "Matthew" male voice

What do you think?

As we mentioned, they’re not perfect. You can still hear rough spots where the AI stumbles. The emphasis is in the wrong place in the word “grandkids” and the phrase “stock market,” for example.

So we re-did the samples in the more advanced Neural engine. Here we go again:

Listen to the "Joanna" female voice
Listen to the "Matthew" male voice

Much better, right? “Grandkids” came out much smoother, and “stock market” isn’t perfect, but it’s better!

Takeaways & Thoughts

Okay, so this AI stuff is great and all, but what are we actually supposed to do with it?

People are using Amazon Polly to turn their blogs and books into podcasts. If the word “podcast” makes you think of a broadcast with a huge following, keep in mind that not all podcasts are intended for mass distribution. Plenty of podcasts have limited runs, minimal or custom distribution, and are intended for a small audience.

We listened to one eBook-to-podcast conversion created using Amazon Polly. Each episode included a brief introduction from the author explaining what the eBook is about. Each intro also pointed out that the podcast was created using text-to-speech technology. You can even position that as a bonus: you’re using cutting-edge technology to bring your words to your listeners!

Using text-to-speech does have one big drawback: it’s not perfect. But maybe this is one of those cases where “perfect” is the enemy of “done.” Are these AI voice-overs good enough? Do they get the information into the hands (er, ears) of your clients and prospects in a comprehensible way? Are they a good way to test the waters to see if your audience responds to audio content? If your answer is “yes” or even “maybe,” browse the list below for ideas on what you can do with this technology.

7 Ideas for Using TTS Technology

Okay, here’s the fun part. We started brainstorming ways you can use text-to-speech technology to expand your content’s reach and better serve your clients and prospects.

1. Turn your blog or eBook into a podcast

Like the idea of having a podcast but aren’t sure you’re ready to commit the time or resources? Test the waters by turning some of your most popular or useful posts into audio files, then use a free podcasting tool like Anchor to upload them. Anchor handles distribution to major podcatchers for you, so it’s the easiest, most hands-free way to get started.

2. Turn your eBook into an audiobook

Do you have a free eBook you offer as a lead magnet? Why not turn it into an audiobook and offer that version, too?

3. Turn your website content into a podcast or audiobook

Your website content probably explains what life insurance is, who needs it, and what kinds are available. Why not turn that content into a podcast or audiobook? Alternatively, it could be the starting point for a YouTube video series.

4. Turn your newsletter into a podcast

Do you email your clients on a regular basis with a newsletter? You could turn it into a podcast using Amazon Polly and Anchor (the free podcast distribution tool mentioned in idea #1). For clients who prefer to listen rather than read, this ensures your info gets to them in the medium they prefer.

5. Create audio versions of your web content

Many visually impaired folks already use screen readers to help them browse the web. But those screen reader voices are often pretty robotic. Using Amazon Polly, you can post audio versions of your web pages for anyone who wants to listen instead of read. That includes people with visual impairments, people who don’t read well, or people who want the audio/visual experience to fully absorb the details of what they’re reading. If your site is in WordPress, there’s a plugin that will automatically create audio recordings in Amazon Polly for new content after you publish.

6. Give chatbots on your site a voice

Do you have automated chatbots on your website? If so, you can connect them to Amazon Polly and get TTS voicing on the fly.

7. Turn presentations into podcasts or videos for YouTube

Do you have PowerPoint presentations with text? Copy that text, paste it into Amazon Polly, and create an audio narration. You can even control the timing of the narration with a simple SSML tag insertion (<break time="3s"/>). Next, add that audio as an MP3 file to your presentation. Finally, export your slideshow as a video for YouTube or social sharing.
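If you have a lot of slides, typing the break tag by hand gets tedious. Here's a rough sketch of how you might assemble the narration text in Python, with the same `<break time="3s"/>` tag between slides. The `slides_to_ssml` function is our own illustration, and we're assuming one pause length works for every slide. Special characters such as `&` are escaped because SSML is a form of XML.

```python
from xml.sax.saxutils import escape


def slides_to_ssml(slide_texts, pause_seconds=3):
    """Join slide text into one SSML string with a pause between slides."""
    pause = f'<break time="{pause_seconds}s"/>'
    # Escape &, < and > so the slide text can't break the SSML markup.
    body = pause.join(escape(text) for text in slide_texts)
    return f"<speak>{body}</speak>"


slides = ["Welcome.", "Term life & whole life."]
print(slides_to_ssml(slides))
# prints: <speak>Welcome.<break time="3s"/>Term life &amp; whole life.</speak>
```

If you send this through the API rather than pasting it into the console, the request also needs to say the input is SSML rather than plain text (in boto3, that's `TextType="ssml"`).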

That’s our look at how to turn text into audio content with text-to-speech technology!

Does this technology scare you? Excite you? Give you ideas on repurposing your existing content? Tell us all about it in the comments.