50 AI Voices: Which Reigns Supreme?

Text-to-speech (TTS) tech is getting crazy good, and I’m on a mission to compare every voice available to me. Let’s start with three I’ve been using lately:

  • Amazon Polly, historically a quick, cheap workhorse for informational speech.

  • ElevenLabs, which has led on quality and expressiveness.

  • OpenAI Text-to-Speech, six brand-spanking-new voices from the creators of ChatGPT!

Here’s as exhaustive a survey as I could manage of all* the voices for these three services, posted here because someone suggested I stick ‘em all in one place (this one’s for you, Ethan!). Scroll on down if you just want to hear the samples.

(*Amazon Polly neural voices, ElevenLabs’s base voices, and OpenAI’s 6 new ones.)

First: A Bit of History

Text-to-speech (TTS) tech for consumers has come a long way. I remember playing with early attempts—the TI 99/4A had an add-on hardware module that offered pretty basic, robotic speech:

With the Alien Voice Box, the Atari 800 offered an improvement:

Flash forward a handful of decades, and by the 2010s, TTS voices had improved a ton. Here's the original Apple Siri commercial:

As a game developer who’s done a lot of work with procedural/dynamic narrative, I’m excited about the prospect of characters giving performances that won’t break the player out of the moment. Here in the far future of 2023, AI/ML TTS offers more varied voices and more natural, expressive speech:

ElevenLabs - Elli

While not perfect, that’s not too darned shabby for a computer. This was ElevenLabs, fed with straight text—no information about delivery or nothin’.

Three TTS Services

I picked these services because they’re reasonably priced, robust, and have APIs for programmatic speech generation. They’re all speaking the same phrase:

Welcome, weary traveler, to the tender embrace of tonight's moonlight serenade, a soothing lullaby caressed by the whispers of distant crickets. The veil of the night is drawn, and the world is drenched in an iridescent silver hue. The moon, a radiant orb, hangs in the sapphire sky, casting shadows and illusions while weaving a tapestry of tranquility.

That’s a kinda-nonsense ASMR script, courtesy of ChatGPT (GPT-4).

First: Amazon Polly Neural

Amazon Polly is inexpensive at $16 per million characters, and it’s quick: each of the following took less than a second to return an OGG. I’m exclusively using their “Neural” voices, which they claim are their best-quality ones:

Amazon Polly - Amy

The voice stumbles on the intro (“Welcome… Weary traveler…”), but is otherwise not too bad. Here are the rest of their English language Neural voices:

Amazon Polly - Aria
Amazon Polly - Arthur
Amazon Polly - Ayanda
Amazon Polly - Brian
Amazon Polly - Danielle
Amazon Polly - Emma
Amazon Polly - Gregory
Amazon Polly - Ivy
Amazon Polly - Joanna
Amazon Polly - Joey
Amazon Polly - Justin
Amazon Polly - Kajal
Amazon Polly - Kendra
Amazon Polly - Kevin
Amazon Polly - Kimberly
Amazon Polly - Matthew
Amazon Polly - Niamh
Amazon Polly - Olivia
Amazon Polly - Ruth
Amazon Polly - Salli
Amazon Polly - Stephen

I like Polly. It’s quick—it took less than a half second round-trip for each of those. It’s historically been one of the cheaper options, at $0.000016/character input. And the voice isn’t super expressive, but it’s solid enough to get a point across.

I’d use Amazon Polly whenever I wanted to generate a ton of informational text quickly.
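If you want to try this yourself, here's a rough sketch of how a request like the ones above could be made and timed from Python. It's not the exact script I used. It assumes the boto3 library with AWS credentials already configured, and the helper name `synthesize_to_file` is made up for illustration.

```python
import time
from pathlib import Path


def synthesize_to_file(polly_client, text, voice_id, out_path):
    """Ask Polly for neural speech as OGG, save it, and time the round trip.

    `polly_client` is expected to be a boto3 Polly client
    (boto3.client("polly")). This helper is a hypothetical wrapper,
    not part of any library.
    """
    start = time.perf_counter()
    response = polly_client.synthesize_speech(
        Text=text,
        VoiceId=voice_id,           # e.g. "Amy", "Joanna", "Matthew"
        Engine="neural",            # the Neural voices used in this post
        OutputFormat="ogg_vorbis",  # returns an OGG stream
    )
    audio = response["AudioStream"].read()
    elapsed = time.perf_counter() - start

    Path(out_path).write_bytes(audio)
    return len(audio), elapsed


# Usage (needs boto3 installed and AWS credentials set up):
#
#   import boto3
#   client = boto3.client("polly")
#   size, seconds = synthesize_to_file(
#       client, "Welcome, weary traveler.", "Amy", "amy.ogg")
#   print(f"{size} bytes in {seconds:.2f}s")
```

Passing the client in as an argument keeps the helper easy to test and makes it simple to loop over a list of voice IDs with one shared client.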

Second: OpenAI’s Six New Voices

Hot off the presses, OpenAI, creators of ChatGPT, have their first batch of a half dozen voices:

OpenAI - Alloy
OpenAI - Echo
OpenAI - Fable
OpenAI - Nova
OpenAI - Onyx
OpenAI - Shimmer

They’ve priced this at $0.015 per 1k characters. To my ears, these sound pretty good. We're still getting a weird pause after "Welcome," but each of those has a solid cadence, and Fable even does a passable Daniel Radcliffe.

They still strike me as being most appropriate for narrative/information—here’s their Fable voice with the line that ElevenLabs Elli did, above:

OpenAI - Fable is Furious

Fable sounds like he’s reading a script about being furious, but is not actually very angry at all.

Third: ElevenLabs

ElevenLabs is my pick for the most expressive of the three, with the most natural/interesting delivery and the greatest breadth of voices:

ElevenLabs - Adam

Right off the bat, the initial pause in “Welcome, weary traveler…” is more natural.

ElevenLabs - Antoni
ElevenLabs - Arnold
ElevenLabs - Bella
ElevenLabs - Callum
ElevenLabs - Charlie
ElevenLabs - Charlotte
ElevenLabs - Clyde
ElevenLabs - Daniel
ElevenLabs - Dave
ElevenLabs - Domi
ElevenLabs - Dorothy
ElevenLabs - Elli
ElevenLabs - Emily
ElevenLabs - Ethan
ElevenLabs - Fin

Fin is particularly expressive. There’s movement in both pitch and volume, the pauses are at the right places, and you can even hear breaths. There’s some artifacting in the audio, but overall, this is fantastic.

ElevenLabs - Freya
ElevenLabs - Gigi
ElevenLabs - Giovanni

As with Fin, there’s lots of expression in Giovanni’s performance.

ElevenLabs - Glinda
ElevenLabs - Grace

There’s much variation in style and quality among the ElevenLabs voices—Grace is kinda reading this like a laundry list.

ElevenLabs - Harry
ElevenLabs - James
ElevenLabs - Jeremy
ElevenLabs - Jessie
ElevenLabs - Joseph
ElevenLabs - Josh
ElevenLabs - Liam
ElevenLabs - Matilda
ElevenLabs - Matthew

I love the deep resonance of the Michael voice. I’d listen to an audiobook of this.

ElevenLabs - Michael
ElevenLabs - Mimi
ElevenLabs - Nicole

There’s some noise here, but I like that Nicole’s doing a whispered ASMR thing. It shows ElevenLabs’s range.

ElevenLabs - Patrick
ElevenLabs - Rachel
ElevenLabs - Sam
ElevenLabs - Serena

Serena sounds extremely overdriven; I’d consider this one straight up unusable.

ElevenLabs - Thomas

Not all of the voices are great (Josh, Rachel, and Grace are hamfistedly reading from a script, and Freya is too harsh), and that comma after "Welcome" still trips some of them up. But they all seem to have different personalities. Fin is giving it everything he's got—there's a TON of expression in there. Same with Giovanni.

The pricing structure is different than with Polly or OpenAI. For the above, I’m using their $22/mo subscription, which allows me to generate 100k characters in a month. That’s $0.00022 per character, about 14x as expensive as Polly or OpenAI. And it’s slower, too: it took around 2s to generate each of those. But to my ears, it’s the highest-quality, most expressive of the bunch.
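To show where that "about 14x" comes from, here's the arithmetic as a small Python sketch using only the prices quoted in this post. The ElevenLabs figure assumes you use the full 100k-character monthly allotment; if you use less, the effective per-character price goes up.

```python
# Per-character prices, taken from the figures quoted in this post.
PRICES_PER_CHAR = {
    "Amazon Polly (neural)": 16 / 1_000_000,    # $16 per million characters
    "OpenAI TTS": 0.015 / 1_000,                # $0.015 per 1k characters
    # Assumption: the whole 100k-character allotment gets used each month.
    "ElevenLabs ($22/mo plan)": 22 / 100_000,   # $22 for 100k characters
}


def cost_ratio(service, baseline):
    """How many times more expensive `service` is than `baseline`, per character."""
    return PRICES_PER_CHAR[service] / PRICES_PER_CHAR[baseline]


for name, price in PRICES_PER_CHAR.items():
    print(f"{name}: ${price * 1_000_000:.2f} per million characters")

for baseline in ("Amazon Polly (neural)", "OpenAI TTS"):
    ratio = cost_ratio("ElevenLabs ($22/mo plan)", baseline)
    print(f"ElevenLabs vs {baseline}: {ratio:.2f}x")
```

The two ratios land a little under and a little over 14, which is why I'm rounding to "about 14x" for both.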

And One More…

ElevenLabs - Sports Announcer

ElevenLabs allows you to roll the dice and create a voice at random. To me, this one sounds like Phil Hartman as a Midwestern sports announcer attempting an English accent.
