Ichiro Lambe

Push Button, Generate Audiobook: The AI Storyteller

Pulp Sci-Fi for Bedtime

I like reading pulp sci-fi before I sleep. There's a Ballmer Peak equivalent of interestingness I enjoy for that—a certain middling level of sophistication for bedtime stories. If they're too poorly-written, I'm taken out of the story. If they're too interesting, they prevent me from falling asleep.

So, every so often, I'll find an author that suits, and I'll pick up all of their works until I run out, then I'll have to find someone else. Rinse, repeat. What I really want is a Bach Faucet of pulp sci-fi to help me nod off. An audiobook would be even better! So, an experiment:

Can I create a tool where I can push a button and get a decent sci-fi short story audiobook out?

Let’s find out!

ChatGPT is a Naturally Bad Writer

If you're reading this, you probably already know that (as of December 2023), ChatGPT is generally a poor storyteller. If you ask it, point-blank, to write you a short story, it'll produce something trite and saccharine, with an AI smell to it. It even knows that it writes junk. I asked it to write a story and then critique it, and it was pretty spot-on:

Lack of Originality and Depth in Concept: Failing to provide a unique or fresh perspective on familiar sci-fi themes, and not exploring the deeper implications or ethical questions raised by the story's premise.

Underdeveloped Characters: Creating characters, including protagonists and supporting roles, that lack depth, personality, or relatability, making them forgettable or mere plot devices rather than integral parts of the story.

Weak World-Building: Constructing a generic or poorly detailed setting that lacks imagination and specificity, failing to immerse the reader in a believable and intriguing world.

Predictable Plot and Uneven Pacing: Crafting a story that is predictable, lacks conflict or tension, and does not effectively manage the pacing to keep the reader engaged or invested in the outcome.

Heavy-Handed Theming and Lack of Subtlety: Over-explaining or overtly stating themes and messages, thereby undermining the story’s ability to provoke thought or allow readers to draw their own conclusions. This often includes telling rather than showing, which can diminish the reader's experience and engagement.

ChatGPT knows it can’t write well!

So, how can we use its own knowledge to improve it?

Making ChatGPT a Better Storyteller

I ultimately want to use GPT-4 via the API, but it's often handy to prototype with ChatGPT first. It's well-known that these tools produce better output when you first ask them to think through the whole process before embarking on the actual creation.

So, for any X thing you want it to generate, you can improve its output by doing this:

  1. Ask it how to structure an X in general.

  2. Ask it to outline the specific X you're looking for.

  3. Have it use that info to then flesh out the X.

For a short story that's good enough to fall asleep to, I might first ask it this:

Please outline for me the tenets of an award-winning sci-fi short story.

I could provide the rules, myself, but I’m a lazy human, and ChatGPT already knows the answer. It’s not even important that I actually read the results—I just want it to generate a set of steps it can refer back to. LLMs know a ton of stuff, but often need a nudge to bring this into their working memory. Following that, I ask it to ideate:

Can you please list for me about 10 INCREASINGLY INTERESTING ideas for a story in that vein? Make each one MORE INTERESTING AND UNUSUAL than the previous. AVOID OVERUSED SCI-FI TROPES.

The first idea might be a dud, but by the 10th, it may have generated something interesting. Since it's often better at identifying good and bad things than writing them, I can ask it to pick one of the ideas (again, without my direct involvement) and outline a story based on that:

Pick the most interesting of those, then outline a short sci-fi story that will be in 5 parts, according to all the tenets you've come up with so far. Put a twist into part 5, and make sure it's earned and transformative.

And then I just ask it to create each part:

Please write Part 1. This should be about 1,000 words long. Don't include titles or headers.

Rinse, repeat, and it generates a story.

Again, if you're familiar with this stuff, you’ll notice there's room for optimization in there—instead of asking it to "outline for me the tenets of an award-winning sci-fi short story," I could provide the rules explicitly. The neat thing about this approach, though, is that I can specify a new genre ("outline for me the tenets of a rollicking 19th century adventure"), and it’ll build everything without my having to do anything else.

What Narrator?

I often use Amazon Polly, Eleven Labs, and OpenAI's text-to-speech. I did a deep dive into all of their voices in an earlier post (50 AI Voices: Which Reigns Supreme?), but TL;DR:

Amazon Polly is quick and affordable, but lacks expressive depth:

Eleven Labs offers richer voices and a bunch of emotion. Here are some of my favorites:

However, it tends to be significantly more expensive than Polly.

In true Goldilocks fashion, OpenAI's Fable voice balances natural expressiveness, speed, and cost:

I’m not sure what OpenAI was going for here, but the Fable voice strikes me as a robot Daniel Radcliffe, and I’m down with that.

Push Button, Automate Everything

So, let's automate everything! GPT-4 for rules/writing, OpenAI TTS for narration, and Python to connect it all. Since this is a prototype, we're just going to simulate the back-and-forth of the above ChatGPT conversation by glomming user inputs and GPT responses together into an increasingly large prompt katamari. This is suboptimal (for speed, cost, and context length reasons), but it's a quick way to start and debug things. Building up the conversation is pretty straightforward, with something like this each time:

messages += [{"role": "user", "content": prompt}]
response = gpt_4_generate(messages, temperature=1.0, cache=True)
messages += [{"role": "assistant", "content": response}]
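The `gpt_4_generate` helper above isn't shown. As a sketch, a disk-caching wrapper for it might look like this; the hash key and cache layout are my assumptions, and the real API call is injected as `call_fn`:

```python
import hashlib
import json
import os

def cached_generate(messages, call_fn, temperature=1.0, cache_dir=".gpt_cache"):
    """Return a cached response for this exact request if we've seen it before;
    otherwise call the model and store the result on disk."""
    os.makedirs(cache_dir, exist_ok=True)
    # Key the cache on the full conversation plus the temperature.
    key = hashlib.sha256(
        json.dumps({"messages": messages, "temperature": temperature},
                   sort_keys=True).encode()
    ).hexdigest()
    path = os.path.join(cache_dir, key + ".json")
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)["response"]
    response = call_fn(messages, temperature)  # the real OpenAI API call goes here
    with open(path, "w") as f:
        json.dump({"response": response}, f)
    return response
```

If a run dies at part 4 of 5, rerunning replays the earlier parts from disk instead of paying for them again.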

A higher temperature (ideally) yields more varied results, and I typically wrap GPT-4 generation in a function to cache responses if something goes wrong mid-process. The whole thing is then just a bunch of canned queries:

  1. Hello! I hope you are doing well. Please outline for me the tenets of an award-winning {STORY_TYPE}.

  2. Fabulous. Next, can you outline what's great about a {STORY_TYPE} with a {TWIST_TYPE}?

  3. Great, thank you. Please outline for me the structure of a {ACTS_DESCRIPTION} short story.

  4. Perfect. Can you please list for me about 10 INCREASINGLY INTERESTING ideas for a story about {STORY_SUBJECT} in that vein? Make each one MORE INTERESTING AND UNUSUAL than the previous. AVOID OVERUSED {STORY_TYPE} TROPES.

  5. Fantastic! Pick one of the 3 last ones of those, then please outline a {STORY_TYPE} that will be about {STORY_SUBJECT}, in 5 parts, according to all the tenets you've come up with so far. Put a {TWIST_TYPE} into part 5, and make sure it's earned and transformative.

  6. Okay. Next, please come up with a title for this. Make it distinct and interesting. Please just reply with the title and nothing else.

  7. Thank you. Please write Part 1. This should be about 1,000 words long. Don't include titles or headers.

  8. Thank you. Please write Part 2. Don't include titles or headers.

  9. Thank you. Please write Part 3. Don't include titles or headers.

  10. Thank you. Please write Part 4. Don't include titles or headers.

  11. Thank you! Please conclude with Part 5. Make this succinct and impactful so that it doesn't feel drawn-out. Don't include titles or headers.

There's been research suggesting that speaking to GPT-4 nicely yields better results, ergo the "please"s and "thank-you"s above. I’d like to do some testing to make sure I’m not cargo-cult prompt engineering, but it’ll do for the prototype.
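Those canned queries boil down to a list of templates with placeholders. A minimal sketch, abridged to three of the eleven queries and with lowercased placeholder names:

```python
# Abridged set of the canned queries; {story_type} / {story_subject} get filled in.
PROMPTS = [
    "Hello! I hope you are doing well. Please outline for me the tenets "
    "of an award-winning {story_type}.",
    "Perfect. Can you please list for me about 10 INCREASINGLY INTERESTING "
    "ideas for a story about {story_subject} in that vein?",
    "Thank you. Please write Part 1. This should be about 1,000 words long. "
    "Don't include titles or headers.",
]

def fill_prompts(story_type, story_subject):
    """Substitute the story parameters into each canned query."""
    return [p.format(story_type=story_type, story_subject=story_subject)
            for p in PROMPTS]
```

Driving the conversation is then just looping over the filled-in prompts and appending each response, as in the snippet earlier.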

The nice thing about this canned conversation is that I can simply change something like STORY_TYPE (which will typically be a genre), and it'll automatically come up with a structure for that on its own, then follow that structure to generate the narrative. TTS APIs are incredibly straightforward these days (using tts-1-hd and fable with OpenAI):

from openai import OpenAI

with open("my_credentials_here.txt", "rt") as f:
    client = OpenAI(api_key=f.read())

response = client.audio.speech.create(
    model="tts-1-hd",
    voice="fable",
    input=whatever
)
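One practical wrinkle: the speech endpoint caps input length (4,096 characters at the time of writing), so a several-thousand-word story has to be narrated in chunks and the audio stitched together. A rough sentence-boundary splitter, as one possible approach:

```python
def chunk_text(text, limit=4096):
    """Split text into chunks under `limit` characters, breaking at sentence ends."""
    chunks, current = [], ""
    for sentence in text.replace("\n", " ").split(". "):
        candidate = (current + ". " if current else "") + sentence
        if len(candidate) <= limit:
            current = candidate
        else:
            if current:
                chunks.append(current)
            current = sentence
    if current:
        chunks.append(current)
    return chunks
```

Each chunk then gets its own `speech.create` call, and the resulting audio files are concatenated in order.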

On average, generating a complete story currently costs approximately $0.45, and narration adds another $0.25, for a grand total of...

70 cents for a narrated audiobook short

The Results

After a few minutes (and seventy cents), the tool creates this:

This is perfect! While it isn’t a great narrative delivered by an expert VO actor, both the writing and the narration are miles better than what was possible even a year ago.

And for my purpose, that’s just fine—I just want something to help me get to sleep.

About the Author

Ichiro Lambe is a 30-year gaming industry veteran. He co-founded the studio that evolved into Sony Online Denver, and founded award-winning indie Dejobaan Games.

While moonlighting at Valve, Ichiro helped establish Steam Labs to help bridge the gap between thousands of games and the millions of gamers who would love them.

He has just secured funding for a new (stealth-mode) startup in games and generative AI.


50 AI Voices: Which Reigns Supreme?

Text-to-speech (TTS) tech is getting crazy good, and I’m on a mission to compare every voice available to me. Let’s start with three I’ve been using lately:

  • Amazon Polly, historically a quick, cheap workhorse for informational speech.

  • ElevenLabs, which has led on quality and expressiveness.

  • OpenAI Text-to-Speech, six brand-spanking-new voices from the creators of ChatGPT!

Here’s as exhaustive a survey of all* the voices for all three services, posted here because someone suggested I stick ‘em all in one place (this one’s for you, Ethan!). Scroll on down if you just want to hear the samples.

(*Amazon Polly neural voices, ElevenLabs’s base voices, and OpenAI’s 6 new ones.)

First: A Bit of History

Text-to-speech (TTS) tech for consumers has come a long way. I remember playing with early attempts—the TI 99/4A had an add-on hardware module that offered pretty basic, robotic speech:

With the Alien Voice Box, the Atari 800 offered an improvement:

Flash forward a handful of decades, and by the 2010s, TTS voices had improved a ton. Here's the original Apple Siri commercial:

As a game developer who’s done a lot of work with procedural/dynamic narrative, I’m excited about the prospect of characters giving performances that won’t break the player out of the moment. Here in the far future of 2023, AI/ML TTS offers more varied voices and more natural, expressive speech:

While not perfect, that’s not too darned shabby for a computer. This was ElevenLabs, fed with straight text—no information about delivery or nothin’.

Three TTS Services

I picked these services because they’re reasonably-priced, robust, and have APIs for programmatic speech generation. They’re all speaking the same phrase:

Welcome, weary traveler, to the tender embrace of tonight's moonlight serenade, a soothing lullaby caressed by the whispers of distant crickets. The veil of the night is drawn, and the world is drenched in an iridescent silver hue. The moon, a radiant orb, hangs in the sapphire sky, casting shadows and illusions while weaving a tapestry of tranquility.

That’s a kinda-nonsense ASMR, courtesy of ChatGPT-4.

First: Amazon Polly Neural

Amazon Polly is inexpensive at $16 per million characters, and it’s quick—each of the following took less than a second to return an OGG. I’m exclusively using their “Neural” voices, which they claim are their best-quality ones:

The voice stumbles on the intro (“Welcome… Weary traveler…”), but is otherwise not too bad. Here are the rest of their English language Neural voices:

I like Polly. It’s quick—it took less than a half second round-trip for each of those. It’s historically been one of the cheaper options, at $0.000016/character input. And the voice isn’t super expressive, but it’s solid enough to get a point across.

I’d use Amazon Polly whenever I wanted to generate a ton of informational text quickly.

Second: OpenAI’s Six New Voices

Hot off the presses: OpenAI, creators of ChatGPT, have released their first batch of a half dozen voices:

They’ve priced this at $0.015 per 1k characters. To my ears, these sound pretty good. We're still getting a weird pause after "Welcome," but each of those has a solid cadence, and Fable even does a passable Daniel Radcliffe.

They still strike me as being most appropriate for narrative/information—here’s their Fable voice with the line that ElevenLabs Elli did, above:

Fable sounds like he’s reading a script about being furious, but is not actually very angry at all.

Third: ElevenLabs

ElevenLabs is my pick for the most expressive of the three, with the most natural/interesting delivery and the greatest breadth of voices:

Right off the bat, the initial pause in “Welcome, weary traveler…” is more natural.

Fin is particularly expressive. There’s movement in both pitch and volume, the pauses are at the right places, and you can even hear breaths. There’s some artifacting in the audio, but overall, this is fantastic.

As with Fin, there’s lots of expression in Giovanni’s performance.

There’s much variation in style and quality among the ElevenLabs voices—Grace is kinda reading this like a laundry list.

I love the deep resonance of the Michael voice. I’d listen to an audiobook of this.

There’s some noise here, but I like that Nicole’s doing a whispered ASMR thing. It shows ElevenLabs’s range.

Serena sounds extremely overdriven; I’d consider this one straight up unusable.

Not all of the voices are great (Josh, Rachel, and Grace are hamfistedly reading from a script, and Freya is too harsh), and that comma after "Welcome" still trips some of them up. But they all seem to have different personalities. Fin is giving it everything he's got—there's a TON of expression in there. Same with Giovanni.

The pricing structure is different than with Polly or OpenAI—for the above, I’m using their $22/mo subscription, which allows me to generate 100k characters in a month. That’s $0.00022 per character—about 14x as expensive as Polly or OpenAI. And it’s slower, too—it took around 2s to generate each of those. But to my ears, it’s the highest-quality, most expressive of the bunch.
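For the curious, the back-of-the-envelope math behind that comparison, using the prices quoted in this post:

```python
# Price per character for each service, from the figures quoted in this post.
PRICE_PER_CHAR = {
    "polly_neural": 16 / 1_000_000,   # $16 per million characters
    "openai_tts": 0.015 / 1_000,      # $0.015 per 1k characters
    "elevenlabs": 22 / 100_000,       # $22/mo subscription covering 100k characters
}

def cost_usd(service, num_chars):
    """Cost of synthesizing num_chars of text with the given service."""
    return PRICE_PER_CHAR[service] * num_chars
```

The ElevenLabs-to-Polly ratio comes out to 13.75x, hence the "about 14x" figure.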

And One More…

ElevenLabs allows you to roll the dice and create a voice at random. To me, this one sounds like Phil Hartman as a Midwestern sports announcer attempting an English accent.


Rapping About What’s On Steam

“Computer, please write and perform a rap about Obra Dinn.”

Some of you may recall my MicroTrailers bot, which cuts a 6-second trailer for every Steam game (and later became part of Steam proper during my tenure at Valve doing Steam Labs). Since then, AI’s found its way into all sorts of places—we have AI-generated food, AI-generated daycare, and entire AI-generated naked mole rat families—but we don’t have a system we can tell to rap about an arbitrary game, and I think this is an oversight.

Take Game, Produce Rap

What you see above is the Totally Human Rapper. I wanted it to be entirely autonomous, so it has two requirements:

1. I tell it a game.

2. It creates a whole rap video about that game.

It will research the game, gather media for it, write lyrics, and produce a finished product cut to the beat.

Step 1: The Auto-Researcher

GPT-3’s pretty capable. It knows about a lot of games, and you can ask it about a popular one that launched a few years ago:

Tell me about Deep Rock Galactic.

Deep Rock Galactic is a sci-fi first-person shooter game developed by Ghost Ship Games. It is set in an alien-infested, procedurally generated underground environment. Players take on the role of a Dwarf space miner, exploring the various environments while fighting off alien creatures…

However, here’s what happens if you ask it about a lesser-known one:

Tell me about Luck be a Landlord.

Luck be a Landlord is a strategic property trading card game that puts players in the role of a real estate investor. Players compete to buy, sell, and manage properties to build the most valuable portfolio.

Is this true?

This is a complete fabrication. You know the drill: expect GPT-3 to lie to you. Since I actually want this rapper to entertain and inform, it all needs to be accurate. One way to handle this would be to collect the information and feed it into GPT-3, but I’m lazy.

“Computer, go research it, yourself!”

Steam Store pages have just about all the information we need for starters:

  • Short Description: The text blurb at the upper-right of the page; usually around 50 words long.

  • About This Game: The longer description which does a deeper-dive into the game.

  • Factual Info: System requirements, number of players, and so forth.

  • Professional Reviews: Typically pull quotes from websites and streamers.

  • Player Reviews: Text of the player reviews sorted by “most helpful.”

They have a Web API which yields tidy JSON info about a game, which is great—except we need more info than is stuffed in there, so we use Python and BeautifulSoup to scrape the actual HTML page as well.

If you’ve ever used BS4, you know how great it is. Go pick up Leonard Richardson’s sci-fi novel that has nothing to do with BS4.
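For the JSON side, Steam's storefront exposes an appdetails endpoint that's undocumented but widely used. A sketch of fetching it and flattening the fields we care about; the response shape (keyed by appid string, with a success flag) is how I've seen it behave, not a guarantee:

```python
import json
from urllib.request import urlopen

def fetch_appdetails(appid):
    """Fetch the storefront's JSON blob for one game (network call)."""
    url = f"https://store.steampowered.com/api/appdetails?appids={appid}"
    with urlopen(url) as resp:
        return json.load(resp)

def extract_game_info(payload, appid):
    """Flatten the fields the lyric prompt needs; None if the lookup failed."""
    entry = payload[str(appid)]
    if not entry.get("success"):
        return None
    data = entry["data"]
    return {
        "name": data.get("name", ""),
        "short_description": data.get("short_description", ""),
    }
```

Everything the JSON doesn't carry (player reviews, pull quotes) comes from the BS4 scrape of the store page itself.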

Step 3: Generate the Lyrics

We’ll be sending this to GPT-3 text-davinci-003 (as I write this, OpenAI just released GPT-3.5, so more on that later!) as a single prompt. So, we need to tell our script everything it needs to know about the game and then ask it a single question. The request looks something like this:

At the end of that block of text is the actual request to generate the lyrics:

Write me a rhyming rap about this game.
Exclude verse and chorus headers.
Keep each line to about 8-10 words.

The script calls out to GPT-3’s API, which generates this:

Destiny two, it's something new,

A world of adventure, me and you,

Gun and sword, battle hordes,

Guardians of the Last City, of course,

Titans, Warlocks, Hunters, too,

Fight off the darkness and save the crew,

Exotic weapons, armor galore,

No other game has ever done more,

Rain devastation, clear the sky,

Write your legend in the stars, you and I.

That’s not bad for a computer!

  • It understood info about the game, and was able to pull out the important bits.

  • Line length was good—it can get lengthy if you don’t rein it in a bit.

  • It did rhyming couplets pretty well. “Hordes” and “of course” is a stretch, but I’ll take it.

There’s sometimes a bunch of cruft in GPT-3’s response (extra spacing or non-lyrical words like “Here’s Verse 1”), but we can clean that up procedurally.
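That cleanup pass is only a few lines. A sketch of the kind of filter I'd use; the exact header patterns are my guess at what GPT-3 tends to emit:

```python
import re

def clean_lyrics(raw):
    """Drop blank lines and non-lyrical headers like "Verse 1:" or "Here's Verse 1"."""
    header = re.compile(
        r"^\s*(\[?(verse|chorus|bridge|intro|outro)\b.*|here'?s.*)\s*:?\s*$",
        re.IGNORECASE,
    )
    lines = []
    for line in raw.splitlines():
        line = line.strip()
        if not line or header.match(line):
            continue
        lines.append(line)
    return lines
```

The output is one clean lyric line per list entry, ready to hand to the TTS step.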

Step 4: Rap It!

I used Uberduck for TTS. I wasn’t able to get their API working, so I did it the ugly way, using Selenium to automate a browser: logging in, hitting buttons, and downloading the audio files as though a user were sitting there hitting keystrokes. Here’s what the first line sounds like:

Step 5: Add the Music

For this exercise, I just stacked a bunch of loops for the backing track, easy peasy. The Totally Human Rapper then does this:

  1. Chops that track into measures.

  2. Aligns each TTS rap line to the beginning of a measure.

  3. Composites it.

The backing track sounds like this:

Step 6: Slice and Dice Video

For visuals, it snags the first trailer on the game’s Steam Store page as source material. It lines the video cuts up to the beat, using a similar approach to what I did for MicroTrailers, like so:

  1. Clip off the beginning and end of the trailer video. Typically this contains ESRB notices or URL title cards which aren’t super exciting for a rap video.

  2. Get the length of a song’s measure—call it about 2.5s for a 90BPM track—and the number of lines we’re rapping (10, from above).

  3. Snag 2.5s snippets of clips from the source video.

  4. Shuffle those snippets and composite them into a single video.
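The measure math in steps 2 and 3 is simple. A sketch, assuming four beats to the bar:

```python
def measure_length_seconds(bpm, beats_per_measure=4):
    """Length of one measure at the given tempo (e.g. 96 BPM -> 2.5 s)."""
    return beats_per_measure * 60.0 / bpm

def snippet_starts(num_lines, bpm, beats_per_measure=4):
    """Start time of each rap line, one line per measure."""
    m = measure_length_seconds(bpm, beats_per_measure)
    return [i * m for i in range(num_lines)]
```

Each 10-line rap then maps to ten measure-length video snippets, aligned to those start times.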

Step 7: Put it All Together!

Backing track + TTS rap + composited video = finished AI-generated rap! That sounds like this:

Totally Human Rapper - Destiny 2

One More for Good Measure

Totally Human Rapper - Retired Men’s Nude Beach Volleyball

Retired Men's Nude Beach Volleyball League

A silly game, with a title so unique

Play as Len, an older gentleman

Navigate life, one serve at a time

A narrative based game, set in MA

Beautiful locations, you'll have to explore

Challenging and fun, you'll want more

It's a hidden gem, a passionate project

Narratives and themes, that you won't forget

Retired Men's Nude Beach Volleyball League

A game you'll love, a game you'll need

Listen to All the Raps So Far

That’s it! The goal’s to do one for every game on Steam, though that’ll take a while.

My next generative musical attempt will be to generate a melody and sing it. Have you heard Synthesizer V sing? It’s pretty good.

Anyway, I’d love to hear what other folks are doing in the area, so if you want to chat, I’ll be here all week.


AI Concepting: From 3 Hours to 30 Minutes, Thanks to Scenario.gg

I've been using Scenario.gg for a bit now, and it's been great for creating environment design mockups I can directly translate into 3D environments. It’s brought a 3-hour process down to 30 minutes.

Example for AaaaaAAaaaAAAaaAAAAaAAAAA!!!, our upcoming skydiving game:

These are all screenshots of Aaaaa! from levels I created by hand in Unity. I fine-tuned Scenario on these, then generated a bunch of images based on those:

The initial output was good, but a bit samey, and often hewed pretty close to the source material. So, I mixed it up a bit, in part by using different keywords:

“Daytime,” “Aztec,” and so forth. That yielded a greater variety in output. I then snagged one image that looked promising:

It, itself, was pretty close to one of my original screenshots (a contentious topic in AI art, but in this case, I created the level myself), but it was enough of a variation to be useful.

I then dove into Unity and created this. It’s a WiP, but only took me 30 minutes (minus the time waiting for training and image synthesis, during which I played Polytopia a bunch).

That 30 minutes is key here. Over the past few years, I’ve been dabbling in environment design, as I’ve always wanted to see what I could do. It takes me a lot of time to create stuff like this (and that’s even with using off-the-shelf decals):

It takes me that long, in part because I’m still a relative noob when it comes to this stuff; but also because I tend to do my concepting in-engine.

As a result, to try something, I have to drop into Substance Designer and Photoshop for textures, model a bunch of stuff, set up the lighting, move it around, and so forth. And if it doesn’t look good, then I trash it all and try again.

With these ML tools, I can just remix my screenshots and get a usable concept. That’s 3 hours down to 30 minutes for me, so cheers to Emmanuel and his team. :)
