Voice prompting

Comprehensive guide to voice prompt engineering for LMNT's latest models.

The right voice prompt is the difference between generated speech that is "just ok" and "wow". This living reference covers the elements you should think about and control for when crafting your voice prompts.

Find some examples of speech online that give the feeling you're looking for, and use them as a reference as you walk through this guide.


Acoustic basics

Environment

Are there background noises? Is there music? Low frequency hums from household appliances?

It's best to pull your voice prompt from speech recorded in an environment that matches the one you want the model to produce. If that's not possible, you can try to clean the recording up in post-processing.
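If you suspect appliance or mains hum in a candidate recording, one rough check is to measure how much energy sits at 50 Hz or 60 Hz, the two mains frequencies in common use. The sketch below uses the Goertzel algorithm in plain Python. `tone_amplitude` is a hypothetical helper name, and the code assumes mono samples already decoded to floats between -1 and 1.

```python
import math

def tone_amplitude(samples, sample_rate, freq_hz):
    """Estimate the amplitude of a single frequency with the Goertzel algorithm.

    Assumes `samples` is a list of floats in [-1, 1] (mono).
    """
    coeff = 2 * math.cos(2 * math.pi * freq_hz / sample_rate)
    s1 = s2 = 0.0
    for x in samples:
        s0 = x + coeff * s1 - s2
        s2, s1 = s1, s0
    power = s1 * s1 + s2 * s2 - coeff * s1 * s2
    # A sine of amplitude A yields |X| = A * N / 2, so rescale to recover A.
    return 2 * math.sqrt(max(power, 0.0)) / len(samples)

# Synthetic one-second example: a 440 Hz "voice" tone plus a faint 60 Hz hum.
rate = 8000
signal = [
    0.5 * math.sin(2 * math.pi * 440 * i / rate)
    + 0.1 * math.sin(2 * math.pi * 60 * i / rate)
    for i in range(rate)
]
for mains in (50, 60):
    print(mains, "Hz:", round(tone_amplitude(signal, rate, mains), 3))
```

A clearly non-zero reading at one of the two frequencies suggests hum worth filtering out, or a reason to pick a different recording.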

Room size and shape

You can hear the room size and shape in an audio recording. A bathroom sounds different from a closet, which sounds different from a cathedral.

This is the room's impulse response: the pattern of reflections it adds to sound made inside. Hard surfaces bounce audio back, soft ones absorb it, and your ear stitches the reflections into a sense of space.
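As a rough illustration of the idea, a room's effect on a sound can be modeled as a convolution of the dry sound with the room's impulse response. The sketch below is plain Python, and the two impulse responses are invented toy values rather than measurements of real rooms.

```python
def convolve(dry, impulse_response):
    """Direct convolution: each input sample triggers a scaled copy of the impulse response."""
    out = [0.0] * (len(dry) + len(impulse_response) - 1)
    for i, x in enumerate(dry):
        for j, h in enumerate(impulse_response):
            out[i + j] += x * h
    return out

# Toy impulse responses (illustrative values only).
closet = [1.0, 0.1]                               # soft surfaces: one faint, early reflection
cathedral = [1.0, 0.0, 0.0, 0.6, 0.0, 0.0, 0.4]   # hard surfaces: strong, late reflections

clap = [1.0, 0.0, 0.0]  # a single sharp sound

print(convolve(clap, closet))     # [1.0, 0.1, 0.0, 0.0]
print(convolve(clap, cathedral))  # the clap followed by two loud echoes
```

The same dry clap comes out with a different tail in each room, which is the cue your ear uses to judge the space.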

If you want the model to produce speech with a studio sound, prompt with a studio-like recording. If you want the model to produce speech that sounds like it's in a hall, prompt with a recording in a hall.

A blanket fort is an easy way to approximate an acoustically-treated room if you don't have access to a professional recording studio.

Clipping

Clipping is what happens when sound exceeds what the microphone can capture: the peaks get sliced off, and the result sounds harsh, crackly, and blown out. Think a weather reporter shouting into a hurricane. Or deep fried memes.

When this happens, the information in the sliced-off peaks is gone, and no amount of post-processing will bring it back. If you don't want clipping in the generated speech, you need a different recording.
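One way to screen a candidate recording for clipping is to look for flat runs of samples stuck at full scale. This is a minimal sketch in plain Python. `clipped_fraction`, the tolerance, and the minimum run length are assumptions chosen for illustration, and the samples are assumed to be floats between -1 and 1.

```python
import math

def clipped_fraction(samples, full_scale=1.0, tolerance=0.999, min_run=3):
    """Fraction of samples that sit in flat runs at (or very near) full scale.

    A run of `min_run` or more consecutive samples whose magnitude is at least
    tolerance * full_scale is treated as a clipped peak.
    """
    threshold = tolerance * full_scale
    clipped = 0
    run = 0
    for s in samples:
        if abs(s) >= threshold:
            run += 1
        else:
            if run >= min_run:
                clipped += run
            run = 0
    if run >= min_run:  # a run that reaches the end of the signal
        clipped += run
    return clipped / len(samples) if samples else 0.0

# Synthetic example: a clean tone, and the same tone overdriven and sliced at full scale.
clean = [0.5 * math.sin(2 * math.pi * 440 * n / 48000) for n in range(4800)]
loud = [max(-1.0, min(1.0, 4 * s)) for s in clean]

print(clipped_fraction(clean))  # 0.0
print(clipped_fraction(loud))   # a large fraction of the signal is flattened
```

Anything noticeably above zero is a hint to listen closely, since a few legitimately loud samples can also reach full scale.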

Sample rate

A recording's sample rate caps the frequencies it can carry. There's math behind it, but the main thing to remember is that the maximum recordable frequency (the Nyquist frequency) is half the sample rate.

This has implications for your voice prompts:

  • Human speech generally goes up to ~10kHz. This means you need at least a 20kHz sample rate to capture the higher frequency details of sounds like 's' and 'f'.
  • If your voice prompt was recorded with a lower sample rate like 8kHz or 16kHz, it'll come out sounding muffled, thin, and tinny.

The model will accept a voice prompt with any sample rate, but you'll get best results with 24kHz or higher.
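To check what a file declares before using it as a voice prompt, you can read the header. The sketch below assumes the prompt is a WAV file and uses Python's standard `wave` module. `describe_sample_rate` is a hypothetical helper, and the 24 kHz constant simply mirrors the recommendation above.

```python
import wave

RECOMMENDED_RATE_HZ = 24_000  # mirrors the recommendation in this guide

def describe_sample_rate(path):
    """Read a WAV header and report the declared rate and the frequency ceiling it implies."""
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
    return {
        "sample_rate_hz": rate,
        "max_frequency_hz": rate / 2,  # half the sample rate
        "meets_recommendation": rate >= RECOMMENDED_RATE_HZ,
    }
```

For a 16 kHz file this reports a ceiling of 8 kHz, which is below the ~10kHz that speech can reach.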

Be careful about relying on an audio file's reported sample rate: the audio inside may have originally been recorded at a lower sample rate and upsampled later. When that happens, you'll see a wall in the spectrogram, a hard cutoff above which the higher frequencies are simply missing.
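If you'd rather check this numerically than by eye, you can estimate the highest frequency that actually carries energy and compare it with half the declared sample rate. This is a rough sketch in plain Python using a naive DFT on one short frame. The function names and the -60 dB floor are assumptions for illustration, and a real tool would average over many frames.

```python
import cmath
import math

def magnitude_spectrum(frame):
    """Naive DFT magnitudes for bins 0..N/2 (fine for short frames, slow for long ones)."""
    n = len(frame)
    # A Hann window reduces spectral leakage so the cutoff is easier to locate.
    windowed = [
        x * (0.5 - 0.5 * math.cos(2 * math.pi * i / (n - 1)))
        for i, x in enumerate(frame)
    ]
    return [
        abs(sum(x * cmath.exp(-2j * math.pi * k * i / n) for i, x in enumerate(windowed)))
        for k in range(n // 2 + 1)
    ]

def effective_bandwidth_hz(frame, sample_rate, floor_db=-60.0):
    """Highest frequency whose magnitude is within floor_db of the spectral peak."""
    mags = magnitude_spectrum(frame)
    peak = max(mags)
    if peak == 0:
        return 0.0
    threshold = peak * 10 ** (floor_db / 20)
    highest_bin = max(k for k, m in enumerate(mags) if m >= threshold)
    return highest_bin * sample_rate / len(frame)

# Synthetic stand-in for audio recorded at a low rate and later upsampled:
# the header says 48 kHz, but nothing above 3.5 kHz is present.
declared_rate = 48_000
frame = [
    sum(math.sin(2 * math.pi * f * i / declared_rate) for f in (500, 1500, 3500))
    for i in range(512)
]
print(effective_bandwidth_hz(frame, declared_rate))  # far below the 24 kHz the header implies
```

A result well under half the declared sample rate suggests the file was upsampled, so treat it as a lower-rate recording.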


Vocal basics

When deciding on your voice prompt, it's important to think about who's going to be listening to your generated speech, and what feeling you want to inspire in them. Here are some elements to think about.

Age

Do you want a young voice? An old voice?

Gender

Do you want a female voice? A male voice? A non-binary voice?

Accents

Where is the speaker from? Which language(s) do they speak natively?

Texture and tone

Is it a valley girl with vocal fry? Is it a weathered rural voice in a political ad that implies "you can trust me, this politician is one of you too"?

Emotions

What emotions do you need?

Distinct brand voice vs localized voices

If you're generating speech in multiple languages, do you want a distinct voice shared across all languages? Or do you want different local voices for each language?


Prosody basics

Prosody is the musical component of speech, including rhythm, stress, intonation, and tempo. The model uses the prosody of the speech in your voice prompt (in addition to your text prompt) to help determine the prosody in the speech it generates.

Spontaneous vs read speech

The most obvious prosodic contrast, and one you're already familiar with, is spontaneous speech versus read speech.

Read speech

Read speech is what you hear in audiobooks, scripted ads, and news broadcasts — someone reading words off a page. They can look ahead and know where they're going, so the pacing is even, the intonation predictable, and there are no ums and uhs. It sounds polished and performed.

Prompt with read speech when you want narration, voiceover, or anything that should feel composed and authoritative.

Spontaneous speech

Spontaneous speech is what you hear in podcasts, interviews, and ordinary conversation. Pacing is uneven, intonation more dynamic, and the signs of thinking-on-the-fly show up: ums, restarts, breaths, and hesitations.

Prompt with spontaneous speech when you want the output to feel conversational — a voice assistant, a character, a friend talking.

If your voice prompt has disfluencies like "um" and "uhh" in it, the model is much more likely to automatically add disfluencies even if the text prompt does not explicitly include them.

In a recording session and want spontaneous speech? Don't give a script. Have a conversation around a topic instead.

Use voice prompts from your use case

Use the voice prompt to show the model speech that feels right for your use case. The model uses the text prompt as context to figure out the prosody you want, but the easier you make it for the model to know exactly what you want, the better your results will be.

Some examples:

  • If you're building a customer support agent, use speech from a real support interaction.
  • If you're creating voiceovers for ads, use speech that feels like a voiceover from an ad.
  • If you're creating a dramatic narrator, use speech from a dramatic narration.

Close your eyes, imagine your use case, and listen to your voice prompt. If you immediately feel the association, you probably have a good prompt.

