Building with LMNT

Using the Speech API

Practical patterns for using the Speech API effectively.

LMNT offers two primary surfaces for building with our models, each suited to different use cases.

|  | Speech API | Speech Sessions API |
| --- | --- | --- |
| What it is | Turn text into speech | Stream text in from an LLM, realtime speech out |
| Best for | Content with preproduced text like voiceovers, localization, narration, audiobooks, etc. | Turning your favorite LLM into a realtime voice agent, and keeping your voices consistent as you upgrade LLMs |
| Learn more | Speech API docs | Speech Sessions API docs |

This guide covers common patterns for working with the Speech API, including speech generation, getting word timestamps, and making the generated speech sound more human. For complete API specifications, see the Speech API reference.

Basic speech generation

from lmnt import Lmnt

# The client reads your API key from the LMNT_API_KEY environment
# variable by default; you can also pass api_key= explicitly.
client = Lmnt()

with client.speech.with_streaming_response.generate(
    text='hello world.',
    voice='leah',
) as response:
    response.stream_to_file('hello.mp3')

Speech generation with word timestamps

The Speech API allows you to get exact word timestamps when you need to sync your generated speech with subtitles, lip movement, or other modalities.

import base64

from lmnt import Lmnt

client = Lmnt()

response = client.speech.generate_detailed(
    text='hello world.',
    voice='leah',
    format='mp3',
    return_durations=True,
)

# The audio is returned base64-encoded; decode it before writing to disk.
with open('hello.mp3', 'wb') as f:
    f.write(base64.b64decode(response.audio))

for d in response.durations:
    print(f'{d.start:.3f}s  {d.text!r}')
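For subtitle use cases, the word timings above can be grouped into cues. Below is a minimal sketch that emits SRT-formatted cues from `(start, text)` pairs (e.g. `[(d.start, d.text) for d in response.durations]`); it uses only the `start` and `text` fields shown above, and the cue-grouping parameters (`per_cue`, `tail`) are illustrative choices, not part of the API:

```python
def srt_timestamp(seconds: float) -> str:
    """Format seconds as an SRT timestamp: HH:MM:SS,mmm."""
    ms = int(round(seconds * 1000))
    h, rem = divmod(ms, 3_600_000)
    m, rem = divmod(rem, 60_000)
    s, ms = divmod(rem, 1000)
    return f'{h:02}:{m:02}:{s:02},{ms:03}'


def words_to_srt(words, per_cue=7, tail=0.5):
    """Group (start_seconds, text) word timings into SRT cues of
    `per_cue` words each. A cue ends where the next cue begins; the
    final cue is padded by `tail` seconds after its last word starts."""
    cues = []
    for i in range(0, len(words), per_cue):
        chunk = words[i:i + per_cue]
        start = chunk[0][0]
        if i + per_cue < len(words):
            end = words[i + per_cue][0]
        else:
            end = chunk[-1][0] + tail
        text = ' '.join(w for _, w in chunk)
        cues.append(f'{len(cues) + 1}\n'
                    f'{srt_timestamp(start)} --> {srt_timestamp(end)}\n'
                    f'{text}\n')
    return '\n'.join(cues)
```

Ending each cue at the next cue's start keeps subtitles on screen with no gaps, without assuming any per-word duration field beyond the `start` offsets.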

Generating conversational speech

Conversational speech (talking to someone) feels quite different from read speech (audiobooks), and your text prompt has a big impact on the speech generated by the model.

See our text prompting guide to learn more.
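As an illustration, the same message can be written in a read style or a conversational style; fillers, contractions, and pause punctuation tend to cue a more natural spoken delivery. The specific wording here is only an example, not a prescribed format:

```python
# Read style: clean, complete sentences, as you'd write for an audiobook.
read_style = "The meeting has been rescheduled to Thursday at three o'clock."

# Conversational style: fillers, contractions, and pauses suggest
# natural, spoken delivery.
conversational_style = "So, um... quick heads up: the meeting's moved. Thursday, 3pm."

# Either string would be passed as the `text` parameter, e.g.:
# client.speech.generate(text=conversational_style, voice='leah')
```

Experiment with both styles against your voice of choice; the difference in delivery is usually audible immediately.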