LMNT offers two primary surfaces for building with our models, each suited to different use cases.
| | Speech API | Speech Sessions API |
|---|---|---|
| What it is | Turn text into speech | Stream text in from an LLM, get realtime speech out |
| Best for | Content with preproduced text: voiceovers, localization, narration, audiobooks, etc. | Turning your favorite LLM into a realtime voice agent, and keeping your voices consistent as you upgrade LLMs |
| Learn more | Speech API docs | Speech Sessions API docs |
This guide covers common patterns for working with the Speech API, including speech generation, getting word timestamps, and making the generated speech sound more human. For complete API specifications, see the Speech API reference.
## Basic speech generation
```python
from lmnt import Lmnt

# Assumes the LMNT_API_KEY environment variable is set.
client = Lmnt()

# Stream the synthesized audio directly to a file.
with client.speech.with_streaming_response.generate(
    text='hello world.',
    voice='leah',
) as response:
    response.stream_to_file('hello.mp3')
```

## Speech generation with word timestamps
The Speech API allows you to get exact word timestamps when you need to sync your generated speech with subtitles, lip movement, or other modalities.
```python
import base64

from lmnt import Lmnt

# Assumes the LMNT_API_KEY environment variable is set.
client = Lmnt()

response = client.speech.generate_detailed(
    text='hello world.',
    voice='leah',
    format='mp3',
    return_durations=True,
)

# The audio is returned base64-encoded alongside the durations.
with open('hello.mp3', 'wb') as f:
    f.write(base64.b64decode(response.audio))

# Each duration entry carries the spoken text and its start time in seconds.
for d in response.durations:
    print(f'{d.start:.3f}s {d.text!r}')
```

## Generating conversational speech
Conversational speech (talking to someone) feels quite different from read speech (audiobooks), and your text prompt has a big impact on the speech the model generates.
See our text prompting guide to learn more.
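As a rough sketch of what this means in practice, the same content can be phrased for "read" or "conversational" delivery and sent through the same streaming endpoint shown in the basic example. The two text strings below are hypothetical illustrations, not prescribed prompt formats:

```python
# Sketch: the same message phrased for "read" vs. "conversational" delivery.
# The phrasing you send is the main lever for how spoken the output feels.

read_style = (
    'Our quarterly results exceeded expectations, '
    'with revenue increasing by twelve percent.'
)

# Conversational text tends to use contractions, fillers, and looser punctuation.
conversational_style = (
    "So, good news - we actually beat expectations this quarter. "
    "Revenue's up about twelve percent, which is, honestly, pretty great."
)

def synthesize(client, text: str, path: str) -> None:
    """Send either phrasing through the same streaming generate call."""
    with client.speech.with_streaming_response.generate(
        text=text,
        voice='leah',
    ) as response:
        response.stream_to_file(path)

# e.g. synthesize(Lmnt(), conversational_style, 'update.mp3')
```

Both strings carry the same information; the second simply reads the way a person would say it out loud, which is what steers the model toward conversational delivery.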