Building with LMNT

Using the Speech API

Practical patterns for using the Speech API effectively.

LMNT offers two primary surfaces for building with our models, each suited to different use cases.

|  | Speech API | Speech Sessions API |
| --- | --- | --- |
| What it is | Turn text into speech | Stream text in from an LLM, realtime speech out |
| Best for | Content with preproduced text like voiceovers, localization, narration, audiobooks, etc. | Turning your favorite LLM into a realtime voice agent, and keeping your voices consistent as you upgrade LLMs |
| Learn more | Speech API docs | Speech Sessions API docs |

This guide covers common patterns for working with the Speech API, including speech generation, getting word timestamps, and making the generated speech sound more human. For complete API specifications, see the Speech API reference.

Basic speech generation

from lmnt import Lmnt

# The client reads your API key from the LMNT_API_KEY environment
# variable by default; you can also pass api_key= explicitly.
client = Lmnt()

with client.speech.with_streaming_response.generate(
    text='hello world.',
    voice='leah',
) as response:
    response.stream_to_file('hello.mp3')

Speech generation with word timestamps

The Speech API allows you to get exact word timestamps when you need to sync your generated speech with subtitles, lip movement, or other modalities.

import base64

from lmnt import Lmnt

client = Lmnt()

response = client.speech.generate_detailed(
    text='hello world.',
    voice='leah',
    format='mp3',
    return_durations=True,
)

# The audio is returned base64-encoded; decode it before writing to disk.
with open('hello.mp3', 'wb') as f:
    f.write(base64.b64decode(response.audio))

for d in response.durations:
    print(f'{d.start:.3f}s  {d.text!r}')
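For subtitle use cases, the word timings above can be grouped into cues. Below is a minimal sketch that emits SRT-formatted cues from `(start, text)` pairs (e.g. `[(d.start, d.text) for d in response.durations]`); it uses only the `start` and `text` fields shown above, and the cue-grouping parameters (`per_cue`, `tail`) are illustrative choices, not part of the API:

```python
def srt_timestamp(seconds: float) -> str:
    """Format seconds as an SRT timestamp: HH:MM:SS,mmm."""
    ms = int(round(seconds * 1000))
    h, rem = divmod(ms, 3_600_000)
    m, rem = divmod(rem, 60_000)
    s, ms = divmod(rem, 1000)
    return f'{h:02}:{m:02}:{s:02},{ms:03}'


def words_to_srt(words, per_cue=7, tail=0.5):
    """Group (start_seconds, text) word timings into SRT cues of
    `per_cue` words each. A cue ends where the next cue begins; the
    final cue is padded by `tail` seconds after its last word starts."""
    cues = []
    for i in range(0, len(words), per_cue):
        chunk = words[i:i + per_cue]
        start = chunk[0][0]
        if i + per_cue < len(words):
            end = words[i + per_cue][0]
        else:
            end = chunk[-1][0] + tail
        text = ' '.join(w for _, w in chunk)
        cues.append(f'{len(cues) + 1}\n'
                    f'{srt_timestamp(start)} --> {srt_timestamp(end)}\n'
                    f'{text}\n')
    return '\n'.join(cues)
```

Ending each cue at the next cue's start keeps subtitles on screen with no gaps, without assuming any per-word duration field beyond the `start` offsets.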

Generating conversational speech

Conversational speech (talking to someone) feels quite different from read speech (audiobooks), and your text prompt has a big impact on the speech generated by the model.

See our text prompting guide to learn more.
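As an illustration, the same message can be written in a read style or a conversational style; fillers, contractions, and pause punctuation tend to cue a more natural spoken delivery. The specific wording here is only an example, not a prescribed format:

```python
# Read style: clean, complete sentences, as you'd write for an audiobook.
read_style = "The meeting has been rescheduled to Thursday at three o'clock."

# Conversational style: fillers, contractions, and pauses suggest
# natural, spoken delivery.
conversational_style = "So, um... quick heads up: the meeting's moved. Thursday, 3pm."

# Either string would be passed as the `text` parameter, e.g.:
# client.speech.generate(text=conversational_style, voice='leah')
```

Experiment with both styles against your voice of choice; the difference in delivery is usually audible immediately.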