Building with LMNT

Using the Speech Sessions API

Practical patterns for building realtime speech experiences using your favorite LLM + LMNT.

LMNT offers two primary surfaces to build with our models, each suited for different use cases.

|            | Speech API | Speech Sessions API |
| ---------- | ---------- | ------------------- |
| What it is | Turn text into speech | Stream text in from an LLM, realtime speech out |
| Best for   | Content with preproduced text like voiceovers, localization, narration, audiobooks, etc. | Turning your favorite LLM into a realtime voice agent, and keeping your voices consistent as you upgrade LLMs |
| Learn more | Speech API docs | Speech Sessions API docs |

This guide covers common patterns for working with the Speech Sessions API, including streaming text from your LLM & streaming speech out, when to flush, handling user interruptions, and getting your LLM to produce conversational text. For complete API specifications, see the Speech Sessions API reference.

Basic text streaming in and speech streaming out

import asyncio
from anthropic import AsyncAnthropic
from lmnt import AsyncLmnt
 
DEFAULT_PROMPT = 'Read me an excerpt of a short sci-fi story in the public domain.'
VOICE_ID = 'elowen'
 
async def main():
  client = AsyncLmnt()
  connection = await client.speech.sessions.create(voice=VOICE_ID)
  t1 = asyncio.create_task(reader_task(connection))
  t2 = asyncio.create_task(writer_task(connection))
  await asyncio.gather(t1, t2)
 
 
async def reader_task(connection):
  """Streams audio data from LMNT and writes it to `output.mp3`."""
  with open('output.mp3', 'wb') as f:
    async for message in connection:
      f.write(message.audio)
 
 
async def writer_task(connection):
  """Streams text from Claude to LMNT."""
  client = AsyncAnthropic()
  async with client.messages.stream(
      model='claude-sonnet-4-6',
      max_tokens=1024,
      messages=[{'role': 'user', 'content': DEFAULT_PROMPT}],
  ) as stream:
    async for text in stream.text_stream:
      await connection.append_text(text)
      print(text, end='', flush=True)

  # After `finish` is called, the server will close the connection
  # when it has finished synthesizing.
  await connection.finish()
 
 
asyncio.run(main())

Flushing when your LLM has finished a turn

As you stream text in, your speech session buffers a small amount of text while it waits for enough context to generate natural-sounding speech.

Call flush the moment your LLM finishes streaming text for a turn. The speech session immediately synthesizes any remaining buffered text, and the connection stays open, ready for the next turn.

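# `llm_stream` and `extract_text` stand in for your LLM provider's
# streaming response and its text extraction.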
async for chunk in llm_stream:
    text = extract_text(chunk)
    if text:
        await connection.append_text(text)
 
await connection.flush()

If you forget to call flush (or finish), the last bit of text the LLM produced will sit in the buffer indefinitely.

Be careful about when you send flush. Flushing at arbitrary points instead of at the end of a turn can make your speech sound less natural.
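
In a multi-turn agent you can keep a single speech session open for the whole conversation: stream each LLM response in, flush at the end of each turn, and only call finish once the conversation is over. Here is a rough sketch, assuming a hypothetical get_user_input() helper that returns the next user message (or None when the conversation ends):

async def conversation_task(connection):
  """Reuses one speech session across multiple LLM turns."""
  client = AsyncAnthropic()
  history = []
  while True:
    user_text = await get_user_input()  # hypothetical helper
    if user_text is None:
      break
    history.append({'role': 'user', 'content': user_text})
    async with client.messages.stream(
        model='claude-sonnet-4-6',
        max_tokens=1024,
        messages=history,
    ) as stream:
      reply = []
      async for text in stream.text_stream:
        reply.append(text)
        await connection.append_text(text)
      history.append({'role': 'assistant', 'content': ''.join(reply)})
    # End of turn: flush so the buffered text is spoken now,
    # but keep the connection open for the next turn.
    await connection.flush()
  # Conversation over: synthesize anything left and close the connection.
  await connection.finish()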

Getting your LLM to sound conversational

LLMs default to formal, structured responses that sound robotic when spoken aloud.

Prompting them with explicit guidance has the biggest impact on how natural your speech sounds: tell the model its response will be spoken aloud, that contractions and filler words belong, and that bulleted lists and headers don't.
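
For example, you might pass that guidance to Claude as a system prompt and stream the reply straight into your speech session. A minimal sketch follows; the prompt wording is only an illustration, not LMNT's recommended template:

from anthropic import AsyncAnthropic

# Illustrative guidance only; see the prompting guide for a fuller template.
SPOKEN_STYLE_PROMPT = (
    'Your reply will be spoken aloud by a text-to-speech system. '
    'Write the way people talk: use contractions and the occasional filler '
    'word, keep sentences short, and avoid bullet points, numbered lists, '
    'headers, and markdown.'
)

async def speak_reply(connection, user_text):
  """Streams a conversational reply to `user_text` into the speech session."""
  client = AsyncAnthropic()
  async with client.messages.stream(
      model='claude-sonnet-4-6',
      max_tokens=1024,
      system=SPOKEN_STYLE_PROMPT,
      messages=[{'role': 'user', 'content': user_text}],
  ) as stream:
    async for text in stream.text_stream:
      await connection.append_text(text)
  await connection.flush()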

See our LLM prompting guide for a detailed breakdown and a copy-pasteable prompt template to use as a starting point.