Word timestamps

LMNT's models return word timestamps, so you can sync subtitles, lip movement, and other modalities with your generated speech.

If you're producing video content, you often want to show subtitles.

Use LMNT to get exact timing for precisely the words being spoken, instead of relying on external subtitle providers that guess at the words and may confuse similar-sounding ones.
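As a rough illustration of the subtitle use case, the sketch below groups word timestamps into SubRip (SRT) cues. It assumes each entry exposes `text`, `start`, and `duration` (in seconds), matching the fields printed in the examples on this page. The `Word` class, `srt_time`, `to_srt`, and the `max_words` grouping rule are hypothetical helpers for illustration, not part of the LMNT SDK. It also assumes entries with blank text can be skipped and that words should be joined with single spaces, which may need adjusting for how punctuation is reported.

```python
from dataclasses import dataclass


@dataclass
class Word:
  """Hypothetical stand-in for one timestamp entry (times in seconds)."""
  text: str
  start: float
  duration: float


def srt_time(seconds: float) -> str:
  """Format seconds as an SRT timecode, HH:MM:SS,mmm."""
  ms = round(seconds * 1000)
  hours, ms = divmod(ms, 3_600_000)
  minutes, ms = divmod(ms, 60_000)
  secs, ms = divmod(ms, 1000)
  return f'{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}'


def to_srt(words, max_words: int = 6) -> str:
  """Group words into cues of at most max_words and render them as SRT."""
  # Assumption: entries without visible text carry nothing to display.
  words = [w for w in words if w.text.strip()]
  cues = []
  for i in range(0, len(words), max_words):
    group = words[i:i + max_words]
    start = group[0].start
    end = group[-1].start + group[-1].duration
    text = ' '.join(w.text.strip() for w in group)
    cues.append(f'{len(cues) + 1}\n{srt_time(start)} --> {srt_time(end)}\n{text}\n')
  # Cues are separated by a blank line.
  return '\n'.join(cues)
```

Grouping by a fixed word count keeps the sketch short; a production subtitle pipeline would more likely break cues at punctuation or at a maximum on-screen duration.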

Getting timestamps with the Speech API

import asyncio

from lmnt import AsyncLmnt

async def main():
  client = AsyncLmnt()
  # Request the speech and its word timestamps in a single call.
  response = await client.speech.generate_detailed(
    text=(
      "Uhh, did you see the weather in Palo Alto tomorrow? "
      "Yeah, can't believe it's gonna rain, dude. Like what?"
    ),
    voice='leah',
    return_timestamps=True,
  )
  # Each entry carries the spoken text, its start time, and its duration in seconds.
  for chunk in response.timestamps or []:
    print(f'"{chunk.text}" starts at {chunk.start:.3f}s and lasts for {chunk.duration:.3f}s')

asyncio.run(main())
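Once you have the timestamps, a common next step for lip sync or word highlighting is looking up which word is being spoken at a given playback time. The following is a minimal sketch of that lookup using a binary search. It assumes the entries are sorted by `start` and expose `text`, `start`, and `duration` as above. `Word` and `word_at` are hypothetical names used only for this illustration.

```python
from bisect import bisect_right
from typing import NamedTuple, Optional, Sequence


class Word(NamedTuple):
  """Hypothetical stand-in for one timestamp entry (times in seconds)."""
  text: str
  start: float
  duration: float


def word_at(words: Sequence[Word], t: float) -> Optional[Word]:
  """Return the word being spoken at playback time t, or None during a gap."""
  starts = [w.start for w in words]
  # Index of the last word whose start time is <= t.
  i = bisect_right(starts, t) - 1
  if i < 0:
    return None
  word = words[i]
  # The word counts as active only until its duration has elapsed.
  return word if t < word.start + word.duration else None
```

If you call this on every animation frame, build the `starts` list once and reuse it rather than recomputing it per lookup.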

Getting timestamps with the Speech Sessions API

In the Speech Sessions API, word timestamps currently arrive later than the speech they describe. The generated speech itself continues to stream to you in real time, so handle timestamp messages independently of audio playback.

import asyncio

from lmnt import AsyncLmnt

async def main():
  client = AsyncLmnt()
  # Open a streaming session with timestamps enabled.
  session = await client.speech.sessions.create(
    voice='leah',
    return_timestamps=True,
  )
  # Send text incrementally, then signal that no more text is coming.
  await session.send_text("Uhh, did you see the weather in Palo Alto tomorrow? ")
  await session.send_text("Yeah, can't believe it's gonna rain, dude. Like what?")
  await session.send_finish()

  # Timestamp messages may arrive after the speech they describe.
  async for message in session:
    if message.type == 'timestamps':
      for chunk in message.timestamps or []:
        print(f'"{chunk.text}" starts at {chunk.start:.3f}s and lasts for {chunk.duration:.3f}s')

asyncio.run(main())
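Because timestamps lag behind the audio, it can help to process the two kinds of message independently rather than waiting for one before handling the other. The sketch below shows one way to structure that loop. To stay runnable without network access it uses `fake_session`, a hypothetical stand-in for a real session. The `'audio'` message type and its `audio` field are assumptions made for illustration; only the `'timestamps'` type and `timestamps` field appear in the example above.

```python
import asyncio
from types import SimpleNamespace


async def fake_session():
  """Hypothetical stand-in for a session: audio first, timestamps later."""
  yield SimpleNamespace(type='audio', audio=b'\x00\x01')
  yield SimpleNamespace(type='audio', audio=b'\x02\x03')
  yield SimpleNamespace(type='timestamps', timestamps=[
    SimpleNamespace(text='Hi', start=0.0, duration=0.3),
  ])


async def consume(session):
  """Collect audio as it arrives and timestamps whenever they show up."""
  audio = bytearray()
  words = []
  async for message in session:
    if message.type == 'audio':
      # Assumed message shape; play or forward the audio here without waiting.
      audio.extend(message.audio)
    elif message.type == 'timestamps':
      words.extend(message.timestamps or [])
  return bytes(audio), words


audio, words = asyncio.run(consume(fake_session()))
print(len(audio), [w.text for w in words])
```

In a real application you would pass the live session object to `consume` in place of `fake_session()` and replace the buffering with playback.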