Model capabilities

Word timestamps

LMNT's models return word timestamps, letting you sync subtitles, lip movement, and other modalities with your generated speech.

If you're producing video content, you often want to show subtitles.

Use LMNT to get exact timing for the words actually being spoken, instead of relying on external subtitle providers that have to guess at transcription and may confuse similar-sounding words.

Getting timestamps with the Speech API

from lmnt import Lmnt
 
client = Lmnt()
 
# Request per-word timing data alongside the synthesized audio.
response = client.speech.generate_detailed(
    text='Hello world.',
    voice='leah',
    return_durations=True,
)
 
# Each duration entry carries the word text, its start time, and its length.
for chunk in response.durations:
    print(f'"{chunk.text}" starts at {chunk.start:.3f}s and lasts for {chunk.duration:.3f}s')

Getting timestamps with the Speech Sessions API

In the Speech Sessions API, word durations currently take longer to arrive than the corresponding audio. The generated speech itself continues to stream to you in real time.

import asyncio
from lmnt import AsyncLmnt
 
async def main():
    client = AsyncLmnt()
    connection = await client.speech.sessions.create(
        voice='leah',
        return_extras=True,
    )
    await connection.append_text('Hello world.')
    await connection.finish()
 
    async for message in connection:
        # Since durations can lag behind the audio, a message may not carry
        # any duration data yet; skip those messages.
        if not message.durations:
            continue
        for chunk in message.durations:
            print(f'"{chunk.text}" starts at {chunk.start:.3f}s and lasts for {chunk.duration:.3f}s')
 
asyncio.run(main())