The API can return timing information that lets you synchronize speech with other modalities, such as text or video. This is useful for creating captions or subtitles, or for aligning speech with other media.

Timing information is available for both standard and streaming requests. Each word and punctuation sequence is associated with a start time and a duration, both in seconds, which tell you when that piece of text is spoken and how long it lasts.

Example

This code sample uses the return_durations option to fetch timing information and print its contents.

import asyncio
from lmnt.api import Speech

async def main():
  # Ask the API to include timing information in the response.
  options = { 'return_durations': True }
  async with Speech() as speech:
    synthesis = await speech.synthesize('Hello world.', 'lily', **options)
  # Each chunk carries its text, start time, and duration (in seconds).
  for chunk in synthesis['durations']:
    print(f'"{chunk["text"]}" starts at {chunk["start"]:.3f}s and lasts for {chunk["duration"]:.3f}s')

asyncio.run(main())

It produces output that looks like this:

"" starts at 0.000s and lasts for 0.525s
"Hello" starts at 0.525s and lasts for 0.325s
" " starts at 0.850s and lasts for 0.000s
"world" starts at 0.850s and lasts for 0.400s
"." starts at 1.250s and lasts for 0.375s
"" starts at 1.625s and lasts for 0.013s

The empty-string chunks correspond to the silence before and after the spoken text.
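As a rough illustration of the captioning use case, the sketch below turns a list of timing chunks into word-level WebVTT cues. It does not call the API: SAMPLE_DURATIONS is simply the output shown above, copied in by hand, and the helper names format_timestamp and durations_to_webvtt are hypothetical, not part of the lmnt package. It assumes each chunk has the text, start, and duration fields seen in the example, and it skips silence, whitespace, and punctuation-only chunks.

```python
# Timing chunks copied from the example output above (not fetched from the API).
SAMPLE_DURATIONS = [
    {'text': '', 'start': 0.000, 'duration': 0.525},
    {'text': 'Hello', 'start': 0.525, 'duration': 0.325},
    {'text': ' ', 'start': 0.850, 'duration': 0.000},
    {'text': 'world', 'start': 0.850, 'duration': 0.400},
    {'text': '.', 'start': 1.250, 'duration': 0.375},
    {'text': '', 'start': 1.625, 'duration': 0.013},
]


def format_timestamp(seconds):
    """Format a time in seconds as a WebVTT timestamp (HH:MM:SS.mmm)."""
    millis = round(seconds * 1000)
    hours, millis = divmod(millis, 3_600_000)
    minutes, millis = divmod(millis, 60_000)
    secs, millis = divmod(millis, 1000)
    return f'{hours:02d}:{minutes:02d}:{secs:02d}.{millis:03d}'


def durations_to_webvtt(durations):
    """Build a WebVTT document with one cue per spoken word (hypothetical helper)."""
    lines = ['WEBVTT', '']
    for chunk in durations:
        text = chunk['text']
        # Skip silence, whitespace, and punctuation-only chunks.
        if not any(ch.isalnum() for ch in text):
            continue
        start = chunk['start']
        end = start + chunk['duration']  # end time is start plus duration
        lines.append(f'{format_timestamp(start)} --> {format_timestamp(end)}')
        lines.append(text)
        lines.append('')
    return '\n'.join(lines)


print(durations_to_webvtt(SAMPLE_DURATIONS))
```

For real captions you would probably group several words into one cue rather than emit one cue per word; the per-word form is kept here only because it maps directly onto the chunks.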

Streaming example

This section shows how to fetch timing information with a streaming request. The code below is a minimal example; in practice, you would set up reader and writer tasks to handle the text input and synthesis output concurrently (see our Streaming example).

In streaming requests, the option that returns timing information is called return_extras. Note that this differs from standard requests, where the option is called return_durations.

import asyncio
from lmnt.api import Speech

async def main():
  async with Speech() as speech:
    options = { 'return_extras': True }
    synthesis = await speech.synthesize_streaming('lily', **options)
    await synthesis.append_text('Hello world.')
    await synthesis.finish()
    # Use a separate name for each streamed message so it does not shadow `synthesis`.
    async for message in synthesis:
      for chunk in message['durations']:
        print(f'"{chunk["text"]}" starts at {chunk["start"]:.3f}s and lasts for {chunk["duration"]:.3f}s')

asyncio.run(main())

The output is the same as in the previous example:

"" starts at 0.000s and lasts for 0.525s
"Hello" starts at 0.525s and lasts for 0.325s
" " starts at 0.850s and lasts for 0.000s
"world" starts at 0.850s and lasts for 0.400s
"." starts at 1.250s and lasts for 0.375s
"" starts at 1.625s and lasts for 0.013s
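To align speech with other media, such as highlighting the word currently being spoken, you need to map a playback time back to a chunk. The sketch below is one way to do that, assuming the chunks are sorted by start time as in the output above. It runs on a hand-copied version of that output rather than a live response, and chunk_at is a hypothetical helper, not part of the lmnt package.

```python
from bisect import bisect_right

# Timing chunks copied from the example output above (not fetched from the API).
SAMPLE_DURATIONS = [
    {'text': '', 'start': 0.000, 'duration': 0.525},
    {'text': 'Hello', 'start': 0.525, 'duration': 0.325},
    {'text': ' ', 'start': 0.850, 'duration': 0.000},
    {'text': 'world', 'start': 0.850, 'duration': 0.400},
    {'text': '.', 'start': 1.250, 'duration': 0.375},
    {'text': '', 'start': 1.625, 'duration': 0.013},
]


def chunk_at(durations, t):
    """Return the chunk whose [start, start + duration) interval contains t, or None.

    Assumes `durations` is sorted by start time.
    """
    starts = [chunk['start'] for chunk in durations]
    # Index of the last chunk that starts at or before t. When several chunks
    # share a start time, this picks the last one, which skips zero-length chunks.
    index = bisect_right(starts, t) - 1
    if index < 0:
        return None
    chunk = durations[index]
    if t < chunk['start'] + chunk['duration']:
        return chunk
    return None


for t in (0.2, 0.6, 0.85, 1.0):
    chunk = chunk_at(SAMPLE_DURATIONS, t)
    print(f'{t:.2f}s -> {chunk["text"]!r}')
```

An empty string in the result means the playback position falls in leading or trailing silence, and None means the time lies outside the synthesized audio.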