Optimizing latency
Latency is the time delay between a system receiving an input and producing an output (i.e., lower latency means faster responses). Here are some tips for optimizing the latency you experience when using our model:
Handle audio chunk by chunk
Generating speech returns a stream of audio chunks. You can handle these chunks as they become available, or you can wait for the entire stream to finish before processing the audio.
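A minimal sketch of chunk-by-chunk handling with asyncio. The `synthesize_stream` generator here is a stand-in for the real API call, which is not shown in this document; the point is the `async for` loop, which lets you start playback as soon as the first chunk arrives rather than after the whole stream finishes.

```python
import asyncio

async def synthesize_stream():
    # Stand-in for the real API: yields audio chunks as they are generated.
    for chunk in (b"\x00\x01", b"\x02\x03", b"\x04\x05"):
        await asyncio.sleep(0)  # simulate waiting on the network
        yield chunk

async def main():
    received = []
    # Handle each chunk as soon as it arrives instead of waiting
    # for the entire stream to finish.
    async for chunk in synthesize_stream():
        received.append(chunk)  # e.g., write to an audio output device
    return b"".join(received)

audio = asyncio.run(main())
```

Waiting for the full stream is simpler, but the time to first audible sound grows with the length of the utterance; processing chunks as they arrive keeps that delay roughly constant.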
Use the real-time speech session API
See the example here
Use an SDK
Use one of our SDKs to create a speech session. Our SDKs are designed to handle the low-level details of the speech session API, and are optimized for low latency.
Use raw format
Use the raw format. It’s the fastest format we offer and returns 16-bit PCM (little-endian) audio at 24 kHz.
Use async tasks
Use asynchronous tasks to stream data concurrently. See the Speech session example above for a reference implementation.
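One common shape for this is a producer/consumer pair on an `asyncio.Queue`: one task receives chunks while another plays them, so playback and network I/O overlap. This is a generic sketch, not the SDK's own implementation; the producer here is a stand-in for reading from the API.

```python
import asyncio

async def produce(queue):
    # Stand-in producer: in practice this would read chunks from the API.
    for i in range(3):
        await queue.put(f"chunk-{i}".encode())
    await queue.put(None)  # sentinel: stream finished

async def consume(queue, sink):
    # Runs concurrently with the producer, so playback can begin
    # before the full stream has arrived.
    while (chunk := await queue.get()) is not None:
        sink.append(chunk)  # e.g., feed an audio output device

async def main():
    queue, sink = asyncio.Queue(), []
    await asyncio.gather(produce(queue), consume(queue, sink))
    return sink

chunks = asyncio.run(main())
```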
Keep an open connection
If you want to synthesize speech in chunks (e.g., for a chatbot), keep the connection open and send the chunks as they become available.
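The win here is avoiding a fresh connection handshake per chunk. The `SpeechSession` class below is hypothetical (the real session object comes from the SDK); it only illustrates the pattern of opening once and appending text fragments as a chatbot produces them.

```python
import asyncio

class SpeechSession:
    # Hypothetical stand-in for a persistent speech session;
    # the real object comes from the SDK and holds an open connection.
    def __init__(self):
        self.sent = []

    async def append_text(self, text):
        # Send a text fragment over the already-open connection —
        # no new handshake for each chunk.
        self.sent.append(text)

async def main():
    session = SpeechSession()  # open once, reuse for the whole conversation
    for fragment in ["Hello, ", "how can I ", "help you today?"]:
        await session.append_text(fragment)  # send as each fragment is ready
    return session.sent

sent = asyncio.run(main())
```

Reconnecting per chunk would add a round trip (or several, with TLS) before any audio for that chunk could begin.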
Use servers in the U.S.
Our API and GPU servers are located in the United States. Although we support streaming worldwide, users in the U.S. are likely to experience the lowest latency. If you have any specific geographic constraints, reach out to hello@lmnt.com.