Our streaming WebSocket API endpoint: wss://api.lmnt.com/v1/ai/speech/stream

Overview

Our ultra-low latency, full-duplex streaming WebSocket API is ideal for applications like voice assistants and chatbots that need to be snappy and/or don’t have all the text upfront. Chat in our Playground is an example of streaming from an LLM.

Protocol

Our streaming endpoint uses a bidirectional protocol that exchanges both text and binary messages. The protocol is based on the WebSocket Protocol and WebSocket API.
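
For illustration throughout this page, here is a minimal connection sketch using Python's asyncio and the third-party websockets package; our SDKs (see Examples below) wrap this protocol for you:

import asyncio
import websockets

STREAMING_URL = "wss://api.lmnt.com/v1/ai/speech/stream"

async def main():
    # One full-duplex connection: JSON text messages go to the server,
    # JSON text and binary audio messages come back on the same socket.
    async with websockets.connect(STREAMING_URL) as ws:
        ...  # send the first message, stream text, and receive audio (below)

asyncio.run(main())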

First message

The first message sent to the server must be a JSON object with the following fields:

X-API-Key
string
required

Your API key; get it from your account page.

voice
string
required

The voice id of the voice to use for synthesis. Voice ids can be retrieved by a call to List voices.

format
string
default: "mp3"

The desired output audio format. One of:

  • mp3: 96kbps MP3 audio. This format is useful for applications that need to play the audio directly to the user.
  • raw: 16-bit little-endian linear PCM audio. This format is useful for applications that need to process the audio further, such as adding effects or mixing multiple audio streams.
  • ulaw: 8-bit G711 µ-law audio with a WAV header. This format is most useful for telephony applications.

sample_rate
integer
default: "24000"

The desired output audio sample rate. One of:

  • 24000: 24kHz audio. This sample rate is useful for applications that need high-quality audio.
  • 16000: 16kHz audio. This sample rate is useful for applications that need to save bandwidth.
  • 8000: 8kHz audio. This sample rate is most useful for telephony applications and µ-law encoding.

speed
float
default: "1.0"

A float between 0.25 (slow) and 2.0 (fast) that controls the talking speed of the synthesized speech. A value of 1.0 is normal speed.

return_extras
boolean
default: "false"

Controls whether the server will return extra information about the synthesis. This information includes durations, buffer_empty, and warning. See the Receiving Extras section for more information.

{
    "X-API-Key": "<LMNT_API_KEY>",
    "voice": "curtis",
    "format": "mp3",
    "sample_rate": 24000,
    "speed": 1.0,
    "return_extras": true,
}
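
In code, the first message is just a JSON text message sent as soon as the connection opens. A minimal sketch (send_handshake is an illustrative helper; only X-API-Key and voice are required):

import json

async def send_handshake(ws, api_key, voice="curtis"):
    # ws is an open connection to the streaming endpoint.
    await ws.send(json.dumps({"X-API-Key": api_key, "voice": voice}))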

Sending text

After the first message, you can send text to the server as a JSON object:

{"text": "Hello, world!"}

The text you send can be split at any point. For example, sending:

{"text": "This is a test of the emergency broadcast system"}

is semantically equivalent to sending these two messages:

{"text": "This is a test of the eme"}
{"text": "rgency broadcast system"}

Flushing (trigger synthesis)

If you want to force the server to synthesize the text that it has without closing the connection, you can send a JSON object with the flush field set to true:

{"flush": true}

The buffer_empty field in the extra information tells you when the server has finished synthesizing all the text it has received. See the Receiving Extras section for more information.

Be careful when using flush. Our models factor in surrounding context when synthesizing audio, so flushing the buffer at arbitrary points may make the speech sound less natural.
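
A natural place to flush is a boundary you already know is complete, such as the end of a chatbot reply. A sketch (speak_reply is an illustrative helper):

import json

async def speak_reply(ws, reply_text):
    # Send one complete reply, then flush so the server synthesizes it
    # right away instead of waiting for more text.
    await ws.send(json.dumps({"text": reply_text}))
    await ws.send(json.dumps({"flush": True}))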

The last message

When you are done sending text, send a JSON object with the eof field set to true. The server will synthesize all the remaining text it has and then close the connection.

{"eof": true}

Receiving audio

Once the server has received enough text, it will respond with chunks of audio in the format and sample rate you requested in the first message (by default, 96kbps mono MP3 at 24kHz). As more text is streamed to the server, it will continue to send more audio chunks. The audio chunks are sent as binary data.

To produce the most natural-sounding speech, our API waits for roughly two full sentences to be sent before synthesizing audio. If you want to receive audio before this threshold is met, you can use the flush field to force the server to synthesize the text it has.
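
Here is a minimal end-to-end sketch of the receiving side with extras disabled: it sends one block of text, signals eof, and writes the binary chunks it receives to an MP3 file (the sample text and file name are illustrative):

import asyncio
import json
import websockets

STREAMING_URL = "wss://api.lmnt.com/v1/ai/speech/stream"

async def main():
    async with websockets.connect(STREAMING_URL) as ws:
        await ws.send(json.dumps({"X-API-Key": "<LMNT_API_KEY>", "voice": "curtis"}))
        await ws.send(json.dumps({"text": "This is a test of the streaming speech API."}))
        await ws.send(json.dumps({"eof": True}))  # no more text is coming

        with open("output.mp3", "wb") as f:
            # The iteration ends when the server closes the connection
            # after synthesizing everything it received before eof.
            async for message in ws:
                if isinstance(message, bytes):   # binary audio chunk
                    f.write(message)
                else:                            # JSON text, e.g. an error
                    print(message)

asyncio.run(main())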

Receiving extras

If you set the return_extras field to true in the first message, the server will also send extra information about each synthesized chunk. This information is sent as a serialized JSON object (a string), immediately before its corresponding audio chunk. The extra information includes:

durations
array of duration objects

An array of objects that detail the duration of each text token in the synthesized chunk. This information is useful for applications that need to synchronize the synthesized audio with the text that was sent to the server. Each object gives the token's text, its start time in seconds, and its duration in seconds, as shown in the example below.

The durations array resets its start time for each chunk of audio.

buffer_empty
boolean

Indicates whether the server has finished synthesizing all the text it has received. This is useful for applications that want to know when synthesis has caught up with the text sent so far, without closing the connection.

warning
string

Contains any warnings that the server has encountered during synthesis, such as exceeding the number of free characters.

Here is an example of the extra information that the server sends:

{
  "durations": [
    {
      "text": "",
      "start": 0,
      "duration": 0.2
    },
    {
      "text": "Using",
      "start": 0.2,
      "duration": 0.4
    },
    {
      "text": " ",
      "start": 0.6,
      "duration": 0.025
    },
    {
      "text": "LMNT",
      "start": 0.625,
      "duration": 0.425
    },
    {
      "text": "",
      "start": 1.05,
      "duration": 0.025
    }
    ...
  ],
  "buffer_empty": false,
  "warning": "string"
}

Note that the extra data JSON is always sent before the audio chunk that it corresponds to. Take care to interpret incoming data correctly. Audio is sent as bytes and extra data is sent as a string.
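
A sketch of a receive loop that keeps the two apart (receive_loop and audio_file are illustrative; audio_file is any writable binary file object):

import json

async def receive_loop(ws, audio_file):
    # Binary messages are audio; text messages are extras (or errors).
    async for message in ws:
        if isinstance(message, bytes):
            audio_file.write(message)
        else:
            extras = json.loads(message)
            # extras.get("durations") holds the per-token timing objects
            if extras.get("buffer_empty"):
                print("server has caught up with the text sent so far")
            if "warning" in extras:
                print("warning:", extras["warning"])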

Errors

If the server encounters an error, it will send a JSON object with a descriptive message in its error field:

{"error": "string"}

The server will then close the connection.
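
In a receive loop like the ones sketched above, an error therefore arrives as one final JSON text message before the socket closes. A sketch of surfacing it (recv_or_raise is an illustrative helper):

import json

async def recv_or_raise(ws):
    # Return the next message, raising if the server reported an error.
    # After an error the server closes the connection, so later receives
    # will fail with the client library's connection-closed exception.
    message = await ws.recv()
    if isinstance(message, str):
        payload = json.loads(message)
        if "error" in payload:
            raise RuntimeError(payload["error"])
    return message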

Examples

Take a look at our Python SDK source code and Node SDK source code.
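
For a rough picture of how the pieces fit together outside the SDKs, here is an end-to-end sketch in Python using asyncio and the third-party websockets package (the URL, field names, and message flow are as described above; the helper names, sample text, and output file are illustrative):

import asyncio
import json
import websockets

STREAMING_URL = "wss://api.lmnt.com/v1/ai/speech/stream"
API_KEY = "<LMNT_API_KEY>"  # from your account page

async def writer(ws, chunks):
    # First message, then text fragments, then eof to finish and close.
    await ws.send(json.dumps({
        "X-API-Key": API_KEY,
        "voice": "curtis",
        "format": "mp3",
        "sample_rate": 24000,
        "return_extras": True,
    }))
    for chunk in chunks:
        await ws.send(json.dumps({"text": chunk}))
    await ws.send(json.dumps({"eof": True}))

async def reader(ws, path):
    # Audio arrives as bytes; extras and errors arrive as JSON text.
    with open(path, "wb") as f:
        async for message in ws:
            if isinstance(message, bytes):
                f.write(message)
            else:
                extras = json.loads(message)
                if "error" in extras:
                    raise RuntimeError(extras["error"])

async def main():
    async with websockets.connect(STREAMING_URL) as ws:
        await asyncio.gather(
            writer(ws, ["This is a test of ", "the emergency ", "broadcast system."]),
            reader(ws, "output.mp3"),
        )

asyncio.run(main())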