Streaming speech synthesis
Stream text to our servers and receive synthesized speech in real-time.
Our streaming WebSocket API endpoint: wss://api.lmnt.com/v1/ai/speech/stream
Overview
Our ultra-low latency, full-duplex streaming WebSocket API is ideal for applications like voice assistants and chatbots that need to be snappy and/or don’t have all the text upfront. Chat in our Playground is an example of streaming from an LLM.
Protocol
Our streaming endpoint uses a bidirectional protocol that sends both text and binary data. The protocol is based on the WebSocket Protocol and WebSocket API.
First message
The first message sent to the server must be a JSON object with the following fields:
- Your API key; get it from your account page.
- The voice id of the voice to use for synthesis. Voice ids can be retrieved by a call to List voices.
- The desired output audio format. One of:
  - mp3: 96kbps MP3 audio. This format is useful for applications that need to play the audio directly to the user.
  - raw: 16-bit little-endian linear PCM audio. This format is useful for applications that need to process the audio further, such as adding effects or mixing multiple audio streams.
  - ulaw: 8-bit G.711 µ-law audio with a WAV header. This format is most useful for telephony applications.
- The desired language of the synthesized speech, as a two-letter ISO 639-1 code. One of de, en, es, fr, pt, zh, ko, hi. Only works with some system voices.
- Set this to true to generate conversational-style speech rather than reading-style speech.
- The desired output audio sample rate. One of:
  - 24000: 24kHz audio. This sample rate is useful for applications that need high-quality audio.
  - 16000: 16kHz audio. This sample rate is useful for applications that need to save bandwidth.
  - 8000: 8kHz audio. This sample rate is most useful for telephony applications and µ-law encoding.
- A float between 0.25 (slow) and 2.0 (fast) that controls the talking speed of the synthesized speech. A value of 1.0 is normal speed.
- Controls whether the server will return extra information about the synthesis. This information includes durations, buffer_empty, and warnings. See the Receiving Extras section for more information.
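As a concrete sketch, a complete first message might look like the following. Apart from return_extras, the key names shown here are illustrative placeholders; check the API reference for the exact field names.

```json
{
  "X-API-Key": "YOUR_API_KEY",
  "voice": "YOUR_VOICE_ID",
  "format": "mp3",
  "language": "en",
  "conversational": false,
  "sample_rate": 24000,
  "speed": 1.0,
  "return_extras": false
}
```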
Sending text
After the first message, you can send text to the server as a JSON object:
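For illustration, a text message could look like the sketch below; the field name text is an assumption here, so confirm the exact key in the API reference.

```json
{ "text": "Hello! How can I help you today?" }
```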
The text you send can be split at any point. For example, sending a full sentence in a single message is semantically equivalent to sending the same sentence split across two messages, as sketched below.
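Using the same assumed text field as above, this single message:

```json
{ "text": "One to two pounds of fresh raspberries." }
```

is equivalent to sending these two messages one after the other:

```json
{ "text": "One to two pounds of fre" }
{ "text": "sh raspberries." }
```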
Flushing (trigger synthesis)
If you want to force the server to synthesize the text that it has without closing the connection, you can send a JSON object with the flush field set to true:
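```json
{ "flush": true }
```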
You will be notified when the server has finished synthesizing all the text that it has by the buffer_empty field in the extra information. See the Receiving Extras section for more information.
Be careful when using flush. Our models are designed to factor in context when synthesizing audio. When flushing the buffer at arbitrary points, your speech may sound less natural.
The last message
If you are done sending text to the server, you can close the connection by sending a JSON object with the eof field set to true. This will cause the server to synthesize all the text it has and then close the connection.
Receiving audio
Once the server has received enough text, it will respond with chunks of 96kbps mono MP3 audio with a sampling rate of 24kHz. As more text is streamed to the server, it will continue to send more audio chunks. The audio chunks are sent as binary data.

You can use the flush field to force the server to synthesize the text it has.

Receiving extras
If you set the return_extras field to true in the first message, the server will also send extra information about each synthesized chunk. This information is sent as a serialized JSON object (string) and will be sent before its corresponding audio chunk. The extra information includes:
An array of objects that detail the duration of each text token in the synthesized chunk. This information is useful for applications that need to synchronize the synthesized audio with the text that was sent to the server. The format of each object is described below.
The durations array resets its start time for each chunk of audio.
Indicates whether the server has finished synthesizing all the text that it has received. This is useful for applications that want to know when the server has finished synthesizing all the text that it has without closing the connection.
Contains any warnings that the server has encountered during synthesis, such as exceeding the number of free characters.
Here is an example of the extra information that the server sends:
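The sketch below illustrates the general shape of an extras message. The field names inside each durations entry (shown here as text, start, and duration, in seconds) and the exact type of warnings are assumptions for illustration; check the API reference for the precise format.

```json
{
  "durations": [
    { "text": "Hello", "start": 0.0, "duration": 0.35 },
    { "text": ",", "start": 0.35, "duration": 0.05 },
    { "text": " world", "start": 0.4, "duration": 0.45 }
  ],
  "buffer_empty": false,
  "warnings": ""
}
```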
Note that the extra data JSON is always sent before the audio chunk that it corresponds to. Take care to interpret incoming data correctly. Audio is sent as bytes and extra data is sent as a string.
Errors
If the server encounters an error, it will send a JSON object with a descriptive message in a string field: error.
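For example, an error message might look like this (the message text is illustrative):

```json
{ "error": "unsupported audio format" }
```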
The server will then close the connection.
Examples
Take a look at our Python SDK source code and Node SDK source code.