POST /v1/ai/speech
JavaScript
import Lmnt from 'lmnt-node';

const client = new Lmnt({
  apiKey: 'My API Key',
});

const response = await client.speech.generateDetailed({ text: 'hello world.', voice: 'leah' });

console.log(response.audio);
{
  "audio": "<string>",
  "durations": [
    {
      "text": "<string>",
      "duration": 123,
      "start": 123
    }
  ],
  "seed": 123
}
The output of this POST request is a JSON object from which you must extract and decode the base64-encoded audio data. Here is an example of how to do so in your terminal:

jq -r '.audio' lmnt-output.json | base64 --decode > lmnt-audio-output.mp3

The file format of your audio output depends on the format specified in the initial request (this example assumes format=mp3).
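If you are handling the response in Node.js rather than the shell, the same decoding step can be sketched as below. This is a minimal sketch, not part of the LMNT SDK; it assumes you already have the parsed response object.

```javascript
// Sketch: decode the base64 `audio` field of a parsed response into raw
// audio bytes. The resulting Buffer can be written straight to disk, e.g.
// fs.writeFileSync('lmnt-audio-output.mp3', decodeAudio(response)).
function decodeAudio(response) {
  return Buffer.from(response.audio, 'base64');
}
```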

Authorizations

X-API-Key
string
header
required

Your API key; get it from your LMNT account page.

Body

application/json
voice
string
required

The id of the voice to use; voice ids can be retrieved by calls to List voices or Voice info.

Example:

"leah"

text
string
required

The text to synthesize; max 5000 characters per request (including spaces).

Example:

"hello world."
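For inputs longer than the 5000-character limit, you would need to split the text across requests. The helper below is a hypothetical sketch (not part of the LMNT SDK) that breaks long text into request-sized chunks, preferring sentence boundaries when one falls inside the limit:

```javascript
// Hypothetical helper (not part of the LMNT SDK): split long input into
// pieces at or under the 5000-character request limit, preferring to
// break at a sentence boundary ('. ') when one falls inside the limit.
function chunkText(text, limit = 5000) {
  const chunks = [];
  let remaining = text;
  while (remaining.length > limit) {
    let cut = remaining.lastIndexOf('. ', limit - 1);
    cut = cut === -1 ? limit : cut + 1; // keep the period with its sentence
    chunks.push(remaining.slice(0, cut).trimEnd());
    remaining = remaining.slice(cut).trimStart();
  }
  if (remaining.length > 0) chunks.push(remaining);
  return chunks;
}
```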

model
enum<string>
default:blizzard

The model to use for synthesis. Learn more about models here.

Available options:
blizzard
language
enum<string>
default:auto

The desired language. Two letter ISO 639-1 code. Defaults to auto language detection, but specifying the language is recommended for faster generation.

Available options:
auto,
ar,
de,
en,
es,
fr,
hi,
id,
it,
ja,
ko,
nl,
pl,
pt,
ru,
sv,
th,
tr,
uk,
ur,
vi,
zh
format
enum<string>
default:mp3

The desired output format of the audio. If you are using a streaming endpoint, you'll generate audio faster by selecting a streamable format since chunks are encoded and returned as they're generated. For non-streamable formats, the entire audio will be synthesized before encoding.

Streamable formats:

  • mp3: 96kbps MP3 audio.
  • ulaw: 8-bit G711 µ-law audio with a WAV header.
  • webm: WebM format with Opus audio codec.
  • pcm_s16le: PCM signed 16-bit little-endian audio.
  • pcm_f32le: PCM 32-bit floating-point little-endian audio.

Non-streamable formats:

  • aac: AAC audio codec.
  • wav: 16-bit PCM audio in WAV container.
Available options:
aac,
mp3,
ulaw,
wav,
webm,
pcm_s16le,
pcm_f32le
sample_rate
enum<number>
default:24000

The desired output sample rate in Hz. Defaults to 24000 for all formats except ulaw, which defaults to 8000.

Available options:
8000,
16000,
24000
seed
integer

Seed used to specify a different take; defaults to a random value.

debug
boolean
default:false

When set to true, the generated speech will also be saved to your clip library in the LMNT playground.

top_p
number
default:0.8

Controls the stability of the generated speech. A lower value (like 0.3) produces more consistent, reliable speech. A higher value (like 0.9) gives more flexibility in how words are spoken, but might occasionally produce unusual intonations or speech patterns.

Required range: 0 <= x <= 1
temperature
number
default:1

Influences how expressive and emotionally varied the speech becomes. Lower values (like 0.3) create more neutral, consistent speaking styles. Higher values (like 1.0) allow for more dynamic emotional range and speaking styles.

Required range: x >= 0
return_durations
boolean
default:false

If set to true, the response will contain a durations object.

Example:

true

Response

OK

audio
string
required

The base64-encoded audio file; the format is determined by the format parameter.

seed
integer
required

The seed used to generate this speech; can be used to replicate this output take (assuming the same text is resynthesized with this seed number, see here for more details).

durations
object[]

A JSON object outlining the spoken duration of each synthesized input element (words and non-words like spaces, punctuation, etc.). See an example of this object for the input string "Hello world!"
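As a sketch of working with this array, the snippet below computes per-element end times and the total spoken length. The element values are illustrative only, not real API output, and the assumption that `start` and `duration` share the same time unit is ours:

```javascript
// Illustrative durations array for "Hello world!"; the values are made
// up for this sketch and are not actual API output.
const durations = [
  { text: 'Hello',  start: 0.0,  duration: 0.42 },
  { text: ' ',      start: 0.42, duration: 0.06 },
  { text: 'world!', start: 0.48, duration: 0.55 },
];

// End time of each element, and the total spoken length of the clip.
const ends = durations.map((d) => d.start + d.duration);
const total = ends[ends.length - 1];
```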