LMNT

Generate speech (JSON with metadata)

speech.generateDetailed(body: SpeechGenerateDetailedParams): Promise<unknown>
POST
/v1/ai/speech

Generates speech from text and returns a JSON object that contains a base64-encoded audio string and optionally word-level durations (timestamps). This endpoint waits for the entire synthesis before responding, so it is not ideal for latency-sensitive applications.
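As a rough illustration of the request described above, the sketch below posts a JSON body to /v1/ai/speech using plain fetch. The host (https://api.lmnt.com), the X-API-Key header, and the helper names buildRequest and generateDetailed are assumptions made for this example, not details confirmed by this page; the official SDK call speech.generateDetailed may behave differently.

```typescript
// ASSUMPTIONS: host and auth header are not stated on this page.
const BASE_URL = "https://api.lmnt.com"; // assumed host

interface SpeechGenerateDetailedParams {
  text: string;
  voice: string;
  format?: "aac" | "mp3" | "ulaw" | "wav" | "webm" | "pcm_s16le" | "pcm_f32le";
  language?: string;
  return_durations?: boolean;
  sample_rate?: 8000 | 16000 | 24000;
  seed?: number;
  temperature?: number;
  top_p?: number;
}

interface SpeechGenerateDetailedResponse {
  audio: string; // base64-encoded audio
  durations?: unknown[];
  seed: number;
}

// Pure helper: builds the URL and fetch options without touching the network.
function buildRequest(apiKey: string, body: SpeechGenerateDetailedParams) {
  return {
    url: `${BASE_URL}/v1/ai/speech`,
    init: {
      method: "POST",
      headers: { "Content-Type": "application/json", "X-API-Key": apiKey }, // assumed header name
      body: JSON.stringify(body),
    },
  };
}

// Sends the request and waits for the complete JSON response.
async function generateDetailed(
  apiKey: string,
  body: SpeechGenerateDetailedParams
): Promise<SpeechGenerateDetailedResponse> {
  const { url, init } = buildRequest(apiKey, body);
  const res = await fetch(url, init);
  if (!res.ok) throw new Error(`speech request failed: HTTP ${res.status}`);
  return (await res.json()) as SpeechGenerateDetailedResponse;
}
```

Because the whole clip is synthesized before the response arrives, a caller would typically await generateDetailed once and then decode the audio field, rather than expecting partial results.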

Parameters

text
string
required

The text to synthesize; max 5000 characters per request (including spaces).
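Longer inputs have to be split across several requests to stay under this limit. The following is a minimal sketch of one way to do that; chunkText is a hypothetical helper, and it counts JavaScript string length (UTF-16 code units), which is an assumption about how the service counts characters.

```typescript
const MAX_CHARS = 5000; // per-request limit stated above, spaces included

// Splits text into pieces of at most `limit` characters, preferring to
// break at the last space or newline so that words stay intact.
function chunkText(text: string, limit: number = MAX_CHARS): string[] {
  const chunks: string[] = [];
  let rest = text.trim();
  while (rest.length > limit) {
    // Inspect one character past the limit so a break exactly at the limit is found.
    const head = rest.slice(0, limit + 1);
    let cut = Math.max(head.lastIndexOf(" "), head.lastIndexOf("\n"));
    if (cut <= 0) cut = limit; // no whitespace in range: fall back to a hard split
    chunks.push(rest.slice(0, cut).trimEnd());
    rest = rest.slice(cut).trimStart();
  }
  if (rest.length > 0) chunks.push(rest);
  return chunks;
}
```

Each returned chunk can then be sent as the text of its own request. Splitting at sentence boundaries instead of whitespace would likely sound more natural, at the cost of a slightly more involved helper.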

voice
string
required

The voice id of the voice to use; voice ids can be retrieved by calls to List voices or Voice info.

debug
boolean

When set to true, the generated speech will also be saved to your clip library in the LMNT playground.

format
'aac' | 'mp3' | 'ulaw' | 'wav' | 'webm' | 'pcm_s16le' | 'pcm_f32le'

The desired output format of the audio. If you are using a streaming endpoint, you'll generate audio faster by selecting a streamable format since chunks are encoded and returned as they're generated. For non-streamable formats, the entire audio will be synthesized before encoding.

language
'auto' | 'ar' | 'de' | 'en' | 'es' | 'fr' | 'hi' | 'id' | 'it' | 'ja' | 'ko' | 'nl' | 'pl' | 'pt' | 'ru' | 'sv' | 'th' | 'tr' | 'uk' | 'ur' | 'vi' | 'zh'

The desired language, given as a two-letter ISO 639-1 code. Defaults to automatic language detection, but specifying the language is recommended for faster generation.

model
'blizzard'

The model to use for synthesis. Learn more about models here.

return_durations
boolean

If set to true, the response will contain a durations object.

sample_rate
8000 | 16000 | 24000

The desired output sample rate in Hz. Defaults to 24000 for all formats except ulaw, which defaults to 8000.
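The default rule above can be mirrored client-side when the rate is needed later, for example to estimate the length of raw PCM audio. This is only a sketch: effectiveSampleRate and pcmSeconds are hypothetical helpers, the mu-law format is taken to be the 'ulaw' value from the format list, and single-channel output is an assumption that this page does not confirm.

```typescript
type AudioFormat = "aac" | "mp3" | "ulaw" | "wav" | "webm" | "pcm_s16le" | "pcm_f32le";
type SampleRate = 8000 | 16000 | 24000;

// Returns the explicit rate if one was requested; otherwise applies the
// documented defaults: 8000 Hz for ulaw, 24000 Hz for every other format.
function effectiveSampleRate(format: AudioFormat, requested?: SampleRate): SampleRate {
  if (requested !== undefined) return requested;
  return format === "ulaw" ? 8000 : 24000;
}

// Estimates the length in seconds of headerless PCM audio.
// ASSUMPTION: mono output (one sample per frame).
function pcmSeconds(
  byteLength: number,
  format: "pcm_s16le" | "pcm_f32le",
  rate: SampleRate
): number {
  const bytesPerSample = format === "pcm_s16le" ? 2 : 4; // 16-bit int vs 32-bit float
  return byteLength / (bytesPerSample * rate);
}
```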

seed
number

Seed used to specify a different take; defaults to a random value.

temperature
number

Influences how expressive and emotionally varied the speech becomes. Lower values (like 0.3) create more neutral, consistent speaking styles. Higher values (like 1.0) allow for more dynamic emotional range and speaking styles.

top_p
number

Controls the stability of the generated speech. A lower value (like 0.3) produces more consistent, reliable speech. A higher value (like 0.9) gives more flexibility in how words are spoken, but might occasionally produce unusual intonations or speech patterns.
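To make the interplay of temperature and top_p concrete, the sketch below bundles the example values quoted in the two descriptions above into named presets. The preset names and the withPreset helper are hypothetical, and the values are illustrative starting points rather than recommended settings.

```typescript
interface SamplingParams {
  temperature?: number;
  top_p?: number;
}

// Values come from the examples in the parameter descriptions above;
// the preset names themselves are made up for this sketch.
const PRESETS: Record<"consistent" | "expressive", Required<SamplingParams>> = {
  consistent: { temperature: 0.3, top_p: 0.3 }, // neutral style, stable delivery
  expressive: { temperature: 1.0, top_p: 0.9 }, // wider emotional range, more variation
};

// Returns a copy of the request parameters with the preset applied.
function withPreset<T extends object>(
  params: T,
  preset: keyof typeof PRESETS
): T & SamplingParams {
  return { ...params, ...PRESETS[preset] };
}
```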

Returns

A SpeechGenerateDetailedResponse object (typed as unknown in the signature above) with the following fields:

audio
string
required

The base64-encoded audio file; the format is determined by the format parameter.
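Before the audio can be played or saved, the base64 string has to be decoded back into bytes. A minimal sketch follows, using the standard atob global (available in browsers and recent Node.js versions); decodeAudio is a hypothetical helper name.

```typescript
// Decodes the base64 `audio` field into raw bytes.
function decodeAudio(audioBase64: string): Uint8Array {
  const binary = atob(audioBase64); // one character per decoded byte
  const bytes = new Uint8Array(binary.length);
  for (let i = 0; i < binary.length; i++) {
    bytes[i] = binary.charCodeAt(i);
  }
  return bytes;
}
```

The resulting bytes are in whatever container or encoding the format parameter requested, so they can be written to a file with the matching extension or wrapped in a Blob for playback.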

durations
Array<unknown>

A JSON object outlining the spoken duration of each synthesized input element (words and non-words like spaces, punctuation, etc.). See an example of this object for the input string "Hello world!"
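The element shape of durations is not spelled out on this page (the type is given only as Array<unknown>). The sketch below assumes each element looks like { text, start, duration } with times in seconds; that shape, and the helper names, are assumptions to verify against a real response. It validates each element at runtime and keeps only entries that contain something other than whitespace or common punctuation.

```typescript
// ASSUMED element shape; not confirmed by this page.
interface DurationEntry {
  text: string;
  start: number; // seconds from the beginning of the clip (assumed unit)
  duration: number; // seconds (assumed unit)
}

// Runtime check, since the declared element type is `unknown`.
function isDurationEntry(x: unknown): x is DurationEntry {
  if (typeof x !== "object" || x === null) return false;
  const o = x as Record<string, unknown>;
  return (
    typeof o.text === "string" &&
    typeof o.start === "number" &&
    typeof o.duration === "number"
  );
}

// Keeps word-like entries and converts (start, duration) into (start, end).
function wordTimings(durations: unknown[]): { word: string; start: number; end: number }[] {
  return durations
    .filter(isDurationEntry)
    // Skip entries made up only of whitespace or common ASCII punctuation.
    .filter((d) => /[^\s.,!?;:'"()\-]/.test(d.text))
    .map((d) => ({ word: d.text, start: d.start, end: d.start + d.duration }));
}
```

Timings of this kind are typically used for captions or word highlighting; elements that fail the shape check are silently dropped here, which may or may not be the right policy for a given application.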

seed
number
required

The seed used to generate this speech; it can be used to replicate this output take, assuming the same text is resynthesized with this seed number (see here for more details).
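Reproducing a take therefore amounts to sending the same request again with the returned seed filled in. A small sketch, where replayParams is a hypothetical helper; whether the output is bit-identical may also depend on the other parameters staying unchanged, which this page does not detail.

```typescript
interface TakeParams {
  text: string;
  voice: string;
  seed?: number;
}

// Copies the original request parameters and pins the seed that the
// service reported for the take we want to reproduce.
function replayParams<T extends TakeParams>(original: T, response: { seed: number }): T {
  return { ...original, seed: response.seed };
}
```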