Generate speech (detailed)
Synthesizes speech from a text string and provides advanced information about the synthesis. Returns a JSON object that contains a base64-encoded audio file, the seed used in speech generation, and optionally an object detailing the duration of each spoken word.
The output of this POST request is a JSON object from which you must extract and decode the base64-encoded audio data. Here is an example of how to do so in your terminal:
jq -r '.audio' lmnt-output.json | base64 --decode > lmnt-audio-output.mp3
The file format of your audio output depends on the format specified in the initial request (this example assumes format=mp3).
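For context, here is a minimal sketch of a request that produces the lmnt-output.json file used above. The endpoint path, the X-API-Key header name, and the JSON field names (inferred from the parameter descriptions below) are assumptions for illustration, as are the placeholder API key and voice id; use the exact values shown in the request samples for this endpoint.
# Assumed endpoint, auth header, and field names; replace with the values from the request samples on this page.
curl -s -X POST https://api.lmnt.com/v1/ai/speech \
  -H 'X-API-Key: YOUR_API_KEY' \
  -H 'Content-Type: application/json' \
  -d '{"voice": "YOUR_VOICE_ID", "text": "Hello world!", "format": "mp3"}' \
  -o lmnt-output.json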
Authorizations
Your API key; get it from your LMNT account page.
Body
The voice id of the voice to use for synthesis; voice ids can be retrieved by calls to List voices or Voice info.
The text to synthesize; max 5000 characters per request (including spaces)
The model to use for synthesis. One of aurora (default) or blizzard. Learn more about models here.
Available options: aurora, blizzard
The desired language of the synthesized speech. Two-letter ISO 639-1 code. Does not work with professional clones or the blizzard model.
Available options: de, en, es, fr, pt, zh, ko, hi
The file format of the synthesized audio output.
Available options: aac, mp3, mulaw, raw, wav
The desired output sample rate in Hz.
Available options: 8000, 16000, 24000
The talking speed of the generated speech, a floating point value between 0.25 (slow) and 2.0 (fast).
Range: 0.25 < x < 2
Seed used to specify a different take; defaults to random
Set this to true to generate conversational-style speech rather than reading-style speech. Does not work with the blizzard model.
Produce speech of this length in seconds; maximum 300.0 (5 minutes). Does not work with the blizzard model.
Range: x < 300
Controls the stability of the generated speech. A lower value (like 0.3) produces more consistent, reliable speech. A higher value (like 0.9) gives more flexibility in how words are spoken, but might occasionally produce unusual intonations or speech patterns.
Range: 0 < x < 1
Influences how expressive and emotionally varied the speech becomes. Lower values (like 0.3) create more neutral, consistent speaking styles. Higher values (like 1.0) allow for more dynamic emotional range and speaking styles.
Range: x > 0
If set as true
, response will contain a durations object.
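To illustrate how the body parameters above combine in a single request, here is a hedged example payload. The JSON field names are inferred from the parameter descriptions and may differ from the exact names this endpoint expects; the values are placeholders, and the endpoint and auth header follow the same assumptions as the earlier sketch.
# Illustrative request body; field names are inferred from the parameter descriptions above.
cat > request.json <<'EOF'
{
  "voice": "YOUR_VOICE_ID",
  "text": "Welcome back. Today we are exploring text-to-speech synthesis.",
  "model": "aurora",
  "language": "en",
  "format": "mp3",
  "sample_rate": 24000,
  "speed": 1.0,
  "conversational": true,
  "return_durations": true
}
EOF
# Same assumed endpoint and auth header as in the earlier sketch.
curl -s -X POST https://api.lmnt.com/v1/ai/speech \
  -H 'X-API-Key: YOUR_API_KEY' \
  -H 'Content-Type: application/json' \
  -d @request.json \
  -o lmnt-output.json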
Response
The base64-encoded audio file; the format is determined by the format parameter.
The seed used to generate this speech; can be used to replicate this output take (assuming the same text is resynthesized with this seed number; see here for more details).
A JSON object outlining the spoken duration of each synthesized input element (words and non-words like spaces, punctuation, etc.). See an example of this object for the input string "Hello world!"
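Assuming the response fields are named seed and durations (matching the .audio field used in the decoding example above), they can be inspected from the same output file with jq:
# Print the seed so this take can be reproduced with the same text.
jq -r '.seed' lmnt-output.json
# Inspect the per-element durations object (only present when durations were requested).
jq '.durations' lmnt-output.json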