Synthesize speech
Synthesizes speech from a text string and provides advanced information about the synthesis. Returns a JSON object that contains a base64-encoded audio file, the seed used in speech generation, and optionally an object detailing the duration of each spoken word.
Specify either speed or length, not both; a request that includes both will result in a 500 server error, since the desired speed might not match the desired length.
The output of this POST request is a JSON object from which you must extract and decode the base64-encoded audio data. Here is an example of how to do so in your terminal:
jq -r '.audio' lmnt-output.json | base64 --decode > lmnt-audio-output.mp3
The file format of your audio output depends on the format specified in the initial request (this example assumes format=mp3).
Authorizations
Your API key; get it from your LMNT account page.
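For example, the key can be stored in an environment variable and sent with every request. A minimal sketch, assuming the key is passed as an X-API-Key header (the header name is an assumption; confirm it against the authorization details for this endpoint):

export LMNT_API_KEY="your-api-key-here"
# Each request then includes the key as a header, e.g. -H "X-API-Key: $LMNT_API_KEY"
# (see the full request sketch after the Body section below)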
Body
The voice id of the voice to use for synthesis; voice ids can be retrieved by calls to List voices or Voice info.
The text to synthesize; max 5000 characters per request (including spaces).
The desired language of the synthesized speech. Two-letter ISO 639-1 code. One of de, en, es, fr, pt, zh, ko, hi. Does not work with professional clones or the blizzard model.
The model to use for synthesis. One of aurora (default) or blizzard. Learn more about models here.
The file format of the synthesized audio output, one of aac, mp3, mulaw, raw, or wav.
Set this to true to generate conversational-style speech rather than reading-style speech. Does not work with the blizzard model.
The desired output sample rate in Hz, one of 8000, 16000, or 24000; defaults to 24000 for all formats except mulaw, which defaults to 8000.
The talking speed of the generated speech, a floating-point value between 0.25 (slow) and 2.0 (fast).
Produce speech of this length in seconds; maximum 300.0 (5 minutes). Does not work with the blizzard model.
If set to true, the response will contain a durations object; see its definition in the Response section below.
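Putting these parameters together, a request might look like the sketch below. This is illustrative only: the endpoint URL, the X-API-Key header name, and the JSON field names (voice, text, language, model, format, sample_rate, speed, return_durations) are assumptions based on the parameter descriptions above, not a verbatim copy of this endpoint's schema.

curl -s -X POST https://api.lmnt.com/v1/ai/speech \
  -H "X-API-Key: $LMNT_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "voice": "your-voice-id",
    "text": "Hello world!",
    "language": "en",
    "model": "aurora",
    "format": "mp3",
    "sample_rate": 24000,
    "speed": 1.0,
    "return_durations": true
  }' > lmnt-output.json

The saved lmnt-output.json can then be decoded with the jq command shown at the top of this page.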
Response
The base64-encoded audio file; the format is determined by the format parameter.
The seed used to generate this speech; it can be used to replicate this output take (assuming the same text is resynthesized with this seed number; see here for more details).
A JSON object outlining the spoken duration of each synthesized input element (words and non-words like spaces, punctuation, etc.). See an illustrative example for the input string "Hello world!" below.
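As an illustrative sketch only (the field names text, start, and duration and the timing values are assumptions, not taken from this page), extracting the durations data from a response for "Hello world!" might print something like:

jq '.durations' lmnt-output.json
[
  { "text": "Hello", "start": 0.00, "duration": 0.42 },
  { "text": " ",     "start": 0.42, "duration": 0.05 },
  { "text": "world", "start": 0.47, "duration": 0.50 },
  { "text": "!",     "start": 0.97, "duration": 0.10 }
]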