POST /v1/ai/speech/bytes
curl --request POST \
  --url https://api.lmnt.com/v1/ai/speech/bytes \
  --header 'Content-Type: multipart/form-data' \
  --header 'X-API-Key: <api-key>' \
  --form voice=ava \
  --form 'text=hello world.' \
  --form model=aurora \
  --form language=en \
  --form format=mp3 \
  --form sample_rate=24000 \
  --form speed=1 \
  --form seed=123 \
  --form conversational=false \
  --form length=123 \
  --form top_p=1 \
  --form temperature=1

Want to stream timestamps with your speech? Check out the streaming WebSocket endpoint and examples using the SDKs in the synchronizing timing guide.

Authorizations

X-API-Key
string
header
required

Your API key; get it from your LMNT account page.

Body

multipart/form-data
voice
string
required

The ID of the voice to use for synthesis. Voice IDs can be retrieved with the List voices or Voice info endpoints.

text
string
required

The text to synthesize; max 5000 characters per request (including spaces)

model
enum<string>
default: aurora

The model to use for synthesis. One of aurora (default) or blizzard. Learn more about models here.

Available options: aurora, blizzard
language
enum<string>
default: en

The desired language of the synthesized speech, as a two-letter ISO 639-1 code. Does not work with professional clones or the blizzard model.

Available options: de, en, es, fr, pt, zh, ko, hi
format
enum<string>
default: mp3

The file format of the synthesized audio output.

Available options: aac, mp3, mulaw, raw, wav
sample_rate
enum<number>
default: 24000

The desired output sample rate in Hz.

Available options: 8000, 16000, 24000
speed
number
default: 1

The talking speed of the generated speech, a floating point value between 0.25 (slow) and 2.0 (fast).

Required range: 0.25 < x < 2
seed
integer

Seed used to specify a different take. Defaults to a random value.

conversational
boolean
default: false

Set this to true to generate conversational-style speech rather than reading-style speech. Does not work with the blizzard model.

length
number

Produce speech of this length in seconds; maximum 300.0 (5 minutes). Does not work with the blizzard model.

Required range: x < 300
top_p
number
default: 1

Controls the stability of the generated speech. A lower value (like 0.3) produces more consistent, reliable speech. A higher value (like 0.9) gives more flexibility in how words are spoken, but might occasionally produce unusual intonations or speech patterns.

Required range: 0 < x < 1
temperature
number
default: 1

Influences how expressive and emotionally varied the speech becomes. Lower values (like 0.3) create more neutral, consistent speaking styles. Higher values (like 1.0) allow for more dynamic emotional range and speaking styles.

Required range: x > 0
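
The constraints in the Body section can be checked on the client before a request is sent. The following is a minimal sketch, not part of the LMNT SDK: `validate_speech_params` is a hypothetical helper name, the allowed values are copied from the lists on this page, and the numeric bounds are treated as inclusive. That last point is an assumption, since the page prints strict inequalities while the documented default `top_p` of 1 and the 300.0 second maximum sit exactly on a bound. The page also does not say whether the server rejects or ignores options that do not work with the blizzard model, so the helper only reports them.

```python
# Allowed values copied from the parameter descriptions on this page.
MAX_TEXT_CHARS = 5000
ENUMS = {
    "model": {"aurora", "blizzard"},
    "language": {"de", "en", "es", "fr", "pt", "zh", "ko", "hi"},
    "format": {"aac", "mp3", "mulaw", "raw", "wav"},
    "sample_rate": {8000, 16000, 24000},
}
# (low, high) bounds; None means unbounded. Inclusive bounds are an assumption.
RANGES = {
    "speed": (0.25, 2.0),
    "length": (None, 300.0),
    "top_p": (0.0, 1.0),
    "temperature": (0.0, None),
}


def validate_speech_params(params: dict) -> list[str]:
    """Return a list of problems found in `params`; an empty list means none were found."""
    errors = []
    # voice and text are the only required fields.
    for name in ("voice", "text"):
        if not params.get(name):
            errors.append(f"{name} is required")
    if len(params.get("text") or "") > MAX_TEXT_CHARS:
        errors.append(f"text exceeds {MAX_TEXT_CHARS} characters")
    # Enumerated fields must use one of the listed options.
    for name, allowed in ENUMS.items():
        if name in params and params[name] not in allowed:
            errors.append(f"{name} must be one of {sorted(allowed)}")
    # Numeric fields must fall inside the documented range.
    for name, (low, high) in RANGES.items():
        if name not in params:
            continue
        value = params[name]
        if (low is not None and value < low) or (high is not None and value > high):
            errors.append(f"{name}={value} is outside the documented range")
    # Options documented as not working with the blizzard model.
    if params.get("model") == "blizzard":
        for name in ("language", "length"):
            if name in params:
                errors.append(f"{name} does not work with the blizzard model")
        if params.get("conversational"):
            errors.append("conversational does not work with the blizzard model")
    return errors
```

For example, `validate_speech_params({"voice": "ava", "text": "hello world."})` returns an empty list, while a request that sets `model` to `blizzard` together with `length` produces a message naming the conflict.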

Response

200
application/octet-stream
OK

The response body is the synthesized audio, returned as a binary file in the requested format.
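
As a rough Python counterpart to the curl example, the sketch below sends the form fields as `multipart/form-data` and writes the returned bytes to disk, using only the standard library. The endpoint URL and the `X-API-Key` header come from this page; `encode_multipart` and `synthesize_to_file` are hypothetical helper names, and the sketch assumes every field is a simple text value. Error handling is left to `urllib`, which raises `urllib.error.HTTPError` on a non-2xx status.

```python
import urllib.request
import uuid

API_URL = "https://api.lmnt.com/v1/ai/speech/bytes"  # taken from the curl example


def encode_multipart(fields: dict) -> tuple[bytes, str]:
    """Encode plain text fields as multipart/form-data; returns (body, content_type)."""
    boundary = uuid.uuid4().hex
    lines = []
    for name, value in fields.items():
        # Send booleans as lowercase true/false, matching the curl example.
        if isinstance(value, bool):
            value = "true" if value else "false"
        lines.append(f"--{boundary}")
        lines.append(f'Content-Disposition: form-data; name="{name}"')
        lines.append("")
        lines.append(str(value))
    lines.append(f"--{boundary}--")
    lines.append("")
    body = "\r\n".join(lines).encode("utf-8")
    return body, f"multipart/form-data; boundary={boundary}"


def synthesize_to_file(api_key: str, fields: dict, out_path: str) -> int:
    """POST the fields to the endpoint and save the audio; returns the byte count."""
    body, content_type = encode_multipart(fields)
    request = urllib.request.Request(
        API_URL,
        data=body,
        method="POST",
        headers={"Content-Type": content_type, "X-API-Key": api_key},
    )
    # The 200 response is application/octet-stream: the raw audio bytes.
    with urllib.request.urlopen(request) as response:
        audio = response.read()
    with open(out_path, "wb") as f:
        f.write(audio)
    return len(audio)
```

A call might look like `synthesize_to_file("<api-key>", {"voice": "ava", "text": "hello world.", "format": "mp3"}, "hello.mp3")`. Choosing an output file extension that matches the `format` field keeps the saved file playable without renaming.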
