LMNT

Generate speech (binary stream)

speech.generate(**kwargs: SpeechGenerateParams) -> bytes
POST /v1/ai/speech/bytes

Generates speech from text and streams the audio as binary chunks as they are generated.

This is the recommended endpoint for most text-to-speech use cases. You can either stream the chunks for low-latency playback or collect all chunks to get the complete audio file.
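The two consumption patterns described above can be sketched as follows. This is a minimal illustration that assumes the response is exposed as an iterable of byte chunks; `fake_chunks`, `play_as_received`, and `collect_all` are hypothetical names, and `fake_chunks` stands in for the real response so the example runs without network access.

```python
import io
from typing import BinaryIO, Iterable, Iterator


def fake_chunks() -> Iterator[bytes]:
    """Hypothetical stand-in for the streamed response body."""
    yield b"RIFF"
    yield b"\x00\x01"
    yield b"\x02\x03"


def play_as_received(chunks: Iterable[bytes], sink: BinaryIO) -> int:
    """Low-latency pattern: hand each chunk to the sink as soon as it arrives."""
    total = 0
    for chunk in chunks:
        sink.write(chunk)
        total += len(chunk)
    return total


def collect_all(chunks: Iterable[bytes]) -> bytes:
    """Whole-file pattern: join every chunk into one bytes object."""
    return b"".join(chunks)


sink = io.BytesIO()  # could be an audio player's input or an open file
streamed = play_as_received(fake_chunks(), sink)
audio = collect_all(fake_chunks())
```

In the first pattern playback can begin before synthesis finishes; in the second the caller waits for the last chunk but ends up with a single complete buffer.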

Parameters

text
str
required

The text to synthesize; max 5000 characters per request (including spaces).
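Longer inputs have to be split across several requests. One possible client-side approach is sketched below; `split_text` is a hypothetical helper, not part of the SDK, and it breaks at spaces rather than sentence boundaries, which may not be ideal for prosody.

```python
MAX_CHARS = 5000  # per-request limit stated above, spaces included


def split_text(text: str, limit: int = MAX_CHARS) -> list[str]:
    """Split text into pieces of at most `limit` characters."""
    pieces = []
    remaining = text.strip()
    while len(remaining) > limit:
        # Prefer the last space at or before the limit; otherwise cut hard.
        cut = remaining.rfind(" ", 0, limit + 1)
        if cut <= 0:
            cut = limit
        pieces.append(remaining[:cut].rstrip())
        remaining = remaining[cut:].lstrip()
    if remaining:
        pieces.append(remaining)
    return pieces
```

Each returned piece can then be sent as the `text` of its own request, and the resulting audio concatenated or played back in order.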

voice
str
required

The ID of the voice to use; voice IDs can be retrieved by calling List voices or Voice info.

debug
Optional[bool]

When set to true, the generated speech will also be saved to your clip library in the LMNT playground.

format
Optional[Literal["aac", "mp3", "ulaw", "wav", "webm", "pcm_s16le", "pcm_f32le"]]

The desired output format of the audio. If you are using a streaming endpoint, you'll generate audio faster by selecting a streamable format since chunks are encoded and returned as they're generated. For non-streamable formats, the entire audio will be synthesized before encoding.

language
Optional[Literal["auto", "ar", "de", "en", "es", "fr", "hi", "id", "it", "ja", "ko", "nl", "pl", "pt", "ru", "sv", "th", "tr", "uk", "ur", "vi", "zh"]]

The desired language, given as a two-letter ISO 639-1 code. Defaults to automatic language detection, but specifying the language is recommended for faster generation.

model
Optional[Literal["blizzard"]]

The model to use for synthesis.

sample_rate
Optional[Literal[8000, 16000, 24000]]

The desired output sample rate in Hz. Defaults to 24000 for all formats except ulaw, which defaults to 8000.
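The defaulting rule can be mirrored on the client, for example to size audio buffers before the response arrives. The sketch below assumes the mu-law default applies to the `ulaw` value listed under format; `resolve_sample_rate` is a hypothetical helper, and the server remains the source of truth.

```python
from typing import Optional

ALLOWED_SAMPLE_RATES = (8000, 16000, 24000)  # values listed for sample_rate


def resolve_sample_rate(audio_format: str, sample_rate: Optional[int] = None) -> int:
    """Return the sample rate a request would use, applying the documented default."""
    if sample_rate is None:
        # Assumption: "ulaw" is the format the mu-law default refers to.
        return 8000 if audio_format == "ulaw" else 24000
    if sample_rate not in ALLOWED_SAMPLE_RATES:
        raise ValueError(f"sample_rate must be one of {ALLOWED_SAMPLE_RATES}")
    return sample_rate
```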

seed
Optional[int]

Seed used to select a different take; defaults to a random value.

temperature
Optional[float]

Influences how expressive and emotionally varied the speech becomes. Lower values (like 0.3) create more neutral, consistent speaking styles. Higher values (like 1.0) allow for more dynamic emotional range and speaking styles.

top_p
Optional[float]

Controls the stability of the generated speech. A lower value (like 0.3) produces more consistent, reliable speech. A higher value (like 0.9) gives more flexibility in how words are spoken, but might occasionally produce unusual intonations or speech patterns.

Returns

Returns a streaming binary response (bytes).
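For readers calling the REST endpoint directly rather than through the SDK, a request might be assembled roughly as below. This is a sketch using only the Python standard library. The base URL, the `X-API-Key` header name, and the JSON request body are assumptions not stated in this section; `build_request` and `iter_audio` are hypothetical helpers, and the voice ID is a placeholder.

```python
import json
import urllib.request
from typing import Any, Iterator

API_BASE = "https://api.lmnt.com"  # assumption: base URL is not given here
ENDPOINT = "/v1/ai/speech/bytes"   # path documented above


def build_request(api_key: str, text: str, voice: str, **options: Any) -> urllib.request.Request:
    """Build (but do not send) the POST request; options set to None are omitted."""
    payload = {"text": text, "voice": voice}
    payload.update({k: v for k, v in options.items() if v is not None})
    return urllib.request.Request(
        API_BASE + ENDPOINT,
        data=json.dumps(payload).encode("utf-8"),
        # Assumption: JSON body and an X-API-Key authentication header.
        headers={"Content-Type": "application/json", "X-API-Key": api_key},
        method="POST",
    )


def iter_audio(request: urllib.request.Request, chunk_size: int = 4096) -> Iterator[bytes]:
    """Send the request and yield audio bytes as they are read from the response."""
    with urllib.request.urlopen(request) as response:
        while chunk := response.read(chunk_size):
            yield chunk


req = build_request(
    "YOUR_API_KEY",
    "Hello there.",
    "example-voice-id",  # placeholder; use an ID from List voices
    format="mp3",
    language="en",
    seed=None,  # left unset, so the service picks a random take
)
```

Calling `iter_audio(req)` would perform the network request and yield chunks in arrival order, which suits either immediate playback or `b"".join(...)` to obtain the complete audio.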