Text-to-speech (TTS) is exactly what it sounds like: text goes in, and speech comes out. LMNT’s TTS model is the best in the industry, with consistently low latency (~150 ms between input and output), reliable service (competitors often experience multi-hour outages), and superb speech quality (we infuse the human “LMNT” in our speech models!). It is useful in a variety of contexts, including accessibility services, language learning, and content creation.

Under the hood, the process consists of two main stages: (a) text processing, which analyzes the input text for linguistic structure (including pronunciation, intonation, and rhythm), and (b) speech synthesis, which generates audio output from the processed text, imitating pre-recorded audio.

This example shows how to synthesize a short piece of text using the lily voice.
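
A minimal Python sketch of that call follows. It assumes the lmnt package exposes a Speech client (imported from lmnt.api) that works as an async context manager, that its synthesize(text, voice) method returns a mapping with the binary audio under an 'audio' key, and that the API key is read from an LMNT_API_KEY environment variable. The synthesize_to_file helper and the hello.mp3 file name are illustrative, not part of the SDK; check the reference documentation for the exact signatures and the default audio format.

```python
import asyncio
import os


async def synthesize_to_file(speech, text, voice, path):
    """Synthesize `text` with `voice` and write the audio bytes to `path`.

    `speech` is any object with an async `synthesize(text, voice)` method that
    returns a mapping holding the binary audio under 'audio' (assumed shape).
    Returns the number of bytes written.
    """
    synthesis = await speech.synthesize(text, voice)
    audio = synthesis["audio"]
    with open(path, "wb") as f:
        f.write(audio)
    return len(audio)


async def main():
    # Assumption: the SDK provides `Speech` in `lmnt.api` and accepts the API
    # key as its first argument. Imported here so the helper above can be
    # used and tested without the package installed.
    from lmnt.api import Speech

    async with Speech(os.environ["LMNT_API_KEY"]) as speech:
        size = await synthesize_to_file(speech, "Hello, world.", "lily", "hello.mp3")
    print(f"Wrote {size} bytes to hello.mp3")


# Only contact the service when an API key is configured.
if __name__ == "__main__" and "LMNT_API_KEY" in os.environ:
    asyncio.run(main())
```

Keeping the file-writing logic in a helper that receives the client as an argument makes it straightforward to swap in a different voice or text, or to exercise the code with a stand-in client.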

The same API can be used for more complex use cases, like synchronizing speech with captions or video. See the API reference for the full set of capabilities, or the reference documentation for the synthesize method (Python, Node).
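
As a rough illustration of the captioning case, the sketch below groups per-word timings into caption cues. It assumes timing data is available as a list of entries with text, start, and duration fields (times in seconds). That shape, the build_caption_cues helper, and the max_chars limit are hypothetical, so consult the API reference for the actual response format.

```python
def build_caption_cues(durations, max_chars=32):
    """Group timed words into (start, end, text) caption cues.

    `durations` is a list of dicts with 'text', 'start', and 'duration' keys
    (an assumed shape). A new cue starts whenever adding the next word would
    push the caption past `max_chars` characters.
    """
    cues = []
    words, start, end = [], 0.0, 0.0
    for item in durations:
        word = item["text"].strip()
        if not word:
            continue  # skip whitespace-only or silent entries
        candidate = " ".join(words + [word])
        if words and len(candidate) > max_chars:
            # Current caption is full: emit it and start a new one.
            cues.append((start, end, " ".join(words)))
            words = []
        if not words:
            start = item["start"]
        words.append(word)
        end = item["start"] + item["duration"]
    if words:
        cues.append((start, end, " ".join(words)))
    return cues
```

Each cue carries the start time of its first word and the end time of its last word, which is the information a caption track or video overlay needs to display text in step with the audio.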