Here’s a simple decision tree for choosing between these endpoints. Read on for the specific details and tradeoffs.

Endpoints Comparison

1. POST /v1/ai/speech/bytes - Generate Stream

Best for: Most use cases - works for both low-latency streaming and complete-file audio

  • Input: Complete text (up to 5,000 characters)
  • Output: Binary audio (can be streamed for low latency or collected as a complete file)
  • Latency: Low - audio chunks stream as they’re synthesized
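
To make the tradeoff concrete, here is a minimal sketch of calling this endpoint with Python's `requests` library and consuming the audio as it streams. The base URL, the `X-API-Key` header name, and the `voice`/`text` field names are illustrative assumptions; check the API reference for the exact schema.

```python
import requests

API_URL = "https://api.example.com/v1/ai/speech/bytes"  # hypothetical base URL
API_KEY = "your-api-key"

# Field names ("voice", "text") are illustrative; see the API reference.
payload = {"voice": "ava", "text": "Hello from the speech API."}

# stream=True lets us consume audio chunks as they arrive instead of
# waiting for the full response body, which is the low-latency path.
with requests.post(
    API_URL,
    json=payload,
    headers={"X-API-Key": API_KEY},  # auth header name is an assumption
    stream=True,
    timeout=30,
) as resp:
    resp.raise_for_status()
    with open("speech.mp3", "wb") as f:
        for chunk in resp.iter_content(chunk_size=4096):
            f.write(chunk)  # play or buffer each chunk as it lands
```

Dropping `stream=True` (or simply reading `resp.content`) yields the same bytes as one complete file, which is the "collected" mode described above.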

2. POST /v1/ai/speech - Generate Detailed

Best for: When you specifically need word-level durations (timestamps) and latency is not a concern

  • Input: Complete text (up to 5,000 characters)
  • Output: JSON response with base64-encoded audio + optional word-level timing information (durations array)
  • Latency: Higher - waits for complete synthesis before response
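
A hedged sketch of this endpoint follows. The `return_durations` request flag and the `audio`/`durations` response field names are assumptions for illustration; only the existence of base64-encoded audio and a durations array comes from the description above.

```python
import base64

import requests

API_URL = "https://api.example.com/v1/ai/speech"  # hypothetical base URL
API_KEY = "your-api-key"

payload = {
    "voice": "ava",
    "text": "Hello from the speech API.",
    "return_durations": True,  # illustrative flag for word-level timing
}

# No streaming here: the server synthesizes the full clip before replying,
# which is why this endpoint has higher latency.
resp = requests.post(API_URL, json=payload, headers={"X-API-Key": API_KEY}, timeout=60)
resp.raise_for_status()
body = resp.json()

# The complete audio arrives base64-encoded inside the JSON body.
with open("speech.mp3", "wb") as f:
    f.write(base64.b64decode(body["audio"]))

# Each durations entry describes one word's timing, useful for
# captioning or karaoke-style word highlighting.
for word_timing in body.get("durations", []):
    print(word_timing)
```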

3. WSS /v1/ai/speech/stream - Speech Session (WebSocket)

Best for: Real-time applications that produce text progressively (e.g., voice assistants, chatbots)

  • Input: Text streamed progressively (as it becomes available)
  • Output: Binary audio chunks + optional word-level timing information (durations array)
  • Latency: Ultra-low - audio is generated as text arrives
  • Limitations:
    • More complex to implement
    • Not supported in all environments (some serverless/edge functions)
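
Below is a minimal sketch of a session using the `websockets` package, with one task feeding text fragments (as an LLM might produce them) and another consuming audio as it is synthesized. The message shapes (the initial auth/voice message, `text` fragments, and an `eof` signal) are hypothetical; the actual protocol defines its own message schema.

```python
import asyncio
import json

import websockets  # pip install websockets

WS_URL = "wss://api.example.com/v1/ai/speech/stream"  # hypothetical base URL
API_KEY = "your-api-key"

async def main():
    async with websockets.connect(WS_URL) as ws:
        # Hypothetical session-setup message; the real auth flow may differ.
        await ws.send(json.dumps({"api_key": API_KEY, "voice": "ava"}))

        async def send_text():
            # Simulate text arriving progressively, e.g. from an LLM.
            for fragment in ["Hello ", "from a ", "streaming session."]:
                await ws.send(json.dumps({"text": fragment}))
                await asyncio.sleep(0.2)
            # Hypothetical end-of-input signal.
            await ws.send(json.dumps({"eof": True}))

        async def recv_audio():
            with open("speech.mp3", "wb") as f:
                # Iteration ends when the server closes the connection.
                async for msg in ws:
                    if isinstance(msg, bytes):
                        f.write(msg)  # binary audio chunk
                    else:
                        print("metadata:", msg)  # e.g. a durations array as JSON

        # Sending and receiving run concurrently; this overlap of text-in
        # and audio-out is what makes the ultra-low latency possible.
        await asyncio.gather(send_text(), recv_audio())

asyncio.run(main())
```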