Speech
LMNT Speech API Endpoints - Usage Guide
LMNT offers three distinct speech synthesis endpoints, each optimized for different use cases and integration patterns. Choose the right endpoint based on your text availability, latency requirements, and metadata needs.
Here’s a very simple decision tree for these endpoints; read on below for more specific details and tradeoffs:
- Does your text arrive progressively (e.g., from an LLM or chatbot)? Use the WebSocket endpoint, WSS /v1/ai/speech/stream.
- Do you have complete text and need word-level durations, with latency not a concern? Use POST /v1/ai/speech.
- Otherwise, use POST /v1/ai/speech/bytes.
Endpoint Comparison
1. POST /v1/ai/speech/bytes (Generate Stream)
- Best for: Most use cases - flexible streaming or complete audio (see the sketch below)
- Input: Complete text (up to 5,000 characters)
- Output: Binary audio (can be streamed for low latency OR collected as a complete file)
- Latency: Low - audio chunks stream as they’re synthesized
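A minimal sketch of calling this endpoint with Python’s requests library, streaming the response body to disk as it arrives. The base URL (https://api.lmnt.com), the X-API-Key header, and the voice/text body fields are assumptions not specified in this guide; check the API reference for exact names.

```python
# Minimal sketch, assuming the base URL, the X-API-Key header, and the
# "voice"/"text" body fields; consult the API reference for exact names.
import requests

API_KEY = "your-api-key"  # placeholder

resp = requests.post(
    "https://api.lmnt.com/v1/ai/speech/bytes",
    headers={"X-API-Key": API_KEY},
    json={
        "voice": "ava",  # substitute a voice id from your account
        "text": "Hello from LMNT!",
    },
    stream=True,  # expose the body incrementally instead of buffering it
)
resp.raise_for_status()

with open("speech.mp3", "wb") as f:
    # Consume chunks as they arrive for low latency; to collect the
    # complete file instead, simply read resp.content.
    for chunk in resp.iter_content(chunk_size=4096):
        f.write(chunk)
```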
2. POST /v1/ai/speech (Generate Detailed)
- Best for: When you specifically need word-level durations (timestamps) and latency is not a concern (see the sketch below)
- Input: Complete text (up to 5,000 characters)
- Output: JSON response with base64-encoded audio + optional word-level timing information (durations array)
- Latency: Higher - waits for complete synthesis before responding
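A minimal sketch of the detailed endpoint, decoding the base64 audio and printing the durations array. The response field names ("audio", "durations") and the "return_durations" flag are assumptions; verify them against the API reference.

```python
# Minimal sketch; field names "audio"/"durations" and the "return_durations"
# flag are assumptions - verify against the API reference.
import base64
import requests

API_KEY = "your-api-key"  # placeholder

resp = requests.post(
    "https://api.lmnt.com/v1/ai/speech",
    headers={"X-API-Key": API_KEY},
    json={
        "voice": "ava",  # substitute a voice id from your account
        "text": "Hello from LMNT!",
        "return_durations": True,  # assumed flag to request timestamps
    },
)
resp.raise_for_status()
payload = resp.json()

# The audio arrives base64-encoded inside the JSON body; decode it to save.
with open("speech.mp3", "wb") as f:
    f.write(base64.b64decode(payload["audio"]))

# Each entry in the durations array describes one word's timing.
for duration in payload.get("durations", []):
    print(duration)
```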
3. WSS /v1/ai/speech/stream (Speech Session, WebSocket)
- Best for: Real-time applications with progressive text (e.g., voice assistants, chatbots); see the sketch after this list
- Input: Text streamed progressively (as it becomes available)
- Output: Binary audio chunks + optional word-level timing information (durations array)
- Latency: Ultra-low - audio generated as text arrives
- Limitations:
  - More complex to implement
  - Not supported in all environments (some serverless/edge functions)
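A minimal sketch of a speech session using the third-party websockets package. The message framing below (an init message carrying the API key and voice, "text" appends, an "eof" flush) is an assumption modeled on typical streaming TTS protocols; consult the API reference for the real session protocol.

```python
# Minimal sketch using the third-party "websockets" package. The message
# framing (init message with API key/voice, "text" appends, "eof" flush)
# is an assumption - consult the API reference for the real protocol.
import asyncio
import json

import websockets

API_KEY = "your-api-key"  # placeholder

async def main():
    async with websockets.connect("wss://api.lmnt.com/v1/ai/speech/stream") as ws:
        # Open the session with auth and voice selection (assumed shape).
        await ws.send(json.dumps({"X-API-Key": API_KEY, "voice": "ava"}))

        # Feed text progressively, e.g. fragments from an LLM as they stream.
        for fragment in ["Hello ", "from a ", "streaming session."]:
            await ws.send(json.dumps({"text": fragment}))
        await ws.send(json.dumps({"eof": True}))  # assumed end-of-input signal

        # Binary frames carry audio; text frames would carry metadata such as
        # durations. The loop ends when the server closes the connection.
        with open("speech.mp3", "wb") as f:
            async for message in ws:
                if isinstance(message, bytes):
                    f.write(message)
                else:
                    print("metadata:", message)

asyncio.run(main())
```

Because synthesis begins before the full text exists, this pattern minimizes time-to-first-audio, which is why it suits voice assistants and chatbots despite the added implementation complexity.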