Endpoint Comparison
1. POST /v1/ai/speech/bytes - Generate Stream
- Best for: most use cases - flexible streaming or a complete audio file
- Input: Complete text (up to 5,000 characters)
- Output: Binary audio (can be streamed for low latency OR collected as complete file)
- Latency: Low - audio chunks stream as they’re synthesized
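A minimal client sketch for this endpoint, using only the Python standard library. The base URL, auth header, and payload field names (`text`, `voice`) are assumptions for illustration; check the API reference for the exact request shape.

```python
import json
import urllib.request

API_BASE = "https://api.example.com"  # hypothetical base URL; substitute your host


def build_request(text: str, voice: str = "default") -> urllib.request.Request:
    """Build a POST /v1/ai/speech/bytes request (payload fields are assumptions)."""
    if len(text) > 5000:
        raise ValueError("text exceeds the 5,000-character limit")
    body = json.dumps({"text": text, "voice": voice}).encode()
    return urllib.request.Request(
        f"{API_BASE}/v1/ai/speech/bytes",
        data=body,
        headers={
            "Content-Type": "application/json",
            "Authorization": "Bearer YOUR_API_KEY",  # placeholder credential
        },
        method="POST",
    )


def stream_to_file(req: urllib.request.Request, path: str, chunk_size: int = 8192) -> None:
    """Write audio chunks to disk as they arrive (low latency); the same loop
    could instead feed a player, or you can read the whole body at once."""
    with urllib.request.urlopen(req) as resp, open(path, "wb") as out:
        while chunk := resp.read(chunk_size):
            out.write(chunk)
```

Because the response is raw binary audio, the choice between streaming and collecting is entirely client-side: consume chunks as they arrive, or buffer everything before playback.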
2. POST /v1/ai/speech - Generate Detailed
- Best for: when you specifically need word-level durations (timestamps) and latency is not a concern
- Input: Complete text (up to 5,000 characters)
- Output: JSON response with base64-encoded audio + optional word-level timing information (durations array)
- Latency: Higher - waits for complete synthesis before responding
3. WSS /v1/ai/speech/stream - Speech Session (WebSocket)
- Best for: real-time applications with progressive text (e.g., voice assistants, chatbots)
- Input: Text streamed progressively (as it becomes available)
- Output: Binary audio chunks + optional word-level timing information (durations array)
- Latency: Ultra-low - audio is generated as text arrives
- Limitations:
- More complex to implement
- Not supported in all environments (some serverless/edge functions)
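The WebSocket session above can be sketched as message framing plus a send/receive loop. The message schema (`type`, `text`, `flush`) and URL are assumptions for illustration only; the session loop is shown as commented pseudocode because it depends on your WebSocket client (e.g. the third-party `websockets` package).

```python
import json

WS_URL = "wss://api.example.com/v1/ai/speech/stream"  # hypothetical host


def text_chunk_message(chunk: str) -> str:
    """Frame a progressive text chunk; the field names are assumptions."""
    return json.dumps({"type": "text", "text": chunk})


def flush_message() -> str:
    """Signal that no more text is coming for the current utterance."""
    return json.dumps({"type": "flush"})


# Sketch of the session loop with a hypothetical async WebSocket client:
#
#   async with websockets.connect(WS_URL, extra_headers=auth) as ws:
#       for chunk in llm_token_stream():          # text as it becomes available
#           await ws.send(text_chunk_message(chunk))
#           audio = await ws.recv()               # binary audio chunks arrive
#       await ws.send(flush_message())            # flush the final partial text
```

This interleaving of sends and receives is why the latency is ultra-low, and also why the endpoint is harder to implement than the HTTP ones: the client must manage a long-lived, bidirectional connection, which some serverless/edge runtimes do not allow.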