Here’s a simple decision tree for choosing between these endpoints. Read on for the specific details and tradeoffs.

Endpoints Comparison

1. POST /v1/ai/speech/bytes - Generate Stream

Best for: Most use cases - works for both low-latency streaming and complete-file audio

  • Input: Complete text (up to 5,000 characters)
  • Output: Binary audio (can be streamed for low latency or collected as a complete file)
  • Latency: Low - audio chunks stream as they’re synthesized
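
To make the tradeoff concrete, here is a minimal sketch of calling this endpoint with Python's `requests` library and consuming the audio as it streams. The base URL, the `X-API-Key` header name, and the `voice`/`text` field names are illustrative assumptions; check the API reference for the exact schema.

```python
import requests

API_URL = "https://api.example.com/v1/ai/speech/bytes"  # hypothetical base URL
API_KEY = "your-api-key"

# Field names ("voice", "text") are illustrative; see the API reference.
payload = {"voice": "ava", "text": "Hello from the speech API."}

# stream=True lets us consume audio chunks as they arrive instead of
# waiting for the full response body, which is the low-latency path.
with requests.post(
    API_URL,
    json=payload,
    headers={"X-API-Key": API_KEY},  # auth header name is an assumption
    stream=True,
    timeout=30,
) as resp:
    resp.raise_for_status()
    with open("speech.mp3", "wb") as f:
        for chunk in resp.iter_content(chunk_size=4096):
            f.write(chunk)  # play or buffer each chunk as it lands
```

Dropping `stream=True` (or simply reading `resp.content`) yields the same bytes as one complete file, which is the "collected" mode described above.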

2. POST /v1/ai/speech - Generate Detailed

Best for: When you specifically need word-level durations (timestamps) and latency is not a concern

  • Input: Complete text (up to 5,000 characters)
  • Output: JSON response with base64-encoded audio + optional word-level timing information (durations array)
  • Latency: Higher - waits for complete synthesis before response
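
A hedged sketch of this endpoint follows. The `return_durations` request flag and the `audio`/`durations` response field names are assumptions for illustration; only the existence of base64-encoded audio and a durations array comes from the description above.

```python
import base64

import requests

API_URL = "https://api.example.com/v1/ai/speech"  # hypothetical base URL
API_KEY = "your-api-key"

payload = {
    "voice": "ava",
    "text": "Hello from the speech API.",
    "return_durations": True,  # illustrative flag for word-level timing
}

# No streaming here: the server synthesizes the full clip before replying,
# which is why this endpoint has higher latency.
resp = requests.post(API_URL, json=payload, headers={"X-API-Key": API_KEY}, timeout=60)
resp.raise_for_status()
body = resp.json()

# The complete audio arrives base64-encoded inside the JSON body.
with open("speech.mp3", "wb") as f:
    f.write(base64.b64decode(body["audio"]))

# Each durations entry describes one word's timing, useful for
# captioning or karaoke-style word highlighting.
for word_timing in body.get("durations", []):
    print(word_timing)
```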

3. WSS /v1/ai/speech/stream - Speech Session (WebSocket)

Best for: Real-time applications that produce text progressively (e.g., voice assistants, chatbots)

  • Input: Text streamed progressively (as it becomes available)
  • Output: Binary audio chunks + optional word-level timing information (durations array)
  • Latency: Ultra-low - audio is generated as text arrives
  • Limitations:
    • More complex to implement
    • Not supported in all environments (some serverless/edge functions)
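
Below is a minimal sketch of a session using the `websockets` package, with one task feeding text fragments (as an LLM might produce them) and another consuming audio as it is synthesized. The message shapes (the initial auth/voice message, `text` fragments, and an `eof` signal) are hypothetical; the actual protocol defines its own message schema.

```python
import asyncio
import json

import websockets  # pip install websockets

WS_URL = "wss://api.example.com/v1/ai/speech/stream"  # hypothetical base URL
API_KEY = "your-api-key"

async def main():
    async with websockets.connect(WS_URL) as ws:
        # Hypothetical session-setup message; the real auth flow may differ.
        await ws.send(json.dumps({"api_key": API_KEY, "voice": "ava"}))

        async def send_text():
            # Simulate text arriving progressively, e.g. from an LLM.
            for fragment in ["Hello ", "from a ", "streaming session."]:
                await ws.send(json.dumps({"text": fragment}))
                await asyncio.sleep(0.2)
            # Hypothetical end-of-input signal.
            await ws.send(json.dumps({"eof": True}))

        async def recv_audio():
            with open("speech.mp3", "wb") as f:
                # Iteration ends when the server closes the connection.
                async for msg in ws:
                    if isinstance(msg, bytes):
                        f.write(msg)  # binary audio chunk
                    else:
                        print("metadata:", msg)  # e.g. a durations array as JSON

        # Sending and receiving run concurrently; this overlap of text-in
        # and audio-out is what makes the ultra-low latency possible.
        await asyncio.gather(send_text(), recv_audio())

asyncio.run(main())
```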