Streaming lets you send text and receive synthesized speech concurrently, in real time. This is useful for applications such as customer support bots, virtual assistants, and live captioning and transcription. Streaming minimizes the latency between input and output, providing a seamless and natural user experience.

Setup

Below is an example of linking ChatGPT with our streaming API. Before starting:

  1. Ensure you’ve completed the Environment setup section.
  2. Since we’re using OpenAI, you’ll also need an OpenAI API key (log in, click Create new secret key, and copy the key).

Summary

Here is a summary of what the code below does.

  1. Creates a streaming connection with the synthesize_streaming method.
  2. Sends text to the server using appendText and concurrently reads synthesized speech from the server.
  3. The server buffers text and synthesizes speech once it has enough.
  4. Repeats step 2 until there is no more text to send.
  5. Calls flush or finish to tell the server to synthesize speech for all the text it still has buffered.
  6. Closes the connection by calling close.

Concurrent streaming

We’ll use two tasks to handle the streaming data: one to read from ChatGPT and write to LMNT, and another to read from LMNT and write to a file. Both of these tasks are asynchronous and run concurrently.
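
The two-task pattern can be sketched without any network access. The example below is a minimal illustration, not the LMNT client: FakeSpeechConnection, fake_chat_tokens, and the in-memory sink are hypothetical stand-ins for the LMNT connection, the ChatGPT stream, and the output file. The fake connection also skips buffering and turns each chunk of text into "audio" bytes right away.

```python
import asyncio


class FakeSpeechConnection:
    """Hypothetical in-memory stand-in for a streaming speech connection."""

    def __init__(self):
        self._audio = asyncio.Queue()

    async def append_text(self, text):
        # Assumption: each text chunk becomes "audio" bytes immediately.
        await self._audio.put(text.encode("utf-8"))

    async def finish(self):
        # A None sentinel marks the end of the audio stream.
        await self._audio.put(None)

    async def audio_chunks(self):
        while True:
            chunk = await self._audio.get()
            if chunk is None:
                return
            yield chunk


async def fake_chat_tokens():
    """Hypothetical stand-in for a streamed chat completion."""
    for token in ["Hello", ", ", "world", "!"]:
        await asyncio.sleep(0)  # yield control, as a network read would
        yield token


async def writer_task(conn):
    # Task 1: read from the chat stream and write to the speech connection.
    async for token in fake_chat_tokens():
        await conn.append_text(token)
    await conn.finish()


async def reader_task(conn, sink):
    # Task 2: read audio from the speech connection and write it to the sink.
    async for chunk in conn.audio_chunks():
        sink.extend(chunk)


async def main():
    conn = FakeSpeechConnection()
    sink = bytearray()  # stands in for the output file
    await asyncio.gather(writer_task(conn), reader_task(conn, sink))
    return bytes(sink)


if __name__ == "__main__":
    print(asyncio.run(main()))
```

Running both tasks under asyncio.gather means the reader can consume audio while the writer is still sending text, which is the point of streaming.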

Server buffering

When text is sent to the LMNT servers via appendText, the server will not synthesize any speech until enough text has been received. We do this to gather as much context as possible so that we can generate more natural-sounding speech. The emotion and style in which a portion of text is spoken can vary according to the entire context of a sentence, so the server will wait for additional text as appropriate. Once the server has enough text buffered, it will synthesize speech segments and return them to you.
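
To make the buffering behavior concrete, here is a toy model of it. The real server's rule for deciding when it has "enough" text is not described here, so this sketch assumes a simple one: only sentences that are followed by more text are released. ToyTextBuffer and its attributes are hypothetical names used for illustration only.

```python
import re


class ToyTextBuffer:
    """Toy model of server-side buffering; not the actual LMNT logic."""

    def __init__(self):
        self.pending = ""

    def append_text(self, text):
        """Add text and return the sentences that are ready to synthesize."""
        self.pending += text
        # Assumption: a sentence is "complete" once its closing punctuation
        # is followed by whitespace, meaning more text has arrived after it.
        parts = re.split(r"(?<=[.!?])\s+", self.pending)
        self.pending = parts.pop()  # the last piece may still be incomplete
        return parts


buf = ToyTextBuffer()
first = buf.append_text("Hello")             # nothing ready yet
second = buf.append_text(" there. How are")  # first sentence is released
third = buf.append_text(" you?")             # still waiting for more context
```

After the last call, buf.pending still holds "How are you?". In this toy model, as on the server, the final piece of text stays buffered until something signals that no more text is coming.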

As a result, just sending text may not immediately yield speech. This is where flush and finish come in.

Flushing the server buffer

Both flush and finish signal to the server that it should start synthesizing speech from the text it has received so far, rather than waiting for more text to fill out the speech context.

Let’s say that you have some text you want synthesized, so you stream text to the LMNT servers. If the server has not yet received enough text, it will buffer the text and wait for more. Even if you are streaming enough text to get the servers to start synthesizing speech and streaming it back to you, once you are done sending text, the server will still retain some buffered text as it waits for more. You must notify the server that you are done by sending either a flush or a finish to receive that last chunk of speech.

Flush vs Finish

So what’s the difference between flush and finish?

flush and finish both signal to the server that it should synthesize all the text it currently has. However, finish also signals to the server that it should close the connection after it has finished synthesizing.
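
The difference can be modeled in a few lines. This is a sketch of the semantics described above, not the LMNT client: ToyConnection, is_open, and the placeholder speech string are hypothetical, and the error raised on sending after finish is an assumption about how a closed connection would behave.

```python
class ToyConnection:
    """Hypothetical model of connection state; not the LMNT client."""

    def __init__(self):
        self.buffered = []
        self.is_open = True

    def append_text(self, text):
        if not self.is_open:
            # Assumption: a closed connection rejects further text.
            raise RuntimeError("connection is closed")
        self.buffered.append(text)

    def _synthesize_all(self):
        # Drain the buffer and return a placeholder for the audio.
        text, self.buffered = "".join(self.buffered), []
        return f"<speech for {text!r}>"

    def flush(self):
        # Synthesize everything buffered; the connection stays open.
        return self._synthesize_all()

    def finish(self):
        # Synthesize everything buffered, then close the connection.
        speech = self._synthesize_all()
        self.is_open = False
        return speech


conn = ToyConnection()
conn.append_text("Hi. ")
after_flush = conn.flush()    # speech returned, connection still open
open_after_flush = conn.is_open
conn.append_text("Bye.")
after_finish = conn.finish()  # speech returned, connection now closed
```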

As a result, flush is critical when you are momentarily done sending text and want speech returned to you, but do not want the connection closed yet. In applications where latency matters, you may not want to repeatedly incur the cost of setting up a new WebSocket connection. flush lets you keep a connection open while controlling when the server synthesizes its text.

Let’s say that you are building a chatbot, and you implement it so that a single connection is used throughout an entire conversation. When you want the bot to speak, you send the text to LMNT and then call flush to force the server to synthesize all of that text. The connection remains open, so the next time you want to synthesize speech, you send more text to LMNT and call flush again. In this scenario, you are ensuring that speech is returned to you as quickly as possible.

Now consider the alternative, where you create a new connection for each message and close it each time with finish. The chatbot would respond much more slowly, because every message pays the connection setup cost again.
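
A rough back-of-the-envelope comparison shows how the setup cost adds up. The function below is a hypothetical helper, and the 150 ms setup time and 10-message conversation are made-up figures for illustration, not measurements of LMNT.

```python
def total_setup_overhead_ms(num_messages, setup_ms, reuse_connection):
    """Total time spent opening connections over one conversation."""
    if reuse_connection:
        # flush: one connection serves every message in the conversation.
        connections_opened = min(num_messages, 1)
    else:
        # finish: each message opens (and closes) its own connection.
        connections_opened = num_messages
    return connections_opened * setup_ms


# Illustrative numbers only: 10 messages, 150 ms assumed setup time.
with_flush = total_setup_overhead_ms(10, 150, reuse_connection=True)    # 150
with_finish = total_setup_overhead_ms(10, 150, reuse_connection=False)  # 1500
```

Under these assumptions the per-message approach spends ten times as long on connection setup, and the gap grows with the length of the conversation.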

Make sure you call either flush or finish at the end of your text stream to ensure the server synthesizes all the speech you expected.