Streaming example
Streaming is the ability to receive text input and produce speech output concurrently, in real time. This is useful for customer support bots, virtual assistants, and live captioning and transcription, among other cases. Streaming minimizes the latency between input and output, providing a seamless and natural user experience.
Setup
Below is an example of linking ChatGPT with our streaming API. Before starting:
- Ensure you’ve completed the Environment setup section.
- Since we’re using OpenAI, you’ll also need an OpenAI API key (log in, click Create new secret key, and copy the key).
Summary
Here is a summary of what the code below does.
1. Creates a streaming connection with the synthesize_streaming method.
2. Sends text to the server using appendText and concurrently reads synthesized speech from the server. The server buffers text and synthesizes speech when it has enough.
3. Repeats step 2 until you have no more text to send.
4. Calls flush or finish to tell the server to synthesize speech for all the text it still has buffered.
5. Closes the connection by calling close.
Concurrent streaming
We’ll use two tasks to handle the streaming data: one to read from ChatGPT and write to LMNT, and another to read from LMNT and write to a file. Both of these tasks are asynchronous and run concurrently.
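The two-task pattern can be sketched with asyncio. This is a minimal illustration, not the LMNT SDK: FakeConnection, fake_chat_stream, and the method name append_text are hypothetical stand-ins that run entirely in memory, and the "audio" is just the uppercased text as bytes. One task plays the role of reading from ChatGPT and writing to the connection; the other reads audio from the connection and writes it to a sink that stands in for a file.

```python
import asyncio


class FakeConnection:
    """In-memory stand-in for a streaming connection (hypothetical, not the LMNT SDK)."""

    def __init__(self):
        self._audio = asyncio.Queue()

    async def append_text(self, text):
        # Pretend every chunk is synthesized immediately; real servers buffer.
        await self._audio.put(text.upper().encode())

    async def finish(self):
        # Sentinel telling the reader that no more audio will arrive.
        await self._audio.put(None)

    def __aiter__(self):
        return self

    async def __anext__(self):
        chunk = await self._audio.get()
        if chunk is None:
            raise StopAsyncIteration
        return chunk


async def fake_chat_stream():
    """Stand-in for a streamed chat completion: yields text piece by piece."""
    for piece in ["Hello ", "there, ", "world."]:
        await asyncio.sleep(0)  # yield control, as a network read would
        yield piece


async def writer_task(conn):
    # Task 1: read text from the chat stream and send it to the connection.
    async for piece in fake_chat_stream():
        await conn.append_text(piece)
    await conn.finish()


async def reader_task(conn, sink):
    # Task 2: read audio from the connection and write it to the sink.
    async for audio in conn:
        sink.extend(audio)


async def main():
    conn = FakeConnection()
    sink = bytearray()  # stands in for an open audio file
    await asyncio.gather(writer_task(conn), reader_task(conn, sink))
    return bytes(sink)


if __name__ == "__main__":
    print(asyncio.run(main()))
```

Because both tasks run under asyncio.gather, the reader can start consuming audio while the writer is still sending text, which is the point of streaming.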
Server buffering
When text is sent to the LMNT servers via appendText, the server will not
synthesize any speech until enough text has been received. We do this to gather
as much context as possible so that we can generate more natural-sounding
speech. The emotion and style in which a portion of text is spoken can vary
according to the entire context of a sentence, so the server will wait for
additional text as appropriate. Once the server has enough text buffered, it
will synthesize speech segments and return them to you.
As a result, just sending text may not immediately yield speech. This is where
flush and finish come in.
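To make the buffering behavior concrete, here is a toy model of it. The rule it uses (release only complete sentences, hold back the unfinished tail) is an assumption chosen for illustration; the actual server decides how much context it needs in its own way. ToyBuffer and its members are hypothetical names.

```python
import re


class ToyBuffer:
    """Toy model of server-side buffering (assumed rule: emit complete sentences only)."""

    def __init__(self):
        self.pending = ""  # text received but not yet synthesized

    def append_text(self, text):
        """Add text and return the sentences that are now ready to synthesize."""
        self.pending += text
        # Split after sentence-ending punctuation that is followed by whitespace.
        parts = re.split(r"(?<=[.!?])\s+", self.pending)
        # The last piece may still be growing, so keep it buffered.
        self.pending = parts.pop()
        return parts


if __name__ == "__main__":
    buf = ToyBuffer()
    print(buf.append_text("Hello wor"))    # nothing is ready yet
    print(buf.append_text("ld. How are"))  # the first sentence is released
    print(repr(buf.pending))               # the tail stays buffered
```

Note that the tail stays in the buffer indefinitely in this model, which mirrors why sending text alone may not return all of the speech.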
Flushing the server buffer
Both flush and finish signal to the server that it should start synthesizing
speech from the text it has received so far, rather than waiting for more text
to fill out additional speech context.
Let’s say that you have some text you want synthesized, so you stream text to
the LMNT servers. If the server has not yet received enough text, it will buffer
the text and wait for more. Even if you are streaming enough text to get the
servers to start synthesizing speech and streaming it back to you, once you are
done sending text, the server will still retain some buffered text as it waits
for more. You must notify the server that you are done by sending either a
flush or a finish to receive that last chunk of speech.
Flush vs Finish
So what’s the difference between flush and finish?
Both flush and finish signal to the server that it should synthesize all the
text it currently has. However, finish also signals to the server that it
should close the connection after it has finished synthesizing.
As a result, flush is critical in cases where you are momentarily done sending
text and want speech returned to you, but you do not want the connection closed
yet. In applications where latency matters, you may not want to repeatedly
incur the cost of setting up a new WebSocket connection. Calling flush lets you
keep a connection open while controlling when the server synthesizes its
buffered text.
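The difference between the two calls can be modeled in a few lines. This is a sketch of the semantics described above, not of the real client: ToyConnection is a hypothetical class, and the returned string stands in for synthesized audio.

```python
class ToyConnection:
    """Toy model of flush vs finish semantics (hypothetical, not the LMNT SDK)."""

    def __init__(self):
        self.pending = ""
        self.open = True

    def append_text(self, text):
        if not self.open:
            raise RuntimeError("connection is closed")
        self.pending += text

    def flush(self):
        """Synthesize everything buffered and keep the connection open."""
        text, self.pending = self.pending, ""
        return f"<speech:{text}>"

    def finish(self):
        """Synthesize everything buffered, then close the connection."""
        speech = self.flush()
        self.open = False
        return speech


if __name__ == "__main__":
    conn = ToyConnection()
    conn.append_text("Hi there")
    print(conn.flush(), conn.open)   # speech returned, connection still usable
    conn.append_text("Bye")
    print(conn.finish(), conn.open)  # speech returned, connection closed
```

In this model, appending text after finish raises an error, while appending after flush simply starts a new buffer on the same connection.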
Let’s say that you are building a chatbot, and you implement it so that a single
connection is used throughout an entire conversation. When you want the bot to
speak, you send the text to LMNT and then call flush to force the server to
synthesize all of that text. The connection remains open, so the next time you
want to synthesize speech, you send more text to LMNT and call flush again. In
this scenario, you are ensuring that speech is returned to you as quickly as
possible.
Now consider the alternative, where you create a new connection for each message
and close the connection each time with finish. The chatbot would respond much
more slowly, because every message pays the connection setup cost again.
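A back-of-the-envelope model shows where the extra time goes. The function name and all numbers here are made-up placeholders for illustration, not measurements of LMNT; the only assumption is that each new connection adds a fixed setup cost on top of synthesis time.

```python
def total_latency_ms(turns, connect_ms, synth_ms, reuse_connection):
    """Total waiting time over a conversation under a simple additive model."""
    # One setup if the connection is reused (flush), one per turn otherwise (finish).
    connects = 1 if reuse_connection else turns
    return connects * connect_ms + turns * synth_ms


if __name__ == "__main__":
    # Placeholder values: 10 turns, 150 ms to connect, 300 ms to synthesize.
    print(total_latency_ms(10, 150, 300, reuse_connection=True))   # 3150
    print(total_latency_ms(10, 150, 300, reuse_connection=False))  # 4500
```

Under these placeholder values, the gap between the two strategies grows linearly with the number of turns.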
Make sure you call either flush or finish at the end of your text stream to ensure the server synthesizes all the speech you expect.