How to use the speech session API to stream text to the server and receive synthesized speech in real time.
1. Get an API key (Create new secret key, and copy key).
2. Create a session with the sessions.create method.
3. Send text with appendText / append_text and concurrently read synthesized speech from the server.
4. Call flush or finish to tell the server to synthesize speech for all the text it has still buffered.
5. Call close.

When you send text with appendText, the server will not
synthesize any speech until enough text has been received. We do this to gather
as much context as possible so that we can generate more natural-sounding
speech. The emotion and style in which a portion of text is spoken can vary
according to the entire context of a sentence, so the server will wait for
additional text as appropriate. Once the server has enough text buffered, it
will synthesize speech segments and return them to you.
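
The buffering behavior described above can be pictured with a small toy model. This is only a sketch: the `ToyBuffer` class is hypothetical, and the rule it uses (a segment is ready once a sentence ends and the next one has started) is an assumption for illustration, not LMNT's actual server logic.

```python
import re


class ToyBuffer:
    """Hypothetical stand-in for server-side buffering; not LMNT's real logic."""

    def __init__(self):
        self.pending = ""

    def append_text(self, text):
        """Buffer text and return any segments that are ready to synthesize."""
        self.pending += text
        # Assumption: a segment is ready once it ends in sentence punctuation
        # followed by whitespace, meaning the next sentence has begun.
        parts = re.split(r"(?<=[.!?])\s+", self.pending)
        # The last piece may still be incomplete, so keep it buffered.
        self.pending = parts.pop()
        return parts


buf = ToyBuffer()
first = buf.append_text("Hello there. How are")
second = buf.append_text(" you today? I am")
print(first, second, repr(buf.pending))
```

In this model the trailing "I am" is never returned by `append_text` alone, which mirrors why sending text does not always yield speech right away.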
As a result, just sending text may not immediately yield speech. This is where
flush and finish come in.
flush and finish are used to signal to the server that it should start
synthesizing speech with the text it has received so far. They tell the server
not to wait for any further text to fill out additional speech context.
Let’s say that you have some text you want synthesized, so you stream text to
the LMNT servers. If the server has not yet received enough text, it will buffer
the text and wait for more. Even if you are streaming enough text to get the
servers to start synthesizing speech and streaming it back to you, once you are
done sending text, the server will still retain some buffered text as it waits
for more. You must notify the server that you are done by sending either a
flush or a finish to receive that last chunk of speech.
What is the difference between flush and finish?
flush and finish both signal to the server that it should synthesize all the
text it currently has. However, finish also signals to the server that it
should close the connection after it has finished synthesizing.
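
As a rough illustration of that difference, here is a toy session written in Python. The `ToySession` class is hypothetical: its method names mirror the ones in this guide, but the behavior is a simplified assumption and no real network connection or SDK call is involved.

```python
class ToySession:
    """Hypothetical model of a session; behavior is assumed, not LMNT's SDK."""

    def __init__(self):
        self.buffered = ""
        self.open = True

    def append_text(self, text):
        """Add text to the buffer; only allowed while the session is open."""
        if not self.open:
            raise RuntimeError("session is closed")
        self.buffered += text

    def flush(self):
        """Hand back everything buffered; the session stays open."""
        segment, self.buffered = self.buffered, ""
        return segment

    def finish(self):
        """Hand back everything buffered, then close the session."""
        segment = self.flush()
        self.open = False
        return segment


session = ToySession()
session.append_text("First reply")
a = session.flush()      # buffer drained, session still usable
session.append_text("Second reply")
b = session.finish()     # buffer drained, session now closed
print(a, b, session.open)
```

After `finish`, any further `append_text` call on this toy session raises an error, whereas after `flush` the same session keeps accepting text.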
As a result, flush is critical in cases where you are momentarily done with
sending text and want speech returned to you, but you do not want the connection
closed yet. In applications where latency matters, you may not want to
repeatedly incur the latency cost of setting up a new websocket connection.
flush allows you to keep a connection open while controlling when the server
should synthesize its text.
Let’s say that you are building a chatbot, and you implement it so that a single
connection is used throughout an entire conversation. When you want the bot to
speak, you send the text to LMNT and then call flush to force the server to
synthesize all of that text. The connection remains open, so the next time you
want to synthesize speech, you send more text to LMNT and call flush again. In
this scenario, you are ensuring that speech is returned to you as quickly as
possible.
Now consider the alternative, where you create a new connection for each message
and close the connection each time with finish. The chatbot would respond much
more slowly.
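
To make the comparison concrete, the back-of-the-envelope sketch below adds up connection setup overhead for both designs. The function `total_setup_ms` is hypothetical, and the figures (10 messages, 150 ms per connection setup) are assumed purely for illustration; they are not measurements of LMNT's service.

```python
def total_setup_ms(num_messages, setup_ms, reuse_connection):
    """Total connection-setup overhead across one conversation."""
    if reuse_connection:
        # One connection for the whole conversation, flush after each message.
        connections = min(1, num_messages)
    else:
        # A new connection per message, each closed with finish.
        connections = num_messages
    return connections * setup_ms


# Assumed, illustrative numbers only.
reused = total_setup_ms(10, 150, reuse_connection=True)
per_message = total_setup_ms(10, 150, reuse_connection=False)
print(reused, per_message)
```

Under these assumed numbers the per-message design pays the setup cost ten times, and every one of those payments lands right before the bot is supposed to start speaking.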
Always call flush or finish at the end of your text stream to ensure the server synthesizes all the speech you expected.