Choose (or create) a voice, send text, and receive lifelike speech spoken in that voice. It’s that simple.
Our AI models are trained on countless hours of speech data to automatically take delivery and prosody cues from punctuation, capitalization, and surrounding context. There’s no need to normalize your text or otherwise write it as you would pronounce it.
For instance, this text may result in speech that sounds flat:
This is the story of a bird… Did you know this bird lived on top of a mountain… This mountain was on an island.
By contrast, this text results in speech with the intonation and inflection expected from exclamation and question marks:
This is the story of a bird! Did you know this bird lived on top of a mountain? This mountain was on an island.
Conversational speech sounds different from narration, such as a documentary or audiobook. To control the speaking style, make a custom voice from source audio that matches your desired style.
Advanced Speech Controls
Variance in Speech Output & Unexpected Sounds
Occasionally, your synthesized speech will include unexpected sounds, such as breathing or static noise. These are artifacts of machine learning, which we’re reducing over time as our systems gather more training data.
If your speech doesn’t sound the way you’d like it to, try synthesizing it again. Every take is slightly different, so chances are the unexpected sounds won’t show up the next time. If you consistently get an error, there may be a bug, and we’d appreciate it if you shared the details with us on Discord.
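Because every take is slightly different, one practical approach is to synthesize the same text a few times and pick the take you like best. The sketch below is only an illustration of that idea: `collect_takes` and `fake_synthesize` are hypothetical names, not part of any real API, and `fake_synthesize` returns labelled bytes rather than audio so the example runs without credentials. In practice you would pass in whatever function performs your actual synthesis request.

```python
from itertools import count
from typing import Callable, List


def collect_takes(
    synthesize: Callable[[str], bytes], text: str, n_takes: int = 3
) -> List[bytes]:
    """Synthesize the same text n_takes times and return every take for review."""
    if n_takes < 1:
        raise ValueError("n_takes must be at least 1")
    return [synthesize(text) for _ in range(n_takes)]


# Hypothetical stand-in for a real synthesis call. It returns labelled bytes
# instead of audio so this sketch runs offline; swap in your own function.
_take_counter = count(1)


def fake_synthesize(text: str) -> bytes:
    return f"take {next(_take_counter)}: {text}".encode("utf-8")


takes = collect_takes(fake_synthesize, "This is the story of a bird!", n_takes=3)
for audio in takes:
    print(audio.decode("utf-8"))
```

Keeping all takes, rather than overwriting the previous one, lets you compare them side by side before choosing.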
You can also preserve a particular variation or take by specifying a