Speech Synthesis Markup Language (SSML) is an XML-based markup language for speech synthesis applications. Use SSML for finer control over your speech audio. The officially supported SSML elements are described below.
<speak> is the root element of the SSML response. SSML text starts and ends with this element.
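As a minimal sketch of the root element, the following Python wraps plain text in <speak> and checks that the result is well-formed XML. The helper name `speak` is hypothetical and only the standard library is used; it does not call any speech service.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape


def speak(text: str) -> str:
    """Wrap plain text in the <speak> root element, escaping XML characters."""
    return f"<speak>{escape(text)}</speak>"


ssml = speak("Tom & Jerry start at 5.")
root = ET.fromstring(ssml)  # raises ParseError if the markup is malformed
print(ssml)
print(root.tag)
```

Escaping matters because characters such as & and < inside the text would otherwise make the SSML invalid.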
An empty element that controls pausing or other prosodic boundaries between words. Using <break> between any pair of tokens is optional. If this element is not present between words, the break is automatically determined based on the linguistic context.
|time||Sets the length of the break in seconds or milliseconds||1s, 500ms|
|strength||Sets the strength of the output's prosodic break in relative terms||x-weak, weak, medium, strong, x-strong|
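The two attributes above can be sketched as follows. The helper `break_tag` is a hypothetical name, and the strength values are the ones listed in the table; the code only builds and parses markup.

```python
import xml.etree.ElementTree as ET

# Strength values taken from the attribute table.
STRENGTHS = {"x-weak", "weak", "medium", "strong", "x-strong"}


def break_tag(time=None, strength=None):
    """Build an empty <break/> element with optional time and strength."""
    attrs = ""
    if time is not None:
        attrs += f' time="{time}"'
    if strength is not None:
        if strength not in STRENGTHS:
            raise ValueError(f"unknown strength: {strength}")
        attrs += f' strength="{strength}"'
    return f"<break{attrs}/>"


ssml = (
    "<speak>Step one." + break_tag(time="500ms")
    + "Step two." + break_tag(strength="weak")
    + "Step three.</speak>"
)
breaks = ET.fromstring(ssml).findall("break")
print([b.attrib for b in breaks])
```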
This element lets you indicate information about the type of text construct that is contained within the element. It also helps specify the level of detail for rendering the contained text.
|interpret-as||Indicates the type of text construct that the contained text represents||cardinal, ordinal, characters, fraction, expletive, unit, verbatim, date, time, telephone|
|format||Optional attribute for text interpreted as a date or time||Date: y, m, d; Time: h, m, s, Z, 12, 24|
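A rough sketch of building this element in Python follows. The helper `say_as` is hypothetical, and the format string "ymd" is an assumed combination of the date letters listed above, not a value confirmed by this page.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape, quoteattr


def say_as(text, interpret_as, fmt=None):
    """Wrap text in <say-as>, with an optional format hint for dates and times."""
    attrs = f"interpret-as={quoteattr(interpret_as)}"
    if fmt is not None:
        attrs += f" format={quoteattr(fmt)}"
    return f"<say-as {attrs}>{escape(text)}</say-as>"


ssml = (
    "<speak>"
    + say_as("12345", "cardinal")
    + " was ordered on "
    + say_as("1960-09-10", "date", "ymd")  # "ymd" is an assumed format value
    + "</speak>"
)
elements = list(ET.fromstring(ssml).iter("say-as"))
print([(e.get("interpret-as"), e.text) for e in elements])
```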
<p> and <s> are the paragraph and sentence elements.
Use <s>...</s> tags to wrap full sentences, especially if they contain SSML elements that change prosody (that is, <audio>, <break>, <emphasis>, <par>, <prosody>, <say-as>, <seq>, and <sub>). If a break in speech is intended to be long enough that you can hear it, use <s>...</s> tags and put that break between sentences.
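The advice above, wrapping each sentence in <s> and placing an audible break between sentences rather than inside one, might look like this in Python. The helper `paragraph` and its `pause` parameter are hypothetical.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape


def paragraph(sentences, pause=None):
    """Wrap each sentence in <s> and, optionally, put a <break> between them."""
    wrapped = [f"<s>{escape(s)}</s>" for s in sentences]
    separator = f'<break time="{pause}"/>' if pause else ""
    return "<p>" + separator.join(wrapped) + "</p>"


ssml = "<speak>" + paragraph(["This is sentence one.", "This is sentence two."], pause="1s") + "</speak>"
p = ET.fromstring(ssml).find("p")
print([child.tag for child in p])
```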
Indicates that the text in the alias attribute value replaces the contained text for pronunciation. You can also use the sub element to provide a simplified pronunciation of a difficult-to-read word. The last example below demonstrates this use case in Japanese.
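As an illustration of the alias substitution, here is a small Python sketch. The helper `sub` and the sample abbreviation are assumptions chosen for the example.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape, quoteattr


def sub(text, alias):
    """Build a <sub> element: `text` is displayed, `alias` is what gets spoken."""
    return f"<sub alias={quoteattr(alias)}>{escape(text)}</sub>"


ssml = "<speak>" + sub("W3C", "World Wide Web Consortium") + "</speak>"
element = ET.fromstring(ssml).find("sub")
print(element.text, "->", element.get("alias"))
```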
Used to customize the pitch, speaking rate, and volume of text contained by the element. Currently the rate, pitch, and volume attributes are supported. The rate and volume attributes can be set according to the W3C specification. The pitch attribute can be set in three ways: as a relative percentage, as a relative change in semitones, or as one of the labels low, medium, or high.
|rate||Speaking rate||non-negative percentage (e.g. "+10.0%"), x-slow, slow, medium, fast, x-fast, or default|
|pitch||Pitch||relative percentage (a number preceded by "+" or "-" and followed by "%", e.g. "-10%"), relative change (a number preceded by "+" or "-" and followed by "st", e.g. "+2st"), low, medium, high|
|volume||Volume||relative change (a number preceded by "+" or "-" and immediately followed by "dB", e.g. "+10dB"), silent, x-soft, soft, medium, loud, x-loud, or default|
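The three forms of the pitch value can be told apart with a simple check. The sketch below is an assumption about how one might validate them before building the element; `pitch_kind` and `prosody` are hypothetical helpers, and the regular expressions follow the descriptions in the table rather than any official grammar.

```python
import re
from xml.sax.saxutils import escape

PITCH_LABELS = {"low", "medium", "high"}
PITCH_PERCENT = re.compile(r"[+-]\d+(?:\.\d+)?%")     # e.g. "-10%"
PITCH_SEMITONES = re.compile(r"[+-]\d+(?:\.\d+)?st")  # e.g. "+2st"


def pitch_kind(value):
    """Classify a pitch value as 'percentage', 'semitones', or 'label'."""
    if PITCH_PERCENT.fullmatch(value):
        return "percentage"
    if PITCH_SEMITONES.fullmatch(value):
        return "semitones"
    if value in PITCH_LABELS:
        return "label"
    raise ValueError(f"unrecognised pitch value: {value}")


def prosody(text, rate=None, pitch=None, volume=None):
    """Build a <prosody> element from whichever attributes are given."""
    if pitch is not None:
        pitch_kind(pitch)  # raises ValueError on an unrecognised form
    pairs = (("rate", rate), ("pitch", pitch), ("volume", volume))
    attrs = "".join(f' {name}="{value}"' for name, value in pairs if value is not None)
    return f"<prosody{attrs}>{escape(text)}</prosody>"


print(prosody("Slow and higher.", rate="slow", pitch="+2st"))
```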
Used to add or remove emphasis from text contained by the element. The <emphasis> element modifies speech similarly to <prosody>, but without the need to set individual speech attributes. These tags should only be used around a full sentence. Enclosing words within a sentence may cause unwanted pauses in speech.
|level||Requests that the contained text be spoken with emphasis (also referred to as prominence or stress)||strong, moderate, none, reduced|
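A short sketch of emphasis applied to a whole sentence, as recommended above. The helper `emphasis` is hypothetical; the attribute name `level` is taken from the W3C SSML specification, and the allowed values are the ones listed in the table.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

# Emphasis values taken from the attribute table.
LEVELS = {"strong", "moderate", "none", "reduced"}


def emphasis(sentence, level="moderate"):
    """Wrap a full sentence in <emphasis> with the given level."""
    if level not in LEVELS:
        raise ValueError(f"unknown emphasis level: {level}")
    return f'<emphasis level="{level}">{escape(sentence)}</emphasis>'


ssml = "<speak>" + emphasis("This is an important announcement.", "strong") + "</speak>"
element = ET.fromstring(ssml).find("emphasis")
print(element.get("level"), "|", element.text)
```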
Smart watches and other wearables, like Apple Watch, Wear OS watch
Smartphones, like Google Pixel, Samsung Galaxy, Apple iPhone
Earbuds or headphones for audio playback, like Sennheiser headphones
Small home speakers, like Google Home Mini
Smart home speakers, like Google Home
Home entertainment systems or smart TVs, like Google Home Max, LG TV
Interactive Voice Response (IVR) systems