Documentation

Introduction

Speech Synthesis Markup Language (SSML) is an XML-based markup language for speech synthesis applications. Use SSML to customize your speech audio in more detail. The officially supported SSML elements are described below.

<speak>

The root element of an SSML response. All SSML text starts with <speak> and ends with </speak>.
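
As a minimal sketch of the rule above, the following document wraps plain text in the root element. The spoken sentence is only placeholder content.

```xml
<!-- Minimal SSML document: everything to be spoken sits inside <speak>. -->
<speak>
  Hello, and welcome to the demonstration.
</speak>
```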

<break>

An empty element that controls pausing or other prosodic boundaries between words. Using <break> between any pair of tokens is optional. If this element is not present between words, the break is automatically determined based on the linguistic context.

Attribute | Description | Value
time | Sets the length of the break in seconds or milliseconds | 1s, 500ms
strength | Sets the strength of the output's prosodic break in relative terms | x-weak, weak, medium, strong, x-strong
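
The two attributes could be used as in the sketch below, which takes its values from the table above. The sentences are placeholder content, and how long a given strength actually pauses depends on the synthesis engine.

```xml
<speak>
  Step one, take a deep breath.
  <!-- Explicit pause length, given in milliseconds. -->
  <break time="500ms"/>
  Step two, exhale.
  <!-- Relative pause strength instead of a fixed duration. -->
  <break strength="weak"/>
  Step three, relax.
</speak>
```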

<say-as>

This element lets you indicate information about the type of text construct that is contained within the element. It also helps specify the level of detail for rendering the contained text.

Attribute | Description | Value
interpret-as | Sets the type of text construct that the contained text represents | cardinal, ordinal, characters, fraction, expletive, unit, verbatim, date, time, telephone
format | Optional attribute for text interpreted as a date or time | Date: y, m, d. Time: h, m, s, Z, 12, 24
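
A short sketch of how these attributes might be combined is shown below. The interpret-as values come from the table above. Joining the field codes into strings such as "ymd" and "hms12" is an assumption based on common SSML engines, so check which format strings your engine accepts.

```xml
<speak>
  <!-- Read as a whole number rather than digit by digit. -->
  <say-as interpret-as="cardinal">12345</say-as>
  <!-- Read as "first". -->
  <say-as interpret-as="ordinal">1</say-as>
  <!-- Spelled out letter by letter. -->
  <say-as interpret-as="characters">can</say-as>
  <!-- Assumed format string: year, month, day. -->
  <say-as interpret-as="date" format="ymd">2024-10-02</say-as>
  <!-- Assumed format string: hours, minutes, seconds on a 12-hour clock. -->
  <say-as interpret-as="time" format="hms12">2:30pm</say-as>
</speak>
```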

<p>,<s>

<p> marks a paragraph and <s> marks a sentence.

Use <s>...</s> tags to wrap full sentences, especially if they contain SSML elements that change prosody (that is, <audio>, <break>, <emphasis>, <par>, <prosody>, <say-as>, <seq>, and <sub>). If a break in speech is intended to be long enough that you can hear it, use <s>...</s> tags and put that break between sentences.
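
The guidance above can be sketched as follows: each sentence is wrapped in its own <s> element, and the audible pause sits between two sentences rather than inside one. The text and the one-second pause are illustrative choices.

```xml
<speak>
  <p>
    <s>This is the first sentence of the paragraph.</s>
    <!-- The audible pause is placed between sentences. -->
    <break time="1s"/>
    <s>This is the second sentence.</s>
  </p>
  <p>
    <s>This sentence starts a new paragraph.</s>
  </p>
</speak>
```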

<sub>

Indicates that the text in the alias attribute value replaces the contained text for pronunciation. You can also use the sub element to provide a simplified pronunciation of a difficult-to-read word, a use case that arises in Japanese, for example.
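
Two illustrative uses are sketched below. The first expands an abbreviation. The second gives a kana reading for a Japanese place name whose kanji can be read more than one way. Both words were chosen for illustration only.

```xml
<speak>
  <!-- The alias is spoken in place of the abbreviation. -->
  <sub alias="World Wide Web Consortium">W3C</sub>
  <!-- The kana alias fixes the reading of the kanji. -->
  <sub alias="にっぽんばし">日本橋</sub>
</speak>
```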

<prosody>

Customizes the pitch, speaking rate, and volume of the text contained by the element. The rate, pitch, and volume attributes are currently supported. The rate and volume attributes can be set according to the W3C specification. The pitch attribute accepts three kinds of value: a relative percentage, a relative change in semitones, or one of the labels low, medium, and high.

Attribute | Description | Value
rate | Speaking rate | Non-negative percentage (e.g. "+10.0%"), x-slow, slow, medium, fast, x-fast, or default
pitch | Pitch | Relative percentage (a number preceded by "+" or "-" and followed by "%", e.g. "-10%"), relative change in semitones (a number preceded by "+" or "-" and followed by "st", e.g. "+2st"), low, medium, or high
volume | Volume | Relative change (a number preceded by "+" or "-" and immediately followed by "dB", e.g. "+10dB"), silent, x-soft, soft, medium, loud, x-loud, or default
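
The attributes can be combined on a single element, as in this sketch. The values follow the forms listed in the table above. The sentences are placeholders, and the audible result will vary with the voice and engine.

```xml
<speak>
  <!-- Named rate label plus a relative pitch change in semitones. -->
  <prosody rate="slow" pitch="-2st">Can you hear me now?</prosody>
  <!-- Relative pitch percentage plus a relative volume change in decibels. -->
  <prosody pitch="+10%" volume="+6dB">This sentence is higher and louder.</prosody>
  <!-- Named pitch and volume labels. -->
  <prosody pitch="low" volume="soft">This sentence is lower and quieter.</prosody>
</speak>
```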

<emphasis>

Used to add or remove emphasis from text contained by the element. The <emphasis> element modifies speech similarly to <prosody>, but without the need to set individual speech attributes. These tags should only be used around a full sentence. Enclosing words within a sentence may cause unwanted pauses in speech.

Attribute | Description | Value
level | Requests that the contained text be spoken with emphasis (also referred to as prominence or stress) | strong, moderate, none, reduced
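
A sketch of the element in use follows. It uses the attribute name level, which is how the W3C SSML specification names the attribute that takes these values, and it wraps whole sentences as recommended above. The sentences are placeholders.

```xml
<speak>
  <!-- Whole sentence spoken with strong emphasis. -->
  <emphasis level="strong">Please read the safety instructions first.</emphasis>
  <!-- Whole sentence spoken with reduced emphasis. -->
  <emphasis level="reduced">This part is less important.</emphasis>
</speak>
```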