Text to Speech

Authorizations

Authorization

string

header

required

Send the token in the Authorization header as Authorization: Bearer <token>. JWTs, API keys, and guest tokens are all accepted as the bearer token.
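As a minimal sketch of the header described above, the helper below builds the request headers for a JSON synthesis call. The function name `build_headers` and the placeholder token are illustrative assumptions, not part of the service.

```python
def build_headers(token: str) -> dict[str, str]:
    """Return headers for a JSON synthesis request (hypothetical helper)."""
    # Reject empty tokens and tokens with stray whitespace, which would
    # produce a malformed Authorization header.
    if not token or token != token.strip():
        raise ValueError("token must be a non-empty string without surrounding whitespace")
    return {
        "Authorization": f"Bearer {token}",
        # The request body is documented as application/json.
        "Content-Type": "application/json",
    }


# "example-token" is a placeholder, not a real credential.
headers = build_headers("example-token")
```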

Body

application/json

JSON synthesis request. The response is streamed binary audio, not JSON.

text

string

required

Text to synthesize into speech.

voice

string

required

Voice ID to use for synthesis. Pass the ID of a built-in system voice or a registered custom voice, or leave the value blank.

language

string

Optional ISO 639 language code. Yoruba input is diacritized before synthesis.

format

enum<string>

default: mp3

Output audio format. Silent signed provenance metadata is available for WAV, MP3, Ogg Opus, WebM Opus, and FLAC, but not raw PCM, mu-law, A-law, or HLS.

Available options: wav, mp3, ogg_opus, webm_opus, flac, pcm_s16le, mulaw, alaw, m3u8
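The split between formats with and without provenance metadata can be sketched as a small lookup. This assumes that pcm_s16le, mulaw, alaw, and m3u8 are the option values for raw PCM, mu-law, A-law, and HLS respectively; the helper name `supports_provenance` is hypothetical.

```python
# Formats this reference lists as carrying silent signed provenance metadata.
PROVENANCE_FORMATS = frozenset({"wav", "mp3", "ogg_opus", "webm_opus", "flac"})

# Formats this reference lists as not carrying it (assumed mapping:
# raw PCM, mu-law, A-law, HLS).
NO_PROVENANCE_FORMATS = frozenset({"pcm_s16le", "mulaw", "alaw", "m3u8"})

ALL_FORMATS = PROVENANCE_FORMATS | NO_PROVENANCE_FORMATS


def supports_provenance(fmt: str) -> bool:
    """Return True if the format is listed as carrying provenance metadata."""
    if fmt not in ALL_FORMATS:
        raise ValueError(f"unknown format: {fmt!r}")
    return fmt in PROVENANCE_FORMATS
```

A caller that needs provenance metadata could check `supports_provenance(fmt)` before sending the request and fall back to one of the supported formats.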

bitrate

enum<string>

default: 128k

Output bitrate for compressed formats.

Available options: 32k, 48k, 64k, 96k, 128k, 192k

sample_rate

enum<integer>

default: 24000

Output sample rate in hertz. Only 24000 is accepted.

Available options: 24000

speed

number

default: 1

Speech speed multiplier.

Required range: 0.7 <= x <= 1.2
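Taken together, the body fields above could be validated client-side before sending. The following is a minimal sketch, not the service's own validation: the function name `build_tts_payload` and the voice ID "example-voice" are hypothetical, and the checks only mirror the options and ranges listed in this reference.

```python
import json

FORMATS = ("wav", "mp3", "ogg_opus", "webm_opus", "flac",
           "pcm_s16le", "mulaw", "alaw", "m3u8")
BITRATES = ("32k", "48k", "64k", "96k", "128k", "192k")


def build_tts_payload(text, voice, *, language=None, format="mp3",
                      bitrate="128k", sample_rate=24000, speed=1.0):
    """Build the JSON body for a synthesis request, using documented defaults."""
    if not text:
        raise ValueError("text is required")
    # voice is required but may be blank according to this reference,
    # so only its type is checked here.
    if not isinstance(voice, str):
        raise ValueError("voice must be a string")
    if format not in FORMATS:
        raise ValueError(f"unsupported format: {format!r}")
    if bitrate not in BITRATES:
        raise ValueError(f"unsupported bitrate: {bitrate!r}")
    if sample_rate != 24000:
        raise ValueError("only a sample_rate of 24000 is accepted")
    if not 0.7 <= speed <= 1.2:
        raise ValueError("speed must satisfy 0.7 <= speed <= 1.2")

    payload = {
        "text": text,
        "voice": voice,
        "format": format,
        "bitrate": bitrate,
        "sample_rate": sample_rate,
        "speed": speed,
    }
    # language is optional, so it is included only when provided.
    if language is not None:
        payload["language"] = language
    return payload


# "example-voice" is a placeholder voice ID.
body = json.dumps(build_tts_payload("Hello, world.", "example-voice", speed=1.1))
```

Failing fast on an out-of-range speed or an unsupported sample rate avoids a round trip for a request that would not match the documented constraints.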

Response

Chunked streaming audio. The media type depends on the requested format; formats without a mapped media type are returned as application/octet-stream.

The response is of type file.
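Because the response is chunked binary audio rather than JSON, a client would typically copy it to a file or player as it arrives instead of buffering it whole. The sketch below shows that pattern against any readable binary stream; the helper name `save_audio_stream` is hypothetical, and an in-memory buffer stands in for a real response body.

```python
import io
from typing import BinaryIO


def save_audio_stream(source: BinaryIO, destination: BinaryIO,
                      chunk_size: int = 8192) -> int:
    """Copy a binary stream to a destination in chunks; return bytes written."""
    total = 0
    while True:
        chunk = source.read(chunk_size)
        if not chunk:  # an empty read signals the end of the stream
            break
        destination.write(chunk)
        total += len(chunk)
    return total


# Stand-in for a streamed response body; real audio bytes would come
# from the HTTP response object instead.
fake_response = io.BytesIO(b"\x00" * 20000)
output = io.BytesIO()
written = save_audio_stream(fake_response, output)
```

With the standard library, the object returned by `urllib.request.urlopen` exposes the same `read(size)` method, so it could be passed as `source` with a file opened in `"wb"` mode as `destination`.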