Text to Speech
Synthesizes text into speech and streams audio chunks as they are generated. The response body is binary audio, not a JSON envelope. Signed, silent provenance metadata is embedded in WAV, MP3, Ogg Opus, WebM Opus, and FLAC output. Raw PCM, mu-law, A-law, and HLS output do not carry provenance metadata.
Authorizations
Authenticate with Authorization: Bearer <token>. The service accepts JWTs, API keys, and guest tokens through this bearer token header.
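As a rough illustration of the bearer header, the sketch below builds a request without sending it. The endpoint URL, the POST method, and the `text` body key are assumptions made for the example; this reference does not state them.

```python
import json
import urllib.request

# Hypothetical endpoint: the real URL is not given in this reference.
TTS_URL = "https://api.example.com/v1/text-to-speech"


def build_request(token: str, body: dict, url: str = TTS_URL) -> urllib.request.Request:
    """Build (but do not send) a JSON request carrying a bearer token."""
    if not token:
        raise ValueError("a JWT, API key, or guest token is required")
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={
            # The same header is used for JWTs, API keys, and guest tokens.
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",  # assumed; the HTTP method is not documented here
    )


# "text" is an assumed field name used only for illustration.
req = build_request("my-token", {"text": "Hello"})
```

Passing `req` to `urllib.request.urlopen` would send it; the token type does not change how the header is formed.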
Body
JSON synthesis request. The response is streamed binary audio, not JSON.
Text to synthesize into speech.
Voice ID to use for synthesis. Use a built-in system voice, a registered custom voice, or leave the field blank.
Optional ISO 639 language code. Yoruba input is diacritized before synthesis.
Output audio format. Silent signed provenance metadata is available for WAV, MP3, Ogg Opus, WebM Opus, and FLAC, but not raw PCM, mu-law, A-law, or HLS.
Allowed values: wav, mp3, ogg_opus, webm_opus, flac, pcm_s16le, mulaw, alaw, m3u8
Output bitrate for compressed formats.
Allowed values: 32k, 48k, 64k, 96k, 128k, 192k
Output sample rate in hertz. Only 24000 is accepted.
Allowed value: 24000
Speech speed multiplier.
Allowed range: 0.7 <= x <= 1.2
Response
Chunked streaming audio. The media type depends on the requested format; formats without a mapped media type are returned as application/octet-stream.
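Because the body arrives in chunks, a client can write audio to disk as it is received instead of buffering the whole response. The following is a minimal sketch: the media-type-to-extension table is an assumption based on common audio media types, not the service's documented mapping, and `.bin` is an arbitrary fallback for `application/octet-stream`.

```python
import io
from typing import BinaryIO

# Assumed mapping from common audio media types to file extensions.
# The service's actual format-to-media-type mapping is not documented here.
EXTENSIONS = {
    "audio/wav": ".wav",
    "audio/mpeg": ".mp3",
    "audio/ogg": ".ogg",
    "audio/webm": ".webm",
    "audio/flac": ".flac",
}


def extension_for(content_type: str) -> str:
    """Pick a file extension, falling back to .bin for unmapped types."""
    media_type = content_type.split(";", 1)[0].strip().lower()
    return EXTENSIONS.get(media_type, ".bin")


def copy_stream(source: BinaryIO, sink: BinaryIO, chunk_size: int = 4096) -> int:
    """Copy audio chunks as they arrive; return the number of bytes written."""
    total = 0
    while True:
        chunk = source.read(chunk_size)
        if not chunk:  # an empty read marks the end of the stream
            break
        sink.write(chunk)
        total += len(chunk)
    return total


# Demo with an in-memory stream standing in for the HTTP response body.
fake_response = io.BytesIO(b"\x00" * 10_000)
out = io.BytesIO()
written = copy_stream(fake_response, out)
```

With the standard library, the object returned by `urllib.request.urlopen` is file-like and can be passed as `source`, and `response.headers.get("Content-Type")` supplies the value for `extension_for`.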
The response body is a binary file.
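The constraints listed under Body can also be checked on the client before a request is sent. The sketch below is one way to do that. The JSON key names (`text`, `voice`, `language`, `format`, `bitrate`, `sample_rate`, `speed`) are hypothetical, since this reference lists the fields' descriptions but not their names; the allowed values themselves are taken from the descriptions above.

```python
from typing import Optional

FORMATS = {"wav", "mp3", "ogg_opus", "webm_opus", "flac",
           "pcm_s16le", "mulaw", "alaw", "m3u8"}
# Formats that carry signed, silent provenance metadata.
PROVENANCE_FORMATS = {"wav", "mp3", "ogg_opus", "webm_opus", "flac"}
BITRATES = {"32k", "48k", "64k", "96k", "128k", "192k"}
SAMPLE_RATE = 24000  # the only accepted sample rate


def carries_provenance(fmt: str) -> bool:
    """True if output in this format embeds provenance metadata."""
    return fmt in PROVENANCE_FORMATS


def build_body(text: str, fmt: str, *, voice: str = "",
               language: Optional[str] = None,
               bitrate: Optional[str] = None,
               speed: Optional[float] = None) -> dict:
    """Validate parameters and return a request body (key names assumed)."""
    if fmt not in FORMATS:
        raise ValueError(f"unsupported format: {fmt!r}")
    if bitrate is not None and bitrate not in BITRATES:
        raise ValueError(f"unsupported bitrate: {bitrate!r}")
    if speed is not None and not 0.7 <= speed <= 1.2:
        raise ValueError("speed must be between 0.7 and 1.2")
    body = {"text": text, "voice": voice, "format": fmt,
            "sample_rate": SAMPLE_RATE}
    # Optional fields are included only when the caller sets them.
    if language is not None:
        body["language"] = language
    if bitrate is not None:
        body["bitrate"] = bitrate
    if speed is not None:
        body["speed"] = speed
    return body


body = build_body("Hello", "mp3", bitrate="64k", speed=1.1)
```

A caller that needs provenance metadata can check `carries_provenance(fmt)` before choosing a raw format such as `pcm_s16le` or the HLS option `m3u8`.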