Text-to-Speech

POST /v2/tts synthesizes speech from a script and returns a hosted audio URL. Unlike the built-in TTS lip sync flow, this endpoint is standalone: it does not run lip sync. You get back a stable url for the synthesized audio, which you can preview, store, or — most usefully — reuse as an audio input in POST /v2/generate to lip sync that exact take onto a video.

Request

Send a JSON body with the script and the voice to synthesize with.

script
string, required

The text to synthesize into speech.

voiceId
string, required

A voice id to synthesize with — either an ElevenLabs voice id (discover via GET /v2/voices) or the id of a voice cloned via POST /v2/voices.

stability
double

Voice stability (0–1). Higher is more consistent, lower is more expressive.

similarityBoost
double

How closely the synthesized audio matches the original voice (0–1).
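
As a rough illustration of the request fields above, the following Python sketch builds the JSON body and checks the documented constraints on the client side. The function name `build_tts_payload` is hypothetical, and dropping unset optional fields from the body is an assumption about how the API treats omitted values.

```python
from typing import Optional


def build_tts_payload(
    script: str,
    voice_id: str,
    stability: Optional[float] = None,
    similarity_boost: Optional[float] = None,
) -> dict:
    """Build a JSON-serializable body for POST /v2/tts (hypothetical helper)."""
    if not script:
        raise ValueError("script is required")
    if not voice_id:
        raise ValueError("voiceId is required")

    payload = {"script": script, "voiceId": voice_id}

    # Optional tuning fields are only included when the caller sets them
    # (assumption: the API applies its own defaults when they are omitted).
    optional = (("stability", stability), ("similarityBoost", similarity_boost))
    for name, value in optional:
        if value is None:
            continue
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be between 0 and 1, got {value}")
        payload[name] = value
    return payload
```

Validating the 0–1 range locally catches an out-of-range value before it costs a request against the rate limit.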

Response

A 200 response returns the synthesized take.

id
string, required

A unique identifier for the synthesized audio.

url
string, required

The hosted URL of the synthesized audio. Reuse it as an audio input in POST /v2/generate to keep the same take across generations.

duration
double, required

Duration of the synthesized audio in seconds.
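
One way to handle the response on the client is to parse it into a small typed record and fail early if a required field is absent. This is a minimal sketch: `TtsTake` and `parse_tts_response` are hypothetical names, not part of any SDK.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class TtsTake:
    """The three required fields of a /v2/tts response."""
    id: str
    url: str
    duration: float  # seconds


def parse_tts_response(body: str) -> TtsTake:
    """Parse a raw JSON response body into a TtsTake (hypothetical helper)."""
    data = json.loads(body)
    missing = [key for key in ("id", "url", "duration") if key not in data]
    if missing:
        raise ValueError(f"TTS response is missing fields: {missing}")
    return TtsTake(id=data["id"], url=data["url"], duration=float(data["duration"]))
```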

Synthesize speech

At minimum, send a script and a voiceId; stability and similarityBoost are optional. The curl example below shows the full request shape.

curl -X POST https://api.sync.so/v2/tts \
  -H "x-api-key: $SYNC_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "script": "Hey there. I wanted to walk you through our latest features.",
    "voiceId": "EXAVITQu4vr4xnSDxMaL",
    "stability": 0.5,
    "similarityBoost": 0.75
  }'

A successful response looks like this:

{
  "id": "6533643b-aceb-4c40-967e-d9ba9baac39e",
  "url": "https://assets.sync.so/docs/example-tts.mp3",
  "duration": 2.1
}

voiceId accepts any ElevenLabs voice id — list the voices available to your organization with GET /v2/voices — or the id of a voice you cloned via POST /v2/voices. Voice ids are case-sensitive.
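
For readers calling the endpoint from Python rather than curl, the same request might look like the sketch below, using only the standard library. The endpoint, header names, and body fields come from the curl example; the function names and the injectable `opener` parameter (there so the call can be exercised without network access) are assumptions of this sketch.

```python
import json
import urllib.request

TTS_ENDPOINT = "https://api.sync.so/v2/tts"


def make_tts_request(api_key: str, script: str, voice_id: str) -> urllib.request.Request:
    """Build the POST request with the same headers and body as the curl example."""
    body = json.dumps({"script": script, "voiceId": voice_id}).encode("utf-8")
    return urllib.request.Request(
        TTS_ENDPOINT,
        data=body,
        method="POST",
        headers={"x-api-key": api_key, "Content-Type": "application/json"},
    )


def synthesize(api_key: str, script: str, voice_id: str,
               opener=urllib.request.urlopen) -> dict:
    """Send the request and return the decoded JSON response.

    With the default opener, non-2xx responses (including 429) raise
    urllib.error.HTTPError, which the caller is expected to handle.
    """
    request = make_tts_request(api_key, script, voice_id)
    with opener(request) as response:
        return json.loads(response.read().decode("utf-8"))
```

A typical call would be `synthesize(os.environ["SYNC_API_KEY"], "Hey there.", "EXAVITQu4vr4xnSDxMaL")["url"]`, with `os` imported by the caller.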

Synthesize, then lip sync

The standalone endpoint is most powerful as the first half of a two-step flow: synthesize a take with /v2/tts, then pass the returned url as an audio input to /v2/generate to lip sync it onto a video.

1. Synthesize the take

Call POST /v2/tts and keep the returned url. This is your hosted audio take.

2. Lip sync it onto a video

Pass that url as an audio input in POST /v2/generate, alongside your video input.

3. Poll for completion

Poll GET /v2/generate/{id} until status is COMPLETED. The lip-synced video is then available at outputUrl.

# 1. Synthesize the take and capture the hosted url
TTS_URL=$(curl -s -X POST https://api.sync.so/v2/tts \
  -H "x-api-key: $SYNC_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "script": "Hey there. I wanted to walk you through our latest features.",
    "voiceId": "EXAVITQu4vr4xnSDxMaL"
  }' | jq -r '.url')

# 2. Lip sync the synthesized take onto a video
curl -X POST https://api.sync.so/v2/generate \
  -H "x-api-key: $SYNC_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "lipsync-2",
    "input": [
      { "type": "video", "url": "https://assets.sync.so/docs/example-video.mp4" },
      { "type": "audio", "url": "'"$TTS_URL"'" }
    ],
    "options": { "sync_mode": "cut_off" }
  }'
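
The curl commands cover steps 1 and 2 but not the polling step. The Python sketch below shows one possible shape for the generate body and a polling loop. The body mirrors the curl example; the helper names, the default interval and timeout, and the `fetch_generation` callable (which is assumed to perform GET /v2/generate/{id} and return the decoded JSON) are hypothetical. Only the COMPLETED status is documented here, so a production loop would also need to stop on whatever failure statuses the API reports.

```python
import time
from typing import Callable


def build_generate_payload(video_url: str, audio_url: str) -> dict:
    """Mirror the curl body: one video input plus the synthesized audio input."""
    return {
        "model": "lipsync-2",
        "input": [
            {"type": "video", "url": video_url},
            {"type": "audio", "url": audio_url},
        ],
        "options": {"sync_mode": "cut_off"},
    }


def wait_for_output(
    fetch_generation: Callable[[], dict],
    interval_s: float = 5.0,     # assumed polling interval
    timeout_s: float = 600.0,    # assumed upper bound on total wait
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> str:
    """Poll until status is COMPLETED, then return outputUrl."""
    deadline = clock() + timeout_s
    while True:
        generation = fetch_generation()
        if generation.get("status") == "COMPLETED":
            return generation["outputUrl"]
        if clock() >= deadline:
            raise TimeoutError("generation did not complete before the timeout")
        sleep(interval_s)
```

Passing `sleep` and `clock` as parameters keeps the loop testable without real delays.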

The generation response echoes the synthesized take as synthesizedAudioUrl, so you can reuse the exact same audio across multiple generations without re-synthesizing. The same field is also present when you submit a text input directly to /v2/generate; see the built-in TTS lip sync flow.
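
To reuse a take recovered from an earlier generation, a small helper can turn that response back into an audio input. This is only a sketch: it assumes synthesizedAudioUrl is a top-level field of the generation JSON, and `audio_input_from_generation` is a hypothetical name.

```python
def audio_input_from_generation(generation: dict) -> dict:
    """Build an audio input entry from a generation's synthesizedAudioUrl.

    Assumes the field sits at the top level of the generation response.
    """
    url = generation.get("synthesizedAudioUrl")
    if not url:
        raise ValueError("generation has no synthesizedAudioUrl")
    return {"type": "audio", "url": url}
```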

Quotas and rate limits

Free-tier API keys share a monthly ElevenLabs allowance of 10 synthesis operations across TTS and dubbing combined. Paid plans are billed per use. See the Billing page for details.

POST /v2/tts is rate limited to 60 requests per minute per key. Exceeding the limit returns a 429 — back off and retry. See Rate Limiting for the recommended retry strategy.
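
The Rate Limiting page is the authority on retry strategy. As a generic illustration only, the sketch below retries on 429 with exponential backoff and jitter. The delays, attempt count, and the `send` callable (assumed to return a status code and decoded body) are all assumptions, not documented values.

```python
import random
import time
from typing import Callable, Tuple


def call_with_backoff(
    send: Callable[[], Tuple[int, dict]],
    max_attempts: int = 5,        # assumed retry budget
    base_delay_s: float = 1.0,    # assumed first delay
    max_delay_s: float = 30.0,    # assumed cap on any single delay
    sleep: Callable[[float], None] = time.sleep,
    jitter: Callable[[], float] = random.random,
) -> Tuple[int, dict]:
    """Call send() until it returns something other than 429.

    Any non-429 result, success or error, is returned to the caller as is.
    """
    for attempt in range(max_attempts):
        status, body = send()
        if status != 429:
            return status, body
        if attempt == max_attempts - 1:
            break
        # The delay doubles each attempt, is capped, then scaled into [50%, 100%].
        delay = min(max_delay_s, base_delay_s * 2 ** attempt)
        sleep(delay * (0.5 + 0.5 * jitter()))
    raise RuntimeError(f"still rate limited after {max_attempts} attempts")
```

The jitter spreads retries from several workers sharing one key so they do not all hit the per-minute limit again at the same moment.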

FAQ

How is POST /v2/tts different from the built-in TTS lip sync flow?

POST /v2/tts only synthesizes audio: it returns a hosted url and does not touch video. The built-in TTS lip sync flow passes a text input to POST /v2/generate, which synthesizes the audio and runs lip sync in a single call. Use /v2/tts when you want to inspect, reuse, or store the synthesized take before, or independently of, lip syncing it.

How do I find a voiceId?

Call GET /v2/voices to list the voices available to your organization, including built-in ElevenLabs voices and any you have cloned. Each entry includes an id you can pass as voiceId. To create your own voice, see Voice Cloning.

Can I reuse the same take across multiple generations?

Yes. The url returned by /v2/tts is stable, so you can pass it as an audio input to as many POST /v2/generate calls as you like. The generation response also echoes it as synthesizedAudioUrl, so you can recover the exact take from a completed generation without re-synthesizing.

How do I use different takes for different parts of a video?

Synthesize each section as a separate /v2/tts take, then assign the resulting audio urls to different time ranges with the Segments API. Each segment can reference a different audio input, which is how you build multi-speaker and long-form lip sync in a single generation.
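
The first half of that workflow, one take per section, can be sketched as follows. The Segments API request shape is not covered on this page, so the sketch stops at a plain list of dictionaries whose keys (`audioUrl`, `start`, `end`) are hypothetical and are not Segments API field names. Placing the takes back to back using each take's duration is also an assumption; real time ranges depend on the video.

```python
from typing import Callable, List, Tuple


def plan_section_takes(
    sections: List[Tuple[str, str]],
    synthesize: Callable[[str, str], dict],
) -> List[dict]:
    """Synthesize one take per (script, voice_id) pair and lay them end to end.

    synthesize is assumed to call POST /v2/tts and return the decoded
    response containing "url" and "duration".
    """
    plan = []
    cursor = 0.0  # running start time in seconds
    for script, voice_id in sections:
        take = synthesize(script, voice_id)
        end = cursor + take["duration"]
        # Hypothetical keys for illustration, not the Segments API schema.
        plan.append({"audioUrl": take["url"], "start": cursor, "end": end})
        cursor = end
    return plan
```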