Text-to-Speech | sync. labs

POST /v2/tts synthesizes speech from a script and returns a hosted audio URL. Unlike the built-in TTS lip sync flow, this endpoint is standalone: it does not run lip sync. You get back a stable url for the synthesized audio, which you can preview, store, or — most usefully — reuse as an audio input in POST /v2/generate to lip sync that exact take onto a video.

Request

Send a JSON body with the script and the voice to synthesize with.

script

string, required

The text to synthesize into speech.

voiceId

string, required

A voice id to synthesize with — either an ElevenLabs voice id (discover via GET /v2/voices) or the id of a voice cloned via POST /v2/voices.

stability

double

Voice stability (0–1). Higher is more consistent, lower is more expressive.

similarityBoost

double

How closely the synthesized audio matches the original voice (0–1).

Response

A 200 response returns the synthesized take.

id

string, required

A unique identifier for the synthesized audio.

url

string, required

The hosted URL of the synthesized audio. Reuse it as an audio input in POST /v2/generate to keep the same take across generations.

duration

double, required

Duration of the synthesized audio in seconds.

Synthesize speech

At minimum, send a script and a voiceId; stability and similarityBoost are optional. The curl example below is the source of truth for the request shape.

curl -X POST https://api.sync.so/v2/tts \
  -H "x-api-key: $SYNC_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "script": "Hey there. I wanted to walk you through our latest features.",
    "voiceId": "EXAVITQu4vr4xnSDxMaL",
    "stability": 0.5,
    "similarityBoost": 0.75
  }'

A successful response looks like this:

{
  "id": "6533643b-aceb-4c40-967e-d9ba9baac39e",
  "url": "https://assets.sync.so/docs/example-tts.mp3",
  "duration": 2.1
}

voiceId accepts any ElevenLabs voice id — list the voices available to your organization with GET /v2/voices — or the id of a voice you cloned via POST /v2/voices. Voice ids are case-sensitive.

Synthesize, then lip sync

The standalone endpoint is most powerful as the first half of a two-step flow: synthesize a take with /v2/tts, then pass the returned url as an audio input to /v2/generate to lip sync it onto a video.

Synthesize the take

Call POST /v2/tts and keep the returned url. This is your hosted audio take.

Lip sync it onto a video

Pass that url as an audio input in POST /v2/generate, alongside your video input.

Poll for completion

Poll GET /v2/generate/{id} until status is COMPLETED. The outputUrl field then contains the lip-synced video.

# 1. Synthesize the take and capture the hosted url
TTS_URL=$(curl -s -X POST https://api.sync.so/v2/tts \
  -H "x-api-key: $SYNC_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "script": "Hey there. I wanted to walk you through our latest features.",
    "voiceId": "EXAVITQu4vr4xnSDxMaL"
  }' | jq -r '.url')

# 2. Lip sync the synthesized take onto a video
curl -X POST https://api.sync.so/v2/generate \
  -H "x-api-key: $SYNC_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "lipsync-2",
    "input": [
      { "type": "video", "url": "https://assets.sync.so/docs/example-video.mp4" },
      { "type": "audio", "url": "'"$TTS_URL"'" }
    ],
    "options": { "sync_mode": "cut_off" }
  }'
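
The commands above cover steps 1 and 2 only. Step 3 might be sketched as the loop below. It assumes the status response carries a top-level status field, as the polling step describes. The helper names, the FAILED and REJECTED terminal states, and the default interval of 5 seconds are assumptions for illustration, not values confirmed by this page.

```shell
# Hypothetical helper: print the current status of generation $1.
# Assumes GET /v2/generate/{id} returns JSON with a top-level "status" field.
generation_status() {
  curl -s "https://api.sync.so/v2/generate/$1" \
    -H "x-api-key: $SYNC_API_KEY" | jq -r '.status'
}

# Poll until the status is COMPLETED.
#   $1 = generation id, $2 = max attempts (default 60), $3 = delay in seconds (default 5)
#   STATUS_CMD names the command that prints the status; override it for testing.
# Returns 0 on COMPLETED, 1 on an assumed failure state, 2 when attempts run out.
poll_generation() {
  id="$1"; max="${2:-60}"; delay="${3:-5}"
  attempt=1
  while [ "$attempt" -le "$max" ]; do
    status=$("${STATUS_CMD:-generation_status}" "$id")
    case "$status" in
      COMPLETED) return 0 ;;
      FAILED|REJECTED) return 1 ;;   # assumed failure states; check the API reference
    esac
    sleep "$delay"
    attempt=$((attempt + 1))
  done
  return 2
}
```

Call it as poll_generation "$GENERATION_ID", where GENERATION_ID is the id returned by the POST /v2/generate call, then fetch the generation once more to read outputUrl.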

The generation response echoes the synthesized take as synthesizedAudioUrl, so you can reuse the exact same audio across multiple generations without re-synthesizing. This is also present when you submit a text input directly to /v2/generate — see the built-in TTS lip sync flow.

Quotas and rate limits

Free-tier API keys share a monthly ElevenLabs allowance of 10 synthesis operations across TTS and dubbing combined. Paid plans are billed per use. See the Billing page for details.

POST /v2/tts is rate limited to 60 requests per minute per key. Exceeding the limit returns a 429 — back off and retry. See Rate Limiting for the recommended retry strategy.
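
One way to back off and retry is sketched below. It assumes a simple exponential schedule that doubles the wait after each 429; the Rate Limiting page may recommend a different strategy. The helper names, the default of 5 attempts, and the 1-second initial delay are hypothetical.

```shell
# Hypothetical helper: POST a JSON body ($1) to /v2/tts.
# Prints the response body, then the HTTP status code on its own last line.
tts_request() {
  curl -s -w '\n%{http_code}' -X POST https://api.sync.so/v2/tts \
    -H "x-api-key: $SYNC_API_KEY" \
    -H "Content-Type: application/json" \
    -d "$1"
}

# Retry while the request is rate limited (429), doubling the wait each time.
#   $1 = JSON body, $2 = max attempts (default 5), $3 = initial delay in whole seconds (default 1)
#   REQUEST_CMD names the command that performs the request; override it for testing.
# Prints the body of the first non-429 response and returns 0 (other errors are
# passed through to the caller). Returns 1 if every attempt was rate limited.
tts_with_backoff() {
  body="$1"; max="${2:-5}"; delay="${3:-1}"
  attempt=1
  while [ "$attempt" -le "$max" ]; do
    response=$("${REQUEST_CMD:-tts_request}" "$body")
    code=$(printf '%s\n' "$response" | tail -n 1)
    if [ "$code" != "429" ]; then
      printf '%s\n' "$response" | sed '$d'   # drop the status-code line
      return 0
    fi
    sleep "$delay"
    delay=$((delay * 2))
    attempt=$((attempt + 1))
  done
  return 1
}
```

For example, tts_with_backoff '{"script": "Hello.", "voiceId": "EXAVITQu4vr4xnSDxMaL"}' | jq -r '.url' prints the take's url once a request gets through.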

FAQ

What's the difference between /v2/tts and TTS lip sync?

POST /v2/tts only synthesizes audio — it returns a hosted url and does not touch video. The built-in TTS lip sync flow passes a text input to POST /v2/generate, which synthesizes the audio and runs lip sync in a single call. Use /v2/tts when you want to inspect, reuse, or store the synthesized take before (or independent of) lip syncing it.

How do I find a voiceId?

Call GET /v2/voices to list the voices available to your organization, including built-in ElevenLabs voices and any you have cloned. Each entry includes an id you can pass as voiceId. To create your own voice, see Voice Cloning.

Can I reuse the same take across multiple videos?

Yes. The url returned by /v2/tts is stable — pass it as an audio input to as many POST /v2/generate calls as you like. The generation response also echoes it as synthesizedAudioUrl, so you can recover the exact take from a completed generation without re-synthesizing.
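
As a rough illustration, the sketch below submits one generation per video, all pointing at the same take. The request body copies the shape of the generate example earlier on this page; the helper names are hypothetical, and the video urls you pass in are your own.

```shell
# Hypothetical helper: build a /v2/generate body for one video ($1) and one audio take ($2).
# The shape mirrors the generate example earlier on this page. URLs are spliced in
# as-is, so they must not contain double quotes or backslashes.
generate_payload() {
  printf '{"model":"lipsync-2","input":[{"type":"video","url":"%s"},{"type":"audio","url":"%s"}],"options":{"sync_mode":"cut_off"}}\n' "$1" "$2"
}

# Submit one generation per video, all reusing the same synthesized take.
#   $1 = take url returned by /v2/tts, remaining arguments = video urls
lipsync_take_onto() {
  take_url="$1"; shift
  for video_url in "$@"; do
    curl -s -X POST https://api.sync.so/v2/generate \
      -H "x-api-key: $SYNC_API_KEY" \
      -H "Content-Type: application/json" \
      -d "$(generate_payload "$video_url" "$take_url")"
    echo   # newline between responses
  done
}
```

With TTS_URL holding the url from a /v2/tts response, lipsync_take_onto "$TTS_URL" followed by two or more video urls starts one generation per video without re-synthesizing the audio.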

How do I handle longer scripts or multiple speakers?

Synthesize each section as a separate /v2/tts take, then assign the resulting audio urls to different time ranges with the Segments API. Each segment can reference a different audio input, which is how you build multi-speaker and long-form lip sync in a single generation.

Related

Voice Cloning: clone a custom voice and synthesize with its id.
Segments: assign different synthesized takes to different parts of the timeline.
Text-to-Speech Lip Sync Guide: synthesize and lip sync in a single /v2/generate call.
