Text-to-Speech
POST /v2/tts synthesizes speech from a script and returns a hosted audio URL. Unlike the built-in TTS lip sync flow, this endpoint is standalone: it does not run lip sync. You get back a stable url for the synthesized audio, which you can preview, store, or — most usefully — reuse as an audio input in POST /v2/generate to lip sync that exact take onto a video.
Request
Send a JSON body with the script and the voice to synthesize with.
The text to synthesize into speech.
A voice id to synthesize with — either an ElevenLabs voice id (discover via GET /v2/voices) or the id of a voice cloned via POST /v2/voices.
Voice stability (0–1). Higher is more consistent, lower is more expressive.
How closely the synthesized audio matches the original voice (0–1).
Response
A 200 response returns the synthesized take.
A unique identifier for the synthesized audio.
The hosted URL of the synthesized audio. Reuse it as an audio input in POST /v2/generate to keep the same take across generations.
Duration of the synthesized audio in seconds.
Synthesize speech
Lead with the script and a voiceId. The curl example below is the source of truth for the request shape.
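As a stand-in illustration in Python (not the authoritative request shape), the sketch below builds a POST /v2/tts request without sending it. Only `voiceId` and the 0 to 1 ranges come from this page; the base URL, the `x-api-key` header, and the body field names `script`, `stability`, and `similarityBoost` are assumptions to check against the API reference.

```python
import json
import urllib.request

# Assumptions (not confirmed by this page): the base URL, the auth header
# name, and every body field name except voiceId.
API_BASE = "https://api.example.com"  # hypothetical placeholder
API_KEY_HEADER = "x-api-key"          # hypothetical header name


def build_tts_request(script, voice_id, api_key, stability=None, similarity=None):
    """Build (but do not send) a POST /v2/tts request."""
    body = {"script": script, "voiceId": voice_id}
    optional = (("stability", stability), ("similarityBoost", similarity))
    for name, value in optional:
        if value is None:
            continue  # omit unset options so the server default applies
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be between 0 and 1, got {value}")
        body[name] = value
    return urllib.request.Request(
        url=f"{API_BASE}/v2/tts",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json", API_KEY_HEADER: api_key},
        method="POST",
    )


if __name__ == "__main__":
    req = build_tts_request("Hello there.", "YOUR_VOICE_ID", "YOUR_API_KEY", stability=0.5)
    print(req.get_method(), req.full_url)
    print(req.data.decode("utf-8"))
    # To send it: urllib.request.urlopen(req), then json.load() the response.
```

Validating the ranges client-side is a design choice of this sketch; it surfaces an out-of-range value before a request is spent on it.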
A successful response looks like this:
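A minimal sketch of reading that response in Python, assuming the three documented values (identifier, hosted URL, duration in seconds) arrive as JSON fields named `id`, `url`, and `duration`. Only `url` is named on this page; the other two names and the `TtsTake` type are hypothetical, and the sample values are placeholders rather than real API output.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class TtsTake:
    """One synthesized take (hypothetical helper type)."""
    id: str
    url: str
    duration: float  # seconds


def parse_tts_response(raw):
    """Parse a 200 response body from POST /v2/tts into a TtsTake."""
    data = json.loads(raw)
    # Field names other than "url" are assumptions; fail loudly if absent.
    missing = [key for key in ("id", "url", "duration") if key not in data]
    if missing:
        raise ValueError(f"response is missing fields: {missing}")
    return TtsTake(
        id=str(data["id"]),
        url=str(data["url"]),
        duration=float(data["duration"]),
    )


if __name__ == "__main__":
    # Illustrative placeholder values only.
    sample = '{"id": "TAKE_ID", "url": "https://example.com/take.mp3", "duration": 3.2}'
    print(parse_tts_response(sample))
```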
voiceId accepts any ElevenLabs voice id — list the voices available to your organization with GET /v2/voices — or the id of a voice you cloned via POST /v2/voices. Voice ids are case-sensitive.
Synthesize, then lip sync
The standalone endpoint is most powerful as the first half of a two-step flow: synthesize a take with /v2/tts, then pass the returned url as an audio input to /v2/generate to lip sync it onto a video.
The generation response echoes the synthesized take as synthesizedAudioUrl, so you can reuse the exact same audio across multiple generations without re-synthesizing. The field is also present when you submit a text input directly to /v2/generate; see the built-in TTS lip sync flow.
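The second half of the flow might be sketched in Python as follows. The `synthesizedAudioUrl` field and the idea of passing the take as an audio input come from this page; the body layout (an `input` list of `{"type", "url"}` objects) and the optional `model` field are assumptions, so treat this as an outline rather than the /v2/generate schema.

```python
def build_generate_payload(video_url, tts_url, model=None):
    """Body for POST /v2/generate that lip syncs an existing take onto a video.

    The "input" list layout and the "model" field are assumed, not documented here.
    """
    payload = {
        "input": [
            {"type": "video", "url": video_url},
            {"type": "audio", "url": tts_url},  # url returned by POST /v2/tts
        ]
    }
    if model is not None:
        payload["model"] = model  # hypothetical field
    return payload


def same_take(generation, tts_url):
    """True if a generation response echoes the given take as synthesizedAudioUrl."""
    return generation.get("synthesizedAudioUrl") == tts_url


if __name__ == "__main__":
    take_url = "https://example.com/take.mp3"  # placeholder for a /v2/tts url
    # The same take can feed any number of generations.
    for video in ("https://example.com/a.mp4", "https://example.com/b.mp4"):
        print(build_generate_payload(video, take_url))
```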
Quotas and rate limits
Free-tier API keys share a monthly ElevenLabs allowance of 10 synthesis operations across TTS and dubbing combined. Paid plans are billed per use. See the Billing page for details.
POST /v2/tts is rate limited to 60 requests per minute per key. Exceeding the limit returns a 429 — back off and retry. See Rate Limiting for the recommended retry strategy.
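One way to back off and retry on a 429 is capped exponential backoff, sketched below in Python. This is a generic pattern, not necessarily the strategy the Rate Limiting page recommends; the `call_with_backoff` helper, its defaults, and the `send` callable are all illustrative.

```python
import time


def call_with_backoff(send, max_attempts=5, base_delay=1.0, max_delay=30.0,
                      sleep=time.sleep):
    """Call send() until it returns a non-429 status or attempts run out.

    send is any zero-argument callable returning (status_code, body).
    Delays double after each 429: base_delay, 2x, 4x, ... capped at max_delay.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        status, body = send()
        if status != 429:
            return status, body
        if attempt < max_attempts - 1:  # no point sleeping after the last try
            sleep(min(max_delay, base_delay * 2 ** attempt))
    return status, body  # still 429 after every attempt


if __name__ == "__main__":
    # Simulated endpoint: rate limited twice, then succeeds.
    responses = iter([(429, None), (429, None), (200, {"ok": True})])
    print(call_with_backoff(lambda: next(responses), base_delay=0.01))
```

Passing `sleep` as a parameter keeps the helper testable without real waiting.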
FAQ
What's the difference between /v2/tts and TTS lip sync?
POST /v2/tts only synthesizes audio — it returns a hosted url and does not touch video. The built-in TTS lip sync flow passes a text input to POST /v2/generate, which synthesizes the audio and runs lip sync in a single call. Use /v2/tts when you want to inspect, reuse, or store the synthesized take before (or independent of) lip syncing it.
How do I find a voiceId?
Call GET /v2/voices to list the voices available to your organization, including built-in ElevenLabs voices and any you have cloned. Each entry includes an id you can pass as voiceId. To create your own voice, see Voice Cloning.
Can I reuse the same take across multiple videos?
Yes. The url returned by /v2/tts is stable — pass it as an audio input to as many POST /v2/generate calls as you like. The generation response also echoes it as synthesizedAudioUrl, so you can recover the exact take from a completed generation without re-synthesizing.
How do I handle longer scripts or multiple speakers?
Synthesize each section as a separate /v2/tts take, then assign the resulting audio urls to different time ranges with the Segments API. Each segment can reference a different audio input, which is how you build multi-speaker and long-form lip sync in a single generation.
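As a rough Python sketch of the planning step, the helper below lays takes end to end using the durations returned by /v2/tts. The output keys (`start`, `end`, `audio_url`) are a neutral, hypothetical layout and not the Segments API request format; mapping the plan onto that API is left to its reference.

```python
def plan_segments(takes, start=0.0):
    """Assign back-to-back time ranges to synthesized takes.

    takes is a list of (audio_url, duration_seconds) pairs, one per section
    or speaker turn, in playback order.
    """
    plan = []
    cursor = start
    for audio_url, duration in takes:
        if duration <= 0:
            raise ValueError(f"duration must be positive, got {duration}")
        plan.append({"start": cursor, "end": cursor + duration, "audio_url": audio_url})
        cursor += duration  # the next take begins where this one ends
    return plan


if __name__ == "__main__":
    # Placeholder urls and durations standing in for two /v2/tts takes.
    takes = [
        ("https://example.com/speaker-a.mp3", 2.5),
        ("https://example.com/speaker-b.mp3", 1.5),
    ]
    for segment in plan_segments(takes):
        print(segment)
```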
Related
- Voice Cloning — clone a custom voice and synthesize with its id.
- Segments — assign different synthesized takes to different parts of the timeline.
- Text-to-Speech Lip Sync Guide — synthesize and lip sync in a single /v2/generate call.

