Text-to-Speech Lip Sync Guide

Combine text-to-speech with lip sync to create talking head videos from just text and a source video. Type a script, pick a voice, and Sync generates a video where the speaker’s lips match the spoken words.

Using the Built-in ElevenLabs Integration

The fastest path. Sync’s ElevenLabs integration handles TTS and lipsync in a single API call — no need to generate and host audio separately.

```typescript
import { SyncClient } from "@sync.so/sdk";

const sync = new SyncClient();

async function main() {
  const response = await sync.generations.create({
    input: [
      {
        type: "video",
        url: "https://assets.sync.so/docs/example-video.mp4",
      },
      {
        type: "text",
        provider: {
          name: "elevenlabs",
          voiceId: "EXAVITQu4vr4xnSDxMaL",
          script: "Hey there. I wanted to walk you through our latest features. We shipped three major updates this week.",
          stability: 0.5,
          similarityBoost: 0.75,
        },
      },
    ],
    model: "lipsync-2",
    options: { sync_mode: "cut_off" },
  });

  const jobId = response.id;
  console.log(`Job submitted: ${jobId}`);

  // Poll until the job reaches a terminal state
  let generation = await sync.generations.get(jobId);
  while (!["COMPLETED", "FAILED", "REJECTED"].includes(generation.status)) {
    await new Promise((r) => setTimeout(r, 10000));
    generation = await sync.generations.get(jobId);
  }

  if (generation.status === "COMPLETED") {
    console.log(`Video ready: ${generation.outputUrl}`);
  } else {
    console.log(`Generation failed: ${jobId}`);
  }
}

main();
```

ElevenLabs Provider Parameters

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| `name` | string | | Must be `"elevenlabs"` |
| `voiceId` | string | | ElevenLabs voice ID |
| `script` | string | | Text to speak (max 5,000 characters) |
| `stability` | float | 0.5 | Voice stability (0.0-1.0). Lower = more expressive. |
| `similarityBoost` | float | 0.75 | Voice similarity to the original (0.0-1.0). Higher = closer match. |

Enable the ElevenLabs integration from your Integrations settings. You can use Sync’s built-in integration or provide your own ElevenLabs API key (Creator plan or higher).

Using External TTS Providers

If you use a TTS provider other than ElevenLabs — Google Cloud TTS, Amazon Polly, Azure Speech, or any other service — generate the audio first, host it at a public URL, then pass it to Sync.

```typescript
import { SyncClient } from "@sync.so/sdk";

const sync = new SyncClient();

// Audio generated by your TTS provider, hosted at a public URL
const ttsAudioUrl = "https://your-cdn.com/generated-speech.mp3";

const response = await sync.generations.create({
  input: [
    { type: "video", url: "https://assets.sync.so/docs/example-video.mp4" },
    { type: "audio", url: ttsAudioUrl },
  ],
  model: "lipsync-2",
  options: { sync_mode: "cut_off" },
});

const jobId = response.id;
console.log(`Job submitted: ${jobId}`);

let generation = await sync.generations.get(jobId);
while (!["COMPLETED", "FAILED", "REJECTED"].includes(generation.status)) {
  await new Promise((r) => setTimeout(r, 10000));
  generation = await sync.generations.get(jobId);
}

if (generation.status === "COMPLETED") {
  console.log(`Video ready: ${generation.outputUrl}`);
}
```

This approach works with any TTS provider. The only requirement is that the audio file is accessible via a public URL.
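Before submitting a job, it can save a round trip to sanity-check the audio URL locally. The helper below is an illustrative sketch, not part of the Sync SDK, and the extension list is an assumption; consult the API reference for the formats Sync actually accepts.

```python
from urllib.parse import urlparse

# Assumed set of common audio extensions; check the Sync API reference
# for the authoritative list of supported formats.
AUDIO_EXTENSIONS = {".mp3", ".wav", ".m4a", ".aac", ".ogg"}

def looks_like_public_audio_url(url: str) -> bool:
    """Return True if url is an absolute http(s) URL ending in an audio extension."""
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https") or not parsed.netloc:
        return False
    return any(parsed.path.lower().endswith(ext) for ext in AUDIO_EXTENSIONS)
```

This catches the most common mistakes (a local file path, a missing host, a non-audio URL) before the request is made; it does not verify that the URL is actually reachable from Sync's servers.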

Voice Cloning Workflow

Clone a speaker’s voice with ElevenLabs, then use that cloned voice ID with Sync’s integration. The result: a video where the speaker looks AND sounds like themselves — speaking entirely new words.

1. Clone the voice

Upload a clean audio sample of the speaker to ElevenLabs to create a cloned voice. You get back a voice ID.

```python
# Use the ElevenLabs API or dashboard to clone a voice:
# https://elevenlabs.io/docs/voices/voice-cloning
cloned_voice_id = "your-cloned-voice-id"
```

2. Generate lip sync with the cloned voice

Use the cloned voice ID in your Sync API call.

```python
from sync import Sync
from sync.common import Video, TTS, GenerationOptions

sync = Sync()

response = sync.generations.create(
    input=[
        Video(url="https://your-cdn.com/speaker-video.mp4"),
        TTS(
            provider={
                "name": "elevenlabs",
                "voiceId": cloned_voice_id,
                "script": "This is the new script I want the speaker to say.",
                "stability": 0.5,
                "similarityBoost": 0.85,  # higher similarity for cloned voices
            }
        ),
    ],
    model="lipsync-2-pro",  # Pro model for highest quality
    options=GenerationOptions(sync_mode="cut_off"),
)
```

3. Download the result

Poll for completion and retrieve the output video. The speaker now says the new script with their own voice and matching lip movements.
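The polling loop from the earlier examples can be factored into a reusable helper. This is an illustrative sketch: `get_status` stands in for a call like `sync.generations.get`, which is assumed to return an object with a `status` attribute.

```python
import time

# Terminal states, per the polling loops in the examples above.
TERMINAL_STATUSES = {"COMPLETED", "FAILED", "REJECTED"}

def wait_for_completion(get_status, job_id, interval=10, timeout=600):
    """Poll until the generation reaches a terminal state or the timeout expires.

    get_status: callable taking a job ID and returning an object with .status,
                e.g. lambda jid: sync.generations.get(jid)
    """
    deadline = time.monotonic() + timeout
    while True:
        generation = get_status(job_id)
        if generation.status in TERMINAL_STATUSES:
            return generation
        if time.monotonic() >= deadline:
            raise TimeoutError(f"Generation {job_id} did not finish within {timeout}s")
        time.sleep(interval)
```

Passing the getter as a callable keeps the helper independent of any one SDK and makes it easy to test with a stub.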

Best Practices

Keep scripts under 5,000 characters

The ElevenLabs integration has a 5,000-character limit per generation. For longer scripts, split them into segments using the Segments API, with each segment referencing a separate TTS input.
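One simple way to stay under the limit is to split long scripts at sentence boundaries. The helper below is an illustrative sketch, not part of the Sync SDK; note that a single sentence longer than the limit would still need manual handling.

```python
import re

def split_script(script: str, max_chars: int = 5000) -> list[str]:
    """Split a script into chunks under max_chars, breaking at sentence boundaries."""
    sentences = re.split(r"(?<=[.!?])\s+", script.strip())
    chunks, current = [], ""
    for sentence in sentences:
        # Start a new chunk when appending this sentence would exceed the limit.
        if current and len(current) + 1 + len(sentence) > max_chars:
            chunks.append(current)
            current = sentence
        else:
            current = f"{current} {sentence}".strip()
    if current:
        chunks.append(current)
    return chunks
```

Each chunk can then become its own TTS input, with one segment per chunk.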

Tune voice settings

Stability controls how consistent the voice sounds. Lower values (0.2-0.4) produce more expressive, varied speech. Higher values (0.6-0.8) produce more consistent, predictable speech. Similarity boost controls how closely the output matches the original voice. For cloned voices, use higher values (0.8-0.9).
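These recommendations can be captured as a small preset helper. This is an illustrative convenience, not an official API; the specific values simply encode the ranges suggested above.

```python
def voice_settings(expressive: bool = False, cloned: bool = False) -> dict:
    """Return stability/similarityBoost presets per the guidance above."""
    return {
        "stability": 0.3 if expressive else 0.7,      # lower = more varied speech
        "similarityBoost": 0.85 if cloned else 0.75,  # higher for cloned voices
    }
```

The returned dict can be merged into the `provider` object of a TTS input.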

Use lipsync-2-pro for highest quality

For talking head videos where the face is prominent, lipsync-2-pro produces the best results. It handles detail around teeth, beards, and facial features better than other models. The trade-off is slower processing and higher cost.

Use react-1 for expressive results

For short clips (under 15 seconds) where you want the speaker to show emotion, use react-1 with an emotion prompt. The model generates facial expressions and head movements that match the audio tone.

Multi-Segment TTS

For longer scripts or multi-speaker scenarios, use the Segments API with multiple TTS inputs:

```python
from sync import Sync
from sync.common import Video, TTS

sync = Sync()

response = sync.generations.create(
    input=[
        Video(url="https://your-cdn.com/video.mp4"),
        TTS(
            provider={
                "name": "elevenlabs",
                "voiceId": "voice-id-1",
                "script": "Welcome to the first section of our presentation.",
            },
            ref_id="intro",
        ),
        TTS(
            provider={
                "name": "elevenlabs",
                "voiceId": "voice-id-2",
                "script": "Now let me hand it over to my colleague for the demo.",
            },
            ref_id="handoff",
        ),
    ],
    segments=[
        {"startTime": 0, "endTime": 8, "audioInput": {"refId": "intro"}},
        {"startTime": 8, "endTime": 15, "audioInput": {"refId": "handoff"}},
    ],
    model="lipsync-2",
)
```
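Keeping the `ref_id` values in the inputs consistent with the `refId` values in the segments is easy to get wrong by hand. A hypothetical helper (not part of the Sync SDK) that derives both lists from one list of tuples:

```python
def build_tts_segments(parts):
    """Build matching TTS inputs and segments from
    (voice_id, script, start_time, end_time) tuples."""
    tts_inputs, segments = [], []
    for i, (voice_id, script, start, end) in enumerate(parts):
        ref_id = f"seg-{i}"
        tts_inputs.append({
            "type": "text",
            "provider": {"name": "elevenlabs", "voiceId": voice_id, "script": script},
            "ref_id": ref_id,
        })
        segments.append({
            "startTime": start,
            "endTime": end,
            "audioInput": {"refId": ref_id},
        })
    return tts_inputs, segments
```

Because both lists come from the same loop, every segment is guaranteed to reference an input that exists.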

Troubleshooting TTS

TTS generation failures typically come from one of three issues:

- Invalid voice ID. Verify that the voiceId you pass is a valid ElevenLabs voice ID. Voice IDs can stop working if the voice is deleted from your ElevenLabs account or if you reference a shared voice that is no longer available.
- Script over the character limit. The script field must be under 5,000 characters; longer scripts are rejected.
- Missing ElevenLabs configuration. Confirm that the ElevenLabs integration is enabled in your Integrations settings. Free accounts use Sync's built-in ElevenLabs key, while Creator plans and above can provide their own API key for higher quotas.

Check the error response from the GET /v2/generate/{id} endpoint for the specific error code and message.
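The first two failure causes can be caught locally before a request is submitted. The validator below is an illustrative pre-flight check, not part of the Sync SDK:

```python
def validate_tts_provider(provider: dict) -> list[str]:
    """Return a list of problems with a TTS provider object; empty means OK."""
    errors = []
    if provider.get("name") != "elevenlabs":
        errors.append('provider name must be "elevenlabs"')
    if not provider.get("voiceId"):
        errors.append("voiceId is missing")
    script = provider.get("script", "")
    if not script:
        errors.append("script is empty")
    elif len(script) > 5000:
        errors.append(f"script is {len(script)} characters; the limit is 5,000")
    return errors
```

Running this before `generations.create` turns two of the most common rejections into immediate local errors instead of failed jobs.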

To find your ElevenLabs voice ID, log in to the ElevenLabs dashboard and navigate to the Voices section. Select the voice you want to use, then look for the voice ID in the URL bar or in the voice settings panel — it is a string of characters like EXAVITQu4vr4xnSDxMaL. You can also find voice IDs through the ElevenLabs API by calling their List Voices endpoint. If you are using a cloned voice, the voice ID is returned when you create the clone. Copy the voice ID exactly as shown and pass it as the voiceId parameter in your Sync API request. Note that voice IDs are case-sensitive. If you are using Sync’s built-in ElevenLabs integration on a free account, you can use any of the default ElevenLabs voices without needing your own ElevenLabs account.

Sync's TTS integration supports multiple languages through ElevenLabs. ElevenLabs offers multilingual voice models that can generate speech in over 29 languages including Spanish, French, German, Portuguese, Japanese, Chinese, Arabic, Hindi, and many more. To use TTS in a non-English language, choose an ElevenLabs voice that supports your target language — multilingual voices are labeled as such in the ElevenLabs voice library. Write your script in the target language and the TTS engine will generate speech in that language. The lip sync model will then match the lip movements to the generated audio regardless of the language, as Sync's lip sync models are language-agnostic. For the best results, select a voice that is native to your target language rather than relying on a single voice to handle all languages.
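For example, a Spanish-language generation only changes the script (and, ideally, the voice). This is a minimal sketch; the voice ID below is a placeholder for a multilingual or Spanish-native ElevenLabs voice, not a real ID.

```python
# Placeholder voice ID: substitute a multilingual or Spanish-native voice.
spanish_tts_input = {
    "type": "text",
    "provider": {
        "name": "elevenlabs",
        "voiceId": "your-multilingual-voice-id",
        "script": "Hola, bienvenidos a nuestra presentación de hoy.",
    },
}
```

Everything else in the request, including the model and sync_mode, stays the same as in the English examples.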
