Video Translation API Guide

Video translation requires a multi-step pipeline: transcribe the original audio, translate the text, generate speech in the target language, and lip sync the new audio to the original video. Sync handles the final lipsync step. This guide walks through the full pipeline.

Full Pipeline Walkthrough

Step 1: Transcribe the original audio

Extract the spoken words from your source video. OpenAI’s Whisper is a solid choice for transcription.
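The script below assumes the audio track has already been pulled out of the video. One way to do that is to shell out to ffmpeg from Python (a sketch; it assumes ffmpeg is installed and on your PATH):

```python
import subprocess

def build_extract_cmd(video_path: str, audio_path: str) -> list[str]:
    """Build an ffmpeg command that strips the video stream to a WAV file."""
    return [
        "ffmpeg", "-y",      # overwrite the output file if it exists
        "-i", video_path,
        "-vn",               # drop the video stream
        "-ac", "1",          # mono
        "-ar", "16000",      # 16 kHz sample rate, plenty for speech
        audio_path,
    ]

def extract_audio(video_path: str, audio_path: str) -> None:
    subprocess.run(build_extract_cmd(video_path, audio_path), check=True)

# extract_audio("original-video.mp4", "original-audio.wav")
```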

transcribe.py

```python
from openai import OpenAI

client = OpenAI()

# Extract audio from video first (using ffmpeg or similar)
with open("original-audio.wav", "rb") as audio_file:
    transcript = client.audio.transcriptions.create(
        model="whisper-1",
        file=audio_file,
        response_format="verbose_json",
        timestamp_granularities=["segment"],
    )

print(transcript.text)
# Save segments with timestamps for alignment
for segment in transcript.segments:
    print(f"[{segment.start:.1f}s - {segment.end:.1f}s] {segment.text}")
```

The segment timestamps are useful for aligning translated audio with the correct video sections, especially for multi-speaker or long-form content.
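A lightweight way to keep those timestamps around for the later steps is to dump them to JSON. A sketch, assuming each segment is a plain dict with start, end, and text keys:

```python
import json

def save_segments(segments, path):
    """Persist only what downstream steps need: start, end, text."""
    rows = [
        {
            "start": round(s["start"], 2),
            "end": round(s["end"], 2),
            "text": s["text"].strip(),
        }
        for s in segments
    ]
    with open(path, "w", encoding="utf-8") as f:
        json.dump(rows, f, ensure_ascii=False, indent=2)
    return rows
```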

Step 2: Translate the transcript

Translate the transcribed text into the target language. Use a translation API or LLM for this step.

translate.py

```python
from openai import OpenAI

client = OpenAI()

original_text = "Welcome to our platform. Today we'll walk through the new features."
target_language = "Spanish"

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {
            "role": "system",
            "content": f"Translate the following text to {target_language}. "
            "Keep the tone natural and conversational. "
            "Return only the translated text.",
        },
        {"role": "user", "content": original_text},
    ],
)

translated_text = response.choices[0].message.content
print(translated_text)
# "Bienvenidos a nuestra plataforma. Hoy repasaremos las nuevas funciones."
```

For production pipelines, consider specialized translation APIs (DeepL, Google Translate) for higher throughput and language coverage.
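Whichever provider you pick, long-form content is usually translated segment by segment so the timestamps from step 1 survive. A provider-agnostic sketch, where translate_fn stands in for whichever API you wire up:

```python
def translate_segments(segments, translate_fn):
    """Translate each segment's text while preserving its timestamps.

    translate_fn: any callable str -> str (GPT, DeepL, Google Translate, ...).
    """
    return [
        {"start": s["start"], "end": s["end"], "text": translate_fn(s["text"])}
        for s in segments
    ]
```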

Step 3: Generate speech in the target language

Convert the translated text to audio using a TTS service. ElevenLabs supports multilingual voice cloning — you can clone the original speaker’s voice and generate speech in the new language.

generate_speech.py

```python
import requests

ELEVENLABS_API_KEY = "your-elevenlabs-key"
VOICE_ID = "EXAVITQu4vr4xnSDxMaL"  # Or a cloned voice ID

response = requests.post(
    f"https://api.elevenlabs.io/v1/text-to-speech/{VOICE_ID}",
    headers={
        "xi-api-key": ELEVENLABS_API_KEY,
        "Content-Type": "application/json",
    },
    json={
        "text": "Bienvenidos a nuestra plataforma. Hoy repasaremos las nuevas funciones.",
        "model_id": "eleven_multilingual_v2",
        "voice_settings": {
            "stability": 0.5,
            "similarity_boost": 0.75,
        },
    },
)
response.raise_for_status()

with open("translated-audio.mp3", "wb") as f:
    f.write(response.content)
```

Upload the generated audio to a publicly accessible URL for the next step.

Step 4: Lip sync with the Sync API

Send the original video and the translated audio to Sync. The API generates new lip movements matching the translated speech.

```typescript
import { SyncClient } from "@sync.so/sdk";

const sync = new SyncClient();

const response = await sync.generations.create({
  input: [
    { type: "video", url: "https://your-cdn.com/original-video.mp4" },
    { type: "audio", url: "https://your-cdn.com/translated-audio.mp3" },
  ],
  model: "lipsync-2",
  options: { sync_mode: "cut_off" },
});

const jobId = response.id;
console.log(`Lipsync job submitted: ${jobId}`);

// Poll for completion
let generation = await sync.generations.get(jobId);
while (!["COMPLETED", "FAILED", "REJECTED"].includes(generation.status)) {
  await new Promise((r) => setTimeout(r, 10000));
  generation = await sync.generations.get(jobId);
}

if (generation.status === "COMPLETED") {
  console.log(`Translated video ready: ${generation.outputUrl}`);
} else {
  console.log(`Generation failed: ${jobId}`);
}
```
Step 5: Use webhooks for production

For production pipelines, replace polling with webhooks. Pass a webhook_url when creating the generation, and Sync sends a POST request to it when the job finishes.

```python
from sync import Sync
from sync.common import Audio, Video

sync = Sync()

response = sync.generations.create(
    input=[
        Video(url="https://your-cdn.com/original-video.mp4"),
        Audio(url="https://your-cdn.com/translated-audio.mp3"),
    ],
    model="lipsync-2",
    webhook_url="https://your-app.com/webhooks/sync",
)
```

Shortcut: Built-in ElevenLabs Integration

You can skip the separate TTS step by using Sync’s built-in ElevenLabs integration. Pass the translated text directly and Sync handles TTS and lipsync in one call.

```python
from sync import Sync
from sync.common import Video, TTS, GenerationOptions

sync = Sync()

response = sync.generations.create(
    input=[
        Video(url="https://your-cdn.com/original-video.mp4"),
        TTS(
            provider={
                "name": "elevenlabs",
                "voiceId": "EXAVITQu4vr4xnSDxMaL",
                "script": "Bienvenidos a nuestra plataforma. Hoy repasaremos las nuevas funciones.",
                "stability": 0.5,
                "similarityBoost": 0.75,
            }
        ),
    ],
    model="lipsync-2",
    options=GenerationOptions(sync_mode="cut_off"),
)
```

See the Integrations page for setup instructions and voice configuration.

Using the sync-examples Repository

For a complete, ready-to-run translation pipeline, check the sync-examples repository. The translation example includes transcription with Whisper, translation with GPT, TTS with ElevenLabs, and lipsync with Sync — all wired together.

```bash
$ git clone https://github.com/synchronicity-labs/sync-examples.git
$ cd sync-examples/translation/python
$ pip install -r requirements.txt
$ # Configure your API keys in args.py
$ python main.py
```

Quality Optimization Tips

Choose the right model

Use lipsync-2 for standard translation jobs. Use lipsync-2-pro for premium content where facial detail (beards, teeth, wrinkles) matters. The quality difference is most visible in close-up shots.

Ensure audio quality

Clean, high-quality TTS audio produces better lipsync results. Use high-fidelity TTS models (like eleven_multilingual_v2) and avoid noisy or compressed audio files.

Match speaking pace

Translated text often has a different word count than the original. Tune your TTS speed settings so the translated audio duration roughly matches the original video length. This reduces artifacts from sync_mode adjustments.
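As a rough rule of thumb, you can derive a TTS speed multiplier from the two durations and clamp it to a range where the voice still sounds natural. A sketch; the 0.8–1.2 comfort band is an assumption, not a Sync or ElevenLabs recommendation:

```python
def tts_speed_for(original_seconds: float, translated_seconds: float,
                  lo: float = 0.8, hi: float = 1.2) -> float:
    """Speed multiplier to make translated audio fit the original duration.

    > 1.0 speeds speech up (translated audio runs long), < 1.0 slows it down.
    Clamped so the voice stays natural; large mismatches are better fixed
    by rephrasing the translation.
    """
    ratio = translated_seconds / original_seconds
    return max(lo, min(hi, ratio))
```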

Handling Long Videos

For videos longer than a few minutes, break them into segments:

  1. Transcribe with timestamps — Use Whisper’s segment output to identify natural break points.
  2. Translate segment by segment — Translate each chunk individually for better accuracy.
  3. Generate audio per segment — Create separate TTS audio files for each segment.
  4. Use the Segments API — Submit all segments in a single Sync API call with different audio inputs per time range. See the Segments Guide.
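The first step above can be sketched as a small heuristic over Whisper's segment output: start a new chunk at a long pause, or when a chunk grows past a duration cap (both thresholds are assumed defaults, tune them for your content):

```python
def chunk_segments(segments, max_chunk_seconds=60.0, max_gap=1.0):
    """Group transcript segments into chunks, splitting at natural pauses.

    segments: dicts with "start" and "end" times in seconds, in order.
    """
    chunks, current = [], []
    for seg in segments:
        if current:
            gap = seg["start"] - current[-1]["end"]
            length = seg["end"] - current[0]["start"]
            # Split at a long silence, or when the chunk would run too long
            if gap >= max_gap or length > max_chunk_seconds:
                chunks.append(current)
                current = []
        current.append(seg)
    if current:
        chunks.append(current)
    return chunks
```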

For batch translation of multiple videos, use the Batch API to submit up to 500 generation requests in one operation.

Next Steps