Video Translation API Guide
Video translation requires a multi-step pipeline: transcribe the original audio, translate the text, generate speech in the target language, and lip sync the new audio to the original video. Sync handles the final lipsync step. This guide walks through the full pipeline.
Full Pipeline Walkthrough
Transcribe the original audio
Extract the spoken words from your source video. OpenAI’s Whisper is a solid choice for transcription.
The segment timestamps are useful for aligning translated audio with the correct video sections, especially for multi-speaker or long-form content.
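A minimal transcription sketch, assuming the open-source openai-whisper package is installed and using a placeholder file name:

```python
# Minimal transcription sketch. Assumes the open-source `openai-whisper`
# package (pip install openai-whisper); "source_video.mp4" is a placeholder.

def format_segments(segments):
    """Flatten Whisper segment dicts into (start, end, text) tuples for alignment."""
    return [(round(s["start"], 2), round(s["end"], 2), s["text"].strip())
            for s in segments]

if __name__ == "__main__":
    import whisper  # heavy dependency, imported only when run directly
    model = whisper.load_model("base")
    result = model.transcribe("source_video.mp4")
    for start, end, text in format_segments(result["segments"]):
        print(f"[{start:7.2f}s - {end:7.2f}s] {text}")
```

Keeping the (start, end) pairs around, rather than just the joined transcript, is what makes the segment-level alignment described above possible later in the pipeline.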
Translate the transcript
Translate the transcribed text into the target language. Use a translation API or LLM for this step.
For production pipelines, consider specialized translation APIs (DeepL, Google Translate) for higher throughput and language coverage.
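A sketch of the LLM route using the official openai Python SDK; the model name is an example choice, not a requirement:

```python
# Sketch: translate a transcript chunk with an LLM via the official `openai`
# SDK (pip install openai). The model name below is an example choice.

def build_translation_prompt(text, target_language):
    """Chat messages instructing the model to return only the translation."""
    return [
        {"role": "system",
         "content": f"Translate the user's text into {target_language}. "
                    "Return only the translation, nothing else."},
        {"role": "user", "content": text},
    ]

if __name__ == "__main__":
    from openai import OpenAI  # reads OPENAI_API_KEY from the environment
    client = OpenAI()
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=build_translation_prompt("Welcome to the demo.", "Spanish"),
    )
    print(resp.choices[0].message.content)
```

Constraining the model to return only the translation keeps the output safe to feed straight into the TTS step without post-processing.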
Generate speech in the target language
Convert the translated text to audio using a TTS service. ElevenLabs supports multilingual voice cloning — you can clone the original speaker’s voice and generate speech in the new language.
Upload the generated audio to a publicly accessible URL for the next step.
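A sketch of the TTS call against ElevenLabs' text-to-speech REST endpoint; the voice ID is a placeholder for a cloned voice, and the requests package plus an ELEVENLABS_API_KEY environment variable are assumed:

```python
# Sketch: generate translated speech via ElevenLabs' text-to-speech REST
# endpoint. Assumes `requests` is installed and ELEVENLABS_API_KEY is set;
# the voice ID is a placeholder for a voice cloned from the original speaker.

def build_tts_payload(text, model_id="eleven_multilingual_v2"):
    """Request body for the ElevenLabs text-to-speech endpoint."""
    return {"text": text, "model_id": model_id}

if __name__ == "__main__":
    import os
    import requests
    voice_id = "YOUR_VOICE_ID"  # placeholder: cloned voice of the original speaker
    url = f"https://api.elevenlabs.io/v1/text-to-speech/{voice_id}"
    headers = {"xi-api-key": os.environ["ELEVENLABS_API_KEY"]}
    resp = requests.post(url, headers=headers,
                         json=build_tts_payload("Bienvenido a la demostración."))
    resp.raise_for_status()
    with open("translated_audio.mp3", "wb") as f:
        f.write(resp.content)  # upload this file to a publicly accessible URL
```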
Lip sync with Sync API
Send the original video and the translated audio to Sync. The API generates new lip movements matching the translated speech.
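A sketch of the lipsync call with simple polling. The endpoint, payload shape, and status values below follow Sync's v2 generate API as of this writing and may change, so confirm them against the current API reference:

```python
# Sketch of the Sync lipsync call with polling. Endpoint, payload shape, and
# status values follow Sync's v2 generate API; verify against the current
# API reference. Assumes `requests` is installed and SYNC_API_KEY is set.
import time

def build_lipsync_payload(video_url, audio_url, model="lipsync-2"):
    """Request body pairing the original video with the translated audio."""
    return {
        "model": model,
        "input": [
            {"type": "video", "url": video_url},
            {"type": "audio", "url": audio_url},
        ],
    }

if __name__ == "__main__":
    import os
    import requests
    headers = {"x-api-key": os.environ["SYNC_API_KEY"]}
    payload = build_lipsync_payload(
        "https://example.com/original_video.mp4",    # placeholder URLs
        "https://example.com/translated_audio.mp3",
    )
    job = requests.post("https://api.sync.so/v2/generate",
                        headers=headers, json=payload).json()
    while True:  # simple polling loop; production should use webhooks instead
        job = requests.get(f"https://api.sync.so/v2/generate/{job['id']}",
                           headers=headers).json()
        if job["status"] in ("COMPLETED", "FAILED"):
            break
        time.sleep(10)
    print(job)
```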
Use webhooks for production
For production pipelines, replace polling with webhooks. Pass a webhookUrl when creating the generation, and Sync will send a POST request to that URL when the job finishes.
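A minimal sketch of attaching the webhookUrl field to the generation request body; the receiver endpoint is a placeholder:

```python
# Minimal sketch: attach a webhook URL to a Sync generation request body so
# the API POSTs the result when the job finishes instead of being polled.

def with_webhook(payload, webhook_url):
    """Return a copy of the request body with a webhookUrl field attached."""
    return {**payload, "webhookUrl": webhook_url}

# Hypothetical usage with a placeholder receiver endpoint:
payload = with_webhook({"model": "lipsync-2"},
                       "https://example.com/hooks/sync-done")
```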
Shortcut: Built-in ElevenLabs Integration
You can skip the separate TTS step by using Sync’s built-in ElevenLabs integration. Pass the translated text directly and Sync handles TTS and lipsync in one call.
See the Integrations page for setup instructions and voice configuration.
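As a rough illustration of the one-call shape only: the field names below ("provider", "voiceId", the text input type) are hypothetical, so use the schema from the Integrations page rather than this sketch.

```python
# Rough illustration of the one-call shape. Field names here ("provider",
# "voiceId", the "text" input type) are hypothetical -- follow the schema
# on the Integrations page, not this sketch.

def build_tts_lipsync_payload(video_url, translated_text, voice_id):
    return {
        "model": "lipsync-2",
        "input": [
            {"type": "video", "url": video_url},
            # Hypothetical text-to-speech input in place of an audio URL:
            {"type": "text", "provider": "elevenlabs",
             "voiceId": voice_id, "text": translated_text},
        ],
    }
```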
Using the sync-examples Repository
For a complete, ready-to-run translation pipeline, check the sync-examples repository. The translation example includes transcription with Whisper, translation with GPT, TTS with ElevenLabs, and lipsync with Sync — all wired together.
Quality Optimization Tips
Use lipsync-2 for standard translation jobs. Use lipsync-2-pro for premium content where facial detail (beards, teeth, wrinkles) matters. The quality difference is most visible in close-up shots.
Clean, high-quality TTS audio produces better lipsync results. Use high-fidelity TTS models (like eleven_multilingual_v2) and avoid noisy or compressed audio files.
Translated text often has a different word count than the original. Tune your TTS speed settings so the translated audio duration roughly matches the original video length. This reduces artifacts from sync_mode adjustments.
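A minimal sketch of picking that speed multiplier, with clamping bounds chosen here as an illustration so sped-up or slowed-down speech stays natural:

```python
# Sketch: choose a TTS speed multiplier so the translated audio roughly
# matches the original video length. The 0.8-1.2 clamp is an illustrative
# bound to keep the voice sounding natural.

def tts_speed_for_fit(translated_s, original_s, min_speed=0.8, max_speed=1.2):
    """Speed multiplier that stretches/shrinks translated audio toward original_s."""
    ratio = translated_s / original_s
    return max(min_speed, min(max_speed, ratio))
```

For example, 66 seconds of translated audio against a 60-second original suggests a 1.1x speed; anything past the clamp bounds is better handled by editing the translation itself.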
Handling Long Videos
For videos longer than a few minutes, break them into segments:
- Transcribe with timestamps — Use Whisper’s segment output to identify natural break points.
- Translate segment by segment — Translate each chunk individually for better accuracy.
- Generate audio per segment — Create separate TTS audio files for each segment.
- Use the Segments API — Submit all segments in a single Sync API call with different audio inputs per time range. See the Segments Guide.
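The first step above, finding break points from Whisper's timestamps, can be sketched as a small helper that splits only at existing segment boundaries:

```python
# Sketch: group Whisper (start, end, text) segments into chunks of roughly
# max_chunk_s seconds, breaking only at natural segment boundaries.

def chunk_segments(segments, max_chunk_s=60.0):
    chunks, current, chunk_start = [], [], None
    for start, end, text in segments:
        if chunk_start is None:
            chunk_start = start
        if current and end - chunk_start > max_chunk_s:
            chunks.append(current)       # close the current chunk
            current, chunk_start = [], start
        current.append((start, end, text))
    if current:
        chunks.append(current)
    return chunks
```

Each resulting chunk is then translated and voiced independently, and its time range maps directly onto a segment in the Segments API request.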
For batch translation of multiple videos, use the Batch API to submit up to 500 generation requests in one operation.
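The 500-request limit means larger jobs need to be split before submission; a minimal helper:

```python
# Sketch: split a large list of generation requests into batches that fit
# the Batch API's 500-requests-per-operation limit.

def batch_requests(requests_, batch_size=500):
    return [requests_[i:i + batch_size]
            for i in range(0, len(requests_), batch_size)]
```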
Next Steps
- Video Dubbing API Guide — Focused guide for the dubbing step
- Text-to-Speech Lip Sync Guide — Combine TTS providers with lipsync
- Segments Guide — Multi-speaker and long-form video handling
- Batch API — Process multiple videos at scale