Video translation requires a multi-step pipeline: transcribe the original audio, translate the text, generate speech in the target language, and lipsync the original video to the new audio. Sync Labs handles the final lipsync step. This guide walks through the full pipeline.
Extract the spoken words from your source video. OpenAI’s Whisper is a solid choice for transcription.
The segment timestamps are useful for aligning translated audio with the correct video sections, especially for multi-speaker or long-form content.
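As a minimal sketch of working with those timestamps, the helper below pulls timed segments out of a transcription result. It assumes the result is a dictionary in the shape of Whisper's verbose JSON output (a `segments` list whose items carry `start`, `end`, and `text`); the `Segment` class and `parse_segments` function are illustrative names, not part of any SDK.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    start: float  # seconds from the start of the video
    end: float    # seconds from the start of the video
    text: str


def parse_segments(transcription: dict) -> list[Segment]:
    """Extract timed segments from a Whisper-style verbose JSON result."""
    return [
        Segment(float(s["start"]), float(s["end"]), s["text"].strip())
        for s in transcription.get("segments", [])
    ]


# Hand-written sample in the assumed shape, not real transcription output.
sample = {
    "text": "Hello everyone. Welcome back.",
    "segments": [
        {"start": 0.0, "end": 1.4, "text": " Hello everyone."},
        {"start": 1.4, "end": 3.1, "text": " Welcome back."},
    ],
}
segments = parse_segments(sample)
```

Keeping `start` and `end` next to each piece of text means later steps can translate and synthesize per segment without losing track of where it belongs in the video.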
Translate the transcribed text into the target language. Use a translation API or LLM for this step.
For production pipelines, consider specialized translation APIs (DeepL, Google Translate) for higher throughput and broader language coverage.
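One possible way to structure this step is to translate segment by segment so the timestamps survive. The sketch below takes the translator as a plain callable, so an LLM call or a translation API client can be plugged in; `translate_segments` and `fake_translate` are hypothetical names, and the stand-in translator exists only to make the example runnable.

```python
from typing import Callable


def translate_segments(
    segments: list[dict],
    translate: Callable[[str], str],
) -> list[dict]:
    """Translate each segment's text while keeping its start/end times."""
    translated = []
    for seg in segments:
        text = seg["text"].strip()
        translated.append({
            "start": seg["start"],
            "end": seg["end"],
            # Skip the translator for empty segments (silence, music).
            "text": translate(text) if text else "",
        })
    return translated


# Stand-in translator for demonstration only; swap in a real API or LLM call.
def fake_translate(text: str) -> str:
    glossary = {
        "Hello everyone.": "Hola a todos.",
        "Welcome back.": "Bienvenidos de nuevo.",
    }
    return glossary.get(text, text)


result = translate_segments(
    [
        {"start": 0.0, "end": 1.4, "text": " Hello everyone."},
        {"start": 1.4, "end": 3.1, "text": " Welcome back."},
    ],
    fake_translate,
)
```

Translating very short segments in isolation can lose context, so in practice you may prefer to pass neighboring segments to the translator as well.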
Convert the translated text to audio using a TTS service. ElevenLabs supports multilingual voice cloning — you can clone the original speaker’s voice and generate speech in the new language.
Upload the generated audio to storage that serves it at a publicly accessible URL. The lipsync step takes that URL as input.
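A rough sketch of the TTS call using only the standard library is shown below. It assumes ElevenLabs' REST text-to-speech endpoint at `https://api.elevenlabs.io/v1/text-to-speech/{voice_id}` with an `xi-api-key` header and a JSON body containing `text` and `model_id`; check the ElevenLabs reference before relying on these details. `build_tts_request` and `synthesize` are illustrative helper names, and the voice ID and API key are placeholders.

```python
import json
import urllib.request

# Assumed endpoint; verify against the ElevenLabs API reference.
ELEVENLABS_TTS_URL = "https://api.elevenlabs.io/v1/text-to-speech/{voice_id}"


def build_tts_request(text, voice_id, api_key, model_id="eleven_multilingual_v2"):
    """Build (but do not send) a text-to-speech request."""
    body = json.dumps({"text": text, "model_id": model_id}).encode("utf-8")
    return urllib.request.Request(
        ELEVENLABS_TTS_URL.format(voice_id=voice_id),
        data=body,
        headers={"xi-api-key": api_key, "Content-Type": "application/json"},
        method="POST",
    )


def synthesize(request, out_path):
    """Send the request and save the returned audio bytes (network call)."""
    with urllib.request.urlopen(request) as resp, open(out_path, "wb") as f:
        f.write(resp.read())
    return out_path


# Placeholders: replace with a cloned voice ID and a real API key.
req = build_tts_request(
    "Hola a todos.", voice_id="YOUR_VOICE_ID", api_key="YOUR_API_KEY"
)
```

Calling `synthesize(req, "translated.mp3")` would write the audio to disk, after which it can be uploaded to your storage of choice.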
Send the original video and the translated audio to Sync Labs. The API generates new lip movements matching the translated speech.
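The request and a simple polling loop might look like the sketch below. The base URL, the `x-api-key` header, the `input` list of typed URLs, and the status names are assumptions made for illustration; confirm them against the Sync Labs API reference. The helper names are hypothetical, and `wait_for_generation` takes the status lookup as a callable so the loop itself can be exercised without network access.

```python
import json
import time
import urllib.request

SYNC_API_BASE = "https://api.sync.so/v2"  # assumed base URL
TERMINAL_STATUSES = {"COMPLETED", "FAILED", "REJECTED"}  # assumed status names


def build_generation_payload(video_url, audio_url, model="lipsync-2"):
    """Assemble the JSON body for a lipsync generation (assumed shape)."""
    return {
        "model": model,
        "input": [
            {"type": "video", "url": video_url},
            {"type": "audio", "url": audio_url},
        ],
    }


def create_generation(payload, api_key):
    """Submit the generation and return the parsed response (network call)."""
    req = urllib.request.Request(
        f"{SYNC_API_BASE}/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"x-api-key": api_key, "Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


def wait_for_generation(fetch_status, interval=10.0, timeout=1800.0, sleep=time.sleep):
    """Call fetch_status() until the job reaches a terminal status."""
    deadline = time.monotonic() + timeout
    while True:
        job = fetch_status()
        if job.get("status") in TERMINAL_STATUSES:
            return job
        if time.monotonic() >= deadline:
            raise TimeoutError("generation did not finish before the timeout")
        sleep(interval)


payload = build_generation_payload(
    "https://example.com/original.mp4",
    "https://example.com/translated.mp3",
)
```

In a real run, `fetch_status` would be a small function that requests the generation by its ID and returns the parsed JSON.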
For production pipelines, replace polling with webhooks. Pass a webhookUrl when creating the generation, and Sync Labs sends a POST request to that URL when the job finishes.
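A sketch of both halves of the webhook flow follows: attaching `webhookUrl` to a generation payload, and parsing the POST body on the receiving end. The `webhookUrl` field comes from the text above; the payload shape and the webhook body fields (`id`, `status`, `outputUrl`) are assumptions, and the function names are illustrative.

```python
import json


def with_webhook(payload: dict, webhook_url: str) -> dict:
    """Return a copy of a generation payload with webhookUrl set."""
    return {**payload, "webhookUrl": webhook_url}


def handle_webhook(body: bytes) -> dict:
    """Parse the POST body and summarize the finished job.

    The field names read here (id, status, outputUrl) are assumptions.
    """
    event = json.loads(body)
    return {
        "id": event.get("id"),
        "succeeded": event.get("status") == "COMPLETED",
        "output_url": event.get("outputUrl"),
    }


# Assumed payload shape, shown only to demonstrate where webhookUrl goes.
payload = with_webhook(
    {
        "model": "lipsync-2",
        "input": [
            {"type": "video", "url": "https://example.com/original.mp4"},
            {"type": "audio", "url": "https://example.com/translated.mp3"},
        ],
    },
    "https://example.com/hooks/sync",
)

# Simulated delivery, not a captured request.
summary = handle_webhook(
    b'{"id": "gen_123", "status": "COMPLETED", "outputUrl": "https://example.com/out.mp4"}'
)
```

`handle_webhook` would typically be called from whatever web framework serves the webhook route, with the raw request body passed in.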
You can skip the separate TTS step by using Sync Labs’ built-in ElevenLabs integration. Pass the translated text directly and Sync Labs handles TTS and lipsync in one call.
See the Integrations page for setup instructions and voice configuration.
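For orientation, a single-call request could be assembled roughly as follows. The input shape here (a `text` input carrying an `elevenlabs` provider object with `voiceId` and `script`) is an assumption for illustration only; the Integrations page is the reference for the actual field names. `build_text_generation_payload` is a hypothetical helper and the voice ID is a placeholder.

```python
def build_text_generation_payload(video_url, script, voice_id, model="lipsync-2"):
    """Assemble a generation body that supplies text instead of audio.

    Assumed shape: verify field names on the Integrations page.
    """
    return {
        "model": model,
        "input": [
            {"type": "video", "url": video_url},
            {
                "type": "text",
                "provider": {
                    "name": "elevenlabs",
                    "voiceId": voice_id,  # placeholder voice ID
                    "script": script,     # translated text to speak
                },
            },
        ],
    }


payload = build_text_generation_payload(
    "https://example.com/original.mp4",
    "Hola a todos. Bienvenidos de nuevo.",
    voice_id="YOUR_VOICE_ID",
)
```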
For a complete, ready-to-run translation pipeline, check the sync-examples repository. The translation example includes transcription with Whisper, translation with GPT, TTS with ElevenLabs, and lipsync with Sync Labs — all wired together.
Use lipsync-2 for standard translation jobs. Use lipsync-2-pro for premium content where facial detail (beards, teeth, wrinkles) matters. The quality difference is most visible in close-up shots.
Clean, high-quality TTS audio produces better lipsync results. Use high-fidelity TTS models (like eleven_multilingual_v2) and avoid noisy or compressed audio files.
Translated speech often runs longer or shorter than the original because word and syllable counts differ between languages. Tune your TTS speed settings so the translated audio duration roughly matches the original video length. This reduces artifacts that appear when sync_mode has to reconcile mismatched audio and video durations.
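The arithmetic behind that tuning is simple: if the translated audio runs 12 seconds at normal speed and the original clip is 10 seconds, a speed multiplier of 12 / 10 = 1.2 brings it back to length. The sketch below computes that multiplier and clamps it; the default bounds of 0.7 and 1.2 are placeholder assumptions, so substitute the limits your TTS provider documents. `tts_speed_for` is an illustrative name.

```python
def tts_speed_for(original_seconds, translated_seconds, min_speed=0.7, max_speed=1.2):
    """Return the speed multiplier that fits translated audio to the original.

    translated_seconds is the duration of the audio rendered at speed 1.0.
    The result is clamped to [min_speed, max_speed] (placeholder bounds).
    """
    if original_seconds <= 0 or translated_seconds <= 0:
        raise ValueError("durations must be positive")
    ratio = translated_seconds / original_seconds
    return max(min_speed, min(max_speed, ratio))


speed = tts_speed_for(original_seconds=10.0, translated_seconds=12.0)
```

When the required ratio falls outside the clamp range, adjusting speed alone will not close the gap, and shortening or rephrasing the translation is usually the better fix.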
For videos longer than a few minutes, break them into segments:
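One way to choose the cut points is to reuse the transcript timestamps so that no chunk splits a sentence mid-word. The sketch below groups consecutive segments into chunks of at most `max_seconds`; the function name and the 60-second default are illustrative assumptions, not recommended values.

```python
def chunk_segments(segments, max_seconds=60.0):
    """Group consecutive segments into chunks, cutting only at segment edges.

    A single segment longer than max_seconds is kept whole as its own chunk.
    """
    groups = []
    current = []
    for seg in segments:
        # Start a new chunk if adding this segment would exceed the limit.
        if current and seg["end"] - current[0]["start"] > max_seconds:
            groups.append(current)
            current = []
        current.append(seg)
    if current:
        groups.append(current)
    return [
        {
            "start": g[0]["start"],
            "end": g[-1]["end"],
            "text": " ".join(s["text"].strip() for s in g),
        }
        for g in groups
    ]


chunks = chunk_segments(
    [
        {"start": 0, "end": 40, "text": "First part."},
        {"start": 40, "end": 70, "text": "Second part."},
        {"start": 70, "end": 100, "text": "Third part."},
        {"start": 100, "end": 130, "text": "Fourth part."},
    ],
    max_seconds=60,
)
```

Each chunk's `start` and `end` can then drive the video cut, and its `text` feeds the translation and TTS steps for that chunk.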
For batch translation of multiple videos, use the Batch API to submit up to 500 generation requests in one operation.
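If you have more requests than fit in one batch, they need to be split before submission. The sketch below only does that grouping; it does not call the Batch API, whose submission format is not covered here. `batch_requests` is an illustrative name, the 500 limit comes from the text above, and the request dictionaries are placeholders.

```python
def batch_requests(requests, max_batch_size=500):
    """Split a list of generation requests into batches of at most 500."""
    if max_batch_size < 1:
        raise ValueError("max_batch_size must be at least 1")
    return [
        requests[i:i + max_batch_size]
        for i in range(0, len(requests), max_batch_size)
    ]


# 1,200 placeholder requests, one per video to translate.
all_requests = [{"video": f"https://example.com/video_{n}.mp4"} for n in range(1200)]
batches = batch_requests(all_requests)
```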