How AI Lip Sync Works
What is AI Lip Sync?
AI lip sync takes a video of a person speaking and a separate audio track, then generates new lip movements that match the audio. The original speaker’s face is modified frame by frame so the mouth movements align naturally with the target speech.
This means you can change what someone appears to say in a video — swap in translated audio, new dialogue, or text-to-speech output — and the result looks like the person actually spoke those words.
How Sync’s Models Work
Sync’s lipsync models process video through a three-stage pipeline:
1. Face Detection
The model scans each frame to locate and track the speaker’s face. It identifies facial landmarks around the mouth, jaw, and lower face. For videos with multiple people, you can specify which speaker to target using speaker selection.
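The speaker-selection idea can be sketched in a few lines. This is an illustrative sketch only — the face boxes are mocked data (a real pipeline gets them from a detection model), and the strategy names are assumptions, not Sync's API:

```python
# Hypothetical sketch: picking a target speaker among detected faces.
# Each face box is (x, y, width, height); detection itself is mocked here.

def face_area(box):
    _, _, w, h = box
    return w * h

def select_speaker(faces, strategy="largest"):
    """Pick one face box per frame. 'largest' assumes the main speaker
    is closest to the camera; 'leftmost' picks by x coordinate."""
    if strategy == "largest":
        return max(faces, key=face_area)
    if strategy == "leftmost":
        return min(faces, key=lambda b: b[0])
    raise ValueError(f"unknown strategy: {strategy}")

# Two faces in one frame: a large foreground face and a small background one.
frame_faces = [(500, 60, 220, 220), (40, 80, 90, 90)]
print(select_speaker(frame_faces))             # (500, 60, 220, 220)
print(select_speaker(frame_faces, "leftmost"))  # (40, 80, 90, 90)
```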
2. Lip Movement Generation
The model analyzes the target audio and generates matching mouth shapes frame by frame, running inference on independent 2-second chunks of audio. During this process, the model learns the speaker’s unique speaking style — how wide they open their mouth, their natural head tilt while talking, their teeth visibility patterns. This produces lip movements that look like the original speaker, not generic mouth shapes pasted onto a face.
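The chunking described above can be sketched as follows. The sample rate is an assumption for illustration; only the 2-second window size comes from the text:

```python
# Sketch of splitting an audio track into independent 2-second chunks,
# mirroring the chunked inference described above.

SAMPLE_RATE = 16_000          # samples per second (assumed for illustration)
CHUNK_SECONDS = 2
CHUNK_SAMPLES = SAMPLE_RATE * CHUNK_SECONDS

def chunk_audio(samples):
    """Yield consecutive 2-second windows; the final chunk may be shorter."""
    for start in range(0, len(samples), CHUNK_SAMPLES):
        yield samples[start:start + CHUNK_SAMPLES]

# A 5-second track splits into 2s + 2s + 1s chunks.
audio = [0.0] * (SAMPLE_RATE * 5)
chunks = list(chunk_audio(audio))
print([len(c) / SAMPLE_RATE for c in chunks])  # [2.0, 2.0, 1.0]
```

Because the chunks are independent, each window can be processed without waiting on its neighbors.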
3. Blending
The generated lip region is composited back into the original video frame. The model handles skin tone matching, lighting consistency, and edge blending so the modified mouth area integrates with the surrounding face. The rest of the frame — background, body, hair — stays untouched.
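The core compositing step is alpha blending with a feathered edge. A minimal sketch on 1-D "pixels" — real blending also handles skin tone and lighting matching, which this omits:

```python
# Minimal alpha-blending sketch for compositing a generated mouth region
# back into the original frame.

def composite(original, generated, alpha):
    """Blend per pixel: alpha=1 keeps the generated lip region,
    alpha=0 keeps the original frame, in-between values feather the edge."""
    return [a * g + (1 - a) * o
            for o, g, a in zip(original, generated, alpha)]

original  = [10, 10, 10, 10, 10]
generated = [90, 90, 90, 90, 90]
# Full weight in the center, soft falloff at the edges of the mouth region.
alpha     = [0.0, 0.5, 1.0, 0.5, 0.0]
print(composite(original, generated, alpha))  # [10.0, 50.0, 90.0, 50.0, 10.0]
```

The soft falloff at the mask boundary is what prevents a visible seam between the generated region and the untouched frame.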
The result is a video where only the lip and lower face region has changed, with everything else preserved exactly as it was.
Choosing a Model
Sync offers multiple models optimized for different quality and speed tradeoffs. lipsync-2 is the recommended starting point for most applications. For premium quality with enhanced detail around teeth, beards, and facial features, use lipsync-2-pro. For expressive lip sync with emotion control and head movements on short-form content, use react-1.
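The guidance above can be captured as a small lookup. The model names come from this page; the requirement keys are illustrative labels, not API parameters:

```python
# Model-selection guidance from the text, expressed as a lookup table.

MODEL_GUIDE = {
    "general":    "lipsync-2",      # recommended starting point
    "premium":    "lipsync-2-pro",  # enhanced teeth, beard, facial detail
    "expressive": "react-1",        # emotion control and head movements
}

def pick_model(requirement="general"):
    """Fall back to the general-purpose default for unknown requirements."""
    return MODEL_GUIDE.get(requirement, MODEL_GUIDE["general"])

print(pick_model())           # lipsync-2
print(pick_model("premium"))  # lipsync-2-pro
```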
For full specs, pricing, and detailed feature comparisons, see the Lipsync Models and React Models pages.
Common Use Cases
AI lip sync enables a wide range of applications across industries:
Localize training videos for global teams. Create instructor-led content in dozens of languages from a single recording.
Generate personalized video messages at scale. One recording becomes thousands of tailored outreach videos.
Dub content for international audiences while keeping natural lip movements intact.
Post-production dubbing, character dialogue editing, and lip sync for animated or game characters.
See the full Use Cases page for detailed examples and links to implementation guides.
Frequently Asked Questions
What is the difference between AI lip sync and traditional dubbing?
Traditional dubbing records a new voice actor for each language and relies on directors to approximate mouth timing. AI lip sync replaces the lip movements in the original video frame by frame so they match any new audio track exactly. This means one source video can be dubbed into dozens of languages with perfectly synchronized mouth movements in minutes instead of weeks.
Does AI lip sync work with any language?
Yes. Sync’s models operate on audio waveforms, not text, so they work with any spoken language — including tonal languages like Mandarin and Thai. As long as the input audio is clear, the model generates matching lip movements regardless of the language pair.
How long does AI lip sync processing take?
Processing time depends on the model and video length. lipsync-1.9.0-beta processes at roughly 3× real-time speed, while lipsync-2 and lipsync-2-pro run at approximately 1× real-time. A 60-second video typically completes in 20–60 seconds depending on the model selected.
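The arithmetic behind those numbers: a model running at N× real-time processes a video in roughly duration / N seconds. A quick sketch using the speed multipliers above:

```python
# Back-of-the-envelope processing-time estimate from the speed multipliers
# quoted above: a model at Nx real-time takes about duration / N seconds.

SPEED = {
    "lipsync-1.9.0-beta": 3.0,  # ~3x real-time
    "lipsync-2":          1.0,  # ~1x real-time
    "lipsync-2-pro":      1.0,  # ~1x real-time
}

def estimated_seconds(video_seconds, model):
    return video_seconds / SPEED[model]

# A 60-second video: ~20s on the beta model, ~60s on lipsync-2.
print(estimated_seconds(60, "lipsync-1.9.0-beta"))  # 20.0
print(estimated_seconds(60, "lipsync-2"))           # 60.0
```

Actual times also include queueing and upload/download overhead, so treat these as lower bounds.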
Can AI lip sync handle multiple speakers in one video?
Yes. You can target a specific speaker using Sync’s speaker selection feature, which identifies individual faces in the frame. For videos with two or more speakers, submit separate generation requests per speaker or use segments to process each speaker’s portion independently.
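The per-speaker segmentation approach can be sketched like this. The timestamps, field names, and request structure are illustrative assumptions, not Sync's actual API schema:

```python
# Sketch of splitting a two-speaker video into per-speaker segments so each
# portion can be submitted as its own generation request. Field names are
# hypothetical, not Sync's request schema.

def build_requests(segments, video_url, audio_url):
    """One hypothetical request dict per (speaker, start, end) segment."""
    return [
        {"video": video_url, "audio": audio_url,
         "speaker": speaker, "start": start, "end": end}
        for speaker, start, end in segments
    ]

segments = [("speaker_a", 0.0, 4.5), ("speaker_b", 4.5, 9.0)]
requests = build_requests(segments, "input.mp4", "dub.wav")
print(len(requests))  # 2
```

Each request then processes only its speaker's portion of the timeline, and the outputs can be stitched back together in order.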
Getting Started
Ready to build with Sync’s lip sync API?

