Improving Lip Sync Quality

Lip sync quality depends on the quality of your input video, the clarity of your audio, the model you choose, and how you configure sync mode settings. Following these guidelines will help you get the most natural-looking results from every generation.

Video Input Best Practices

The quality and composition of your input video have the biggest impact on lip sync output. Follow these guidelines for the best results.

Use a front-facing camera angle

Frontal or near-frontal face angles produce the best lip sync results. Extreme profile (side-view) shots make face detection unreliable and can cause distorted output. Aim for the speaker to be facing the camera directly.

Ensure good, even lighting

Well-lit faces with even lighting produce the cleanest results. Avoid harsh shadows across the face, backlighting that silhouettes the speaker, or flickering light sources that change frame to frame.

Minimize face obstructions

Keep the speaker’s mouth and lower face area clear of obstructions. Hands, microphones, hair, sunglasses, and other objects covering the mouth region degrade lip sync accuracy. If obstructions are unavoidable, enable the occlusion_detection_enabled option.
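As a minimal sketch, enabling occlusion handling could be wrapped in a small helper. The option name occlusion_detection_enabled comes from the guidance above; where it sits in the request body is an assumption here, so check the API reference for the exact schema.

```python
def build_generation_options(face_may_be_occluded: bool) -> dict:
    """Assemble generation options; enable occlusion handling only when the
    mouth region may be partially covered (hands, microphone, hair)."""
    options = {}
    if face_may_be_occluded:
        # Option name from the docs above; its placement in the request
        # body is an assumption for illustration.
        options["occlusion_detection_enabled"] = True
    return options
```

Keeping the flag off for clean footage avoids paying for occlusion handling you do not need.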

Use stable footage

Shaky or jittery footage makes face tracking less reliable. Use a tripod or stabilized camera when possible. If using handheld footage, apply video stabilization in post-production before submitting to Sync.

Keep one face in frame

For the best results, have a single speaker visible in the frame. If your video has multiple faces, use the Speaker Selection feature to target the correct person. Without speaker selection, the model may sync to the wrong face.

Meet resolution requirements

Use at least 480p resolution for reliable face detection. Higher resolutions up to 4K (4096x2160) are supported and can improve output quality. We recommend 1080p as the best balance of quality and processing speed. Use MP4 with H.264 codec for optimal compatibility.
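A quick pre-flight check can reject footage outside the supported range before upload. This is an illustrative helper, not an official validator; it treats "480p" as a shorter side of at least 480 px (an assumption that also covers portrait footage) and uses the 4096x2160 ceiling stated above.

```python
def resolution_ok(width: int, height: int) -> bool:
    """Return True if the frame size falls inside the supported range
    (at least 480p, at most 4K / 4096x2160)."""
    # Sorting makes the check orientation-agnostic (an assumption for
    # portrait videos; the docs above describe landscape dimensions).
    short, long = sorted((width, height))
    return short >= 480 and long <= 4096 and short <= 2160
```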

Audio Input Best Practices

Clear, well-recorded audio is essential for accurate lip-to-speech alignment.

Use clean speech without background noise

Background music, crowd noise, and overlapping conversations degrade lip sync accuracy. Isolate the speaker’s voice as much as possible. If your audio has background noise, use a noise reduction tool before submitting. For song lip sync, isolate and upload just the vocals track.

One speaker per audio track

Each audio input should contain a single speaker. For multi-speaker scenarios, use the Segments API to assign different audio tracks to different time ranges, each with its own speaker.
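The per-speaker assignment above can be sketched as a list of time-ranged segments, each carrying its own single-speaker track. The field names below (start, end, audio_url) are assumptions for illustration; consult the Segments API reference for the real schema.

```python
def build_segments(turns):
    """turns: iterable of (start_sec, end_sec, audio_url) tuples, one
    clean single-speaker audio track per time range."""
    segments = []
    for start, end, audio_url in turns:
        if end <= start:
            raise ValueError("segment end must come after its start")
        segments.append({"start": start, "end": end, "audio_url": audio_url})
    return segments
```

For a two-person interview, you would pass one tuple per speaking turn, alternating between the two speakers' isolated tracks.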

Match audio and video duration

For the most predictable results, keep audio and video durations close to each other. When durations differ, use the sync_mode parameter to control how the mismatch is handled — cut_off trims audio to match video length, bounce loops the video to match audio, and remap adjusts video speed.
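One way to act on this is a small pre-flight heuristic that picks a sync_mode from the two durations. The mode names are from the documentation above; the tolerance threshold and the decision rules are illustrative assumptions, not API behavior.

```python
def choose_sync_mode(video_sec: float, audio_sec: float,
                     tolerance: float = 0.5) -> str:
    """Suggest a sync_mode for a given duration mismatch.
    Thresholds here are illustrative assumptions, not API rules."""
    if abs(video_sec - audio_sec) <= tolerance:
        return "cut_off"   # near-identical durations: just trim the tail
    if audio_sec > video_sec:
        return "bounce"    # loop the video to cover the longer audio
    return "remap"         # retime the video to fit the shorter audio
```

Large mismatches handled with remap can look unnatural, so for big gaps it is usually better to trim the inputs to similar lengths first.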

Recommended audio formats: WAV or MP3. All major audio formats are supported — see Media Formats Support for the full list. Sync supports any language for lip sync.

Model Selection for Quality

Each model has different strengths. Choose the right model based on your quality and speed priorities.

| Priority | Recommended Model | Why |
| --- | --- | --- |
| Best quality | lipsync-2-pro | Enhanced detail for beards, teeth, and facial features using diffusion-based super resolution ($0.067-$0.083/sec) |
| Best balance | lipsync-2 | Natural lip movements that preserve the speaker’s unique speaking style ($0.04-$0.05/sec) |
| Fastest | lipsync-1.9.0-beta | Good for simple videos and high-volume workloads, fastest processing ($0.02-$0.025/sec) |
| Expressions | react-1 | Adds emotion, facial expressions, and head movement to match audio tone (max 15s clips) |

For most use cases, lipsync-2 provides the best balance of quality and speed. Use lipsync-2-pro when you need the highest possible quality, especially for content with beards, detailed teeth, or fine facial features.

Common Quality Issues and Fixes

Audio-video misalignment typically stems from one of three causes: sync mode configuration, audio quality, or duration mismatch. First, check your sync_mode setting — cut_off mode trims audio that extends beyond the video length, which works well for most cases. If audio and video durations are significantly different, the remap mode adjusts video playback speed to match, but large speed changes can look unnatural. Try cut_off if you are seeing drift. Second, ensure your audio is clean — background music, overlapping speakers, and heavy noise make it harder for the model to align lip movements to the correct speech patterns. Use a noise reduction tool on your audio before submitting. Third, check for mixed languages: Sync supports all languages, but audio that switches between languages within a single track can cause alignment issues.

Long videos are internally divided into 30-40 second chunks for processing. If chunk boundaries fall at points where the face is partially visible, moving rapidly, or absent, the output quality can degrade at those transitions. For the best results with long-form content, consider splitting your video into segments under 2 minutes using the Segments API and processing each segment separately. Use lipsync-2-pro for long-form content where quality is critical — its diffusion-based super resolution handles transitions between chunks more gracefully. Also ensure the speaker’s face is consistently visible and well-lit throughout the entire video. Rapid scene changes within chunks are a common cause of processing timeouts and quality drops.
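To follow the under-2-minutes guideline, you can compute segment boundaries up front and submit each range separately. This is a sketch that splits a duration into equal segments no longer than a chosen maximum; for best results you would still nudge boundaries away from scene changes or moments when the face is off-screen.

```python
import math

def plan_segments(total_sec: float, max_len: float = 120.0):
    """Split a long video into equal-length segments no longer than
    max_len seconds (default: the under-2-minute guideline).
    Returns a list of (start, end) pairs covering the full duration."""
    n = max(1, math.ceil(total_sec / max_len))
    step = total_sec / n
    return [(round(i * step, 3), round(min((i + 1) * step, total_sec), 3))
            for i in range(n)]
```

For example, a 5-minute (300 s) video splits into three 100-second segments rather than two 120-second segments plus a short remainder.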

Visual artifacts around the face region usually result from challenging input conditions. Start by ensuring the speaker’s face has good, even lighting without harsh shadows — uneven lighting causes the model to produce inconsistent skin tones across frames. Verify the face is front-facing or near-frontal, as extreme angles produce distortion. Remove any obstructions covering the mouth area, including hands, microphones, or hair. If obstructions are unavoidable, enable occlusion_detection_enabled in your generation request for better handling of partially hidden faces. For persistent artifacts, try lipsync-2-pro, which uses diffusion-based super resolution to produce cleaner output around teeth, beards, and fine facial features. Check that your input resolution is at least 480p; very low-resolution faces make detection and generation less reliable.

When a video contains multiple faces, Sync’s default behavior selects the most prominent face in the frame. To target a specific person, use the Speaker Selection feature. Speaker selection lets you identify the correct face using automatic detection (auto_detect) or by providing a bounding box or frame number reference. In the API, set the active_speaker_detection option in your generation request. In Sync Studio, use the speaker selection tool in the video player controls. Note that speaker selection is available for lipsync models only — react-1 does not support this feature and requires a single visible speaker. For videos with multiple speakers taking turns, use the Segments API to define time ranges and assign speaker selection per segment.
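The three targeting methods mentioned above can be wrapped in a helper that falls back to automatic detection when no explicit reference is given. The nesting under active_speaker_detection and the exact field names are assumptions for illustration; verify them against the API reference.

```python
def build_speaker_selection(bounding_box=None, frame_number=None) -> dict:
    """Build a speaker-selection option for a multi-face video.
    With no explicit reference, fall back to automatic detection.
    The payload shape below is an assumption, not the official schema."""
    if bounding_box is not None:
        # e.g. bounding_box = {"x": 120, "y": 80, "width": 200, "height": 240}
        return {"active_speaker_detection": {"bounding_box": bounding_box}}
    if frame_number is not None:
        return {"active_speaker_detection": {"frame_number": frame_number}}
    return {"active_speaker_detection": {"auto_detect": True}}
```

Remember that this applies to lipsync models only; react-1 does not support speaker selection and needs a single visible speaker.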

Watermarks appear on videos generated with free or Hobbyist accounts. To remove watermarks, upgrade to the Creator plan or higher — watermark removal is included on all Creator+ plans. Existing videos generated on a free or Hobbyist plan will retain their watermarks; you need to regenerate the video after upgrading to get unwatermarked output. See the Billing page for plan details, pricing, and upgrade instructions. Note that the watermark is applied during generation, not as a post-processing overlay, so it cannot be removed from already-generated videos without re-running the generation on a plan that includes watermark removal.

Support Knowledge Base

For additional lip sync troubleshooting, visit the Sync Support Knowledge Base: