Improving Lip Sync Quality
Lip sync quality depends on the quality of your input video, the clarity of your audio, the model you choose, and how you configure sync mode settings. Following these guidelines will help you get the most natural-looking results from every generation.
Video Input Best Practices
The quality and composition of your input video have the biggest impact on lip sync output. Follow these guidelines for the best results.
Frontal or near-frontal face angles produce the best lip sync results. Extreme profile (side-view) shots make face detection unreliable and can cause distorted output. Aim for the speaker to be facing the camera directly.
Well-lit faces with even lighting produce the cleanest results. Avoid harsh shadows across the face, backlighting that silhouettes the speaker, or flickering light sources that change frame to frame.
Keep the speaker’s mouth and lower face area clear of obstructions. Hands, microphones, hair, sunglasses, and other objects covering the mouth region degrade lip sync accuracy. If obstructions are unavoidable, enable the occlusion_detection_enabled option.
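As a sketch of how this option might be passed, the helper below assembles a generation request body with occlusion detection enabled. The payload layout and field nesting here are assumptions for illustration; only the option name `occlusion_detection_enabled` comes from this guide, so check the API reference for the exact schema.

```python
# Hypothetical request payload builder. The "options" nesting and input
# layout are assumptions for this sketch; occlusion_detection_enabled is
# the option named in this guide.

def build_generation_payload(video_url, audio_url, occlusion_detection=False):
    """Assemble a lip sync generation request body (illustrative shape)."""
    return {
        "model": "lipsync-2",
        "input": [
            {"type": "video", "url": video_url},
            {"type": "audio", "url": audio_url},
        ],
        "options": {
            # Enable when hands, microphones, or hair may cover the mouth.
            "occlusion_detection_enabled": occlusion_detection,
        },
    }

payload = build_generation_payload(
    "https://example.com/talk.mp4",
    "https://example.com/voice.wav",
    occlusion_detection=True,
)
```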
Shaky or jittery footage makes face tracking less reliable. Use a tripod or stabilized camera when possible. If using handheld footage, apply video stabilization in post-production before submitting to Sync.
For the best results, have a single speaker visible in the frame. If your video has multiple faces, use the Speaker Selection feature to target the correct person. Without speaker selection, the model may sync to the wrong face.
Use at least 480p resolution for reliable face detection. Higher resolutions up to 4K (4096x2160) are supported and can improve output quality. We recommend 1080p as the best balance of quality and processing speed. Use MP4 with the H.264 codec for optimal compatibility.
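A quick pre-flight check of your video's dimensions against these limits can save a failed generation. This is an illustrative helper based on the numbers above (480p minimum, 4096x2160 maximum), not part of any official SDK.

```python
# Pre-flight resolution check using the limits stated in this guide:
# at least 480p for reliable face detection, at most 4K (4096x2160).

MIN_HEIGHT = 480
MAX_WIDTH, MAX_HEIGHT = 4096, 2160

def check_resolution(width, height):
    """Return "ok" or a short hint when the video falls outside the limits."""
    if height < MIN_HEIGHT:
        return "too low: upscale or re-export at 480p or higher"
    if width > MAX_WIDTH or height > MAX_HEIGHT:
        return "too high: downscale to 4096x2160 or smaller"
    return "ok"
```

Recommended 1080p input (1920x1080) passes this check while keeping processing fast.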
Audio Input Best Practices
Clear, well-recorded audio is essential for accurate lip-to-speech alignment.
Background music, crowd noise, and overlapping conversations degrade lip sync accuracy. Isolate the speaker’s voice as much as possible. If your audio has background noise, use a noise reduction tool before submitting. For song lip sync, isolate and upload just the vocals track.
Each audio input should contain a single speaker. For multi-speaker scenarios, use the Segments API to assign different audio tracks to different time ranges, each with its own speaker.
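For a two-speaker conversation, the segment definitions might look like the sketch below. The field names (`start`, `end`, `audio_url`, `speaker`) are assumptions for this example; consult the Segments API reference for the real schema.

```python
# Hypothetical segment definitions assigning a separate audio track and
# speaker to each time range. Field names are assumptions for this sketch.

segments = [
    {"start": 0.0, "end": 12.5,
     "audio_url": "https://example.com/alice.wav",
     "speaker": {"mode": "auto_detect"}},
    {"start": 12.5, "end": 24.0,
     "audio_url": "https://example.com/bob.wav",
     "speaker": {"mode": "auto_detect"}},
]

# Segments should tile the timeline without overlapping.
for prev, cur in zip(segments, segments[1:]):
    assert cur["start"] >= prev["end"], "segments overlap"
```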
For the most predictable results, keep audio and video durations close to each other. When durations differ, use the sync_mode parameter to control how the mismatch is handled — cut_off trims audio to match video length, bounce loops the video to match audio, and remap adjusts video speed.
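One way to think about the choice is sketched below: pick a mode from the relative durations of the two inputs. The thresholds are assumptions made up for this example, not official recommendations; only the mode names and their behaviors come from this guide.

```python
# Illustrative sync_mode chooser. The 10% tolerance is an assumption for
# this sketch; the mode behaviors (cut_off trims audio, bounce loops video,
# remap adjusts video speed) are as described in this guide.

def choose_sync_mode(video_seconds, audio_seconds):
    """Suggest a sync_mode for a given audio/video duration mismatch."""
    ratio = audio_seconds / video_seconds
    if 0.9 <= ratio <= 1.1:
        return "cut_off"   # near-matched durations: trim audio to video
    if ratio > 1.1:
        return "bounce"    # audio much longer: loop the video
    return "remap"         # audio shorter: adjust video playback speed
```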
Recommended audio formats: WAV or MP3. All major audio formats are supported — see Media Formats Support for the full list. Sync supports any language for lip sync.
Model Selection for Quality
Each model has different strengths. Choose the right model based on your quality and speed priorities.
For most use cases, lipsync-2 provides the best balance of quality and speed. Use lipsync-2-pro when you need the highest possible quality, especially for content with beards, detailed teeth, or fine facial features.
Common Quality Issues and Fixes
Lip movements don't match the audio
Audio-video misalignment typically stems from one of three causes: sync mode configuration, audio quality, or a duration mismatch.
- Check your sync_mode setting. cut_off trims audio that extends beyond the video length and works well for most cases; if you are seeing drift, try cut_off first. When audio and video durations differ significantly, remap adjusts video playback speed to match, but large speed changes can look unnatural.
- Ensure your audio is clean. Background music, overlapping speakers, and heavy noise make it harder for the model to align lip movements to the correct speech patterns. Run a noise reduction tool on your audio before submitting.
- Verify the audio language is consistent. Sync supports all languages, but mixing languages within a single track can cause alignment issues.
Quality degrades in long videos
Long videos are internally divided into 30-40 second chunks for processing. If chunk boundaries fall at points where the face is partially visible, moving rapidly, or absent, the output quality can degrade at those transitions. For the best results with long-form content, consider splitting your video into segments under 2 minutes using the Segments API and processing each segment separately. Use lipsync-2-pro for long-form content where quality is critical — its diffusion-based super resolution handles transitions between chunks more gracefully. Also ensure the speaker’s face is consistently visible and well-lit throughout the entire video. Rapid scene changes within chunks are a common cause of processing timeouts and quality drops.
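The splitting step can be sketched as a small helper that breaks a timeline into segments no longer than two minutes, which you could then submit separately via the Segments API. This is an illustrative utility, not part of any official SDK.

```python
# Illustrative splitter for long-form content: cover [0, duration] with
# (start, end) ranges of at most 120 seconds, per the guidance above.

def split_into_segments(duration_seconds, max_len_seconds=120.0):
    """Return a list of (start, end) tuples covering the full duration."""
    segments, start = [], 0.0
    while start < duration_seconds:
        end = min(start + max_len_seconds, duration_seconds)
        segments.append((start, end))
        start = end
    return segments
```

A 5-minute video, for example, becomes three segments of 120, 120, and 60 seconds.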
Face looks distorted or has artifacts
Visual artifacts around the face region usually result from challenging input conditions. Start by ensuring the speaker’s face has good, even lighting without harsh shadows — uneven lighting causes the model to produce inconsistent skin tones across frames. Verify the face is front-facing or near-frontal, as extreme angles produce distortion. Remove any obstructions covering the mouth area, including hands, microphones, or hair. If obstructions are unavoidable, enable occlusion_detection_enabled in your generation request for better handling of partially hidden faces. For persistent artifacts, try lipsync-2-pro, which uses diffusion-based super resolution to produce cleaner output around teeth, beards, and fine facial features. Check that your input resolution is at least 480p; very low-resolution faces make detection and generation less reliable.
Multiple faces but wrong one is synced
When a video contains multiple faces, Sync’s default behavior selects the most prominent face in the frame. To target a specific person, use the Speaker Selection feature. Speaker selection lets you identify the correct face using automatic detection (auto_detect) or by providing a bounding box or frame number reference. In the API, set the active_speaker_detection option in your generation request. In Sync Studio, use the speaker selection tool in the video player controls. Note that speaker selection is available for lipsync models only — react-1 does not support this feature and requires a single visible speaker. For videos with multiple speakers taking turns, use the Segments API to define time ranges and assign speaker selection per segment.
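A sketch of the two targeting paths is shown below: automatic detection, or an explicit bounding box. The exact field nesting and the `bounding_box` key are assumptions for this example; `active_speaker_detection` and `auto_detect` are the names mentioned in this guide.

```python
# Hypothetical speaker selection options. The nesting and the bounding_box
# field shape are assumptions for this sketch; see the API reference for
# the real schema.

def speaker_selection_options(bounding_box=None):
    """Build generation options that target a specific face."""
    if bounding_box is None:
        # Let the model pick the active speaker automatically.
        return {"active_speaker_detection": {"mode": "auto_detect"}}
    # Target an explicit face region, e.g.
    # {"x": 120, "y": 80, "width": 200, "height": 240}.
    return {"active_speaker_detection": {"mode": "bounding_box",
                                         "bounding_box": bounding_box}}
```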
Output has a watermark
Watermarks appear on videos generated with free or Hobbyist accounts. To remove watermarks, upgrade to the Creator plan or higher — watermark removal is included on all Creator+ plans. Existing videos generated on a free or Hobbyist plan will retain their watermarks; you need to regenerate the video after upgrading to get unwatermarked output. See the Billing page for plan details, pricing, and upgrade instructions. Note that the watermark is applied during generation, not as a post-processing overlay, so it cannot be removed from already-generated videos without re-running the generation on a plan that includes watermark removal.
Support Knowledge Base
For additional lip sync troubleshooting, visit the Sync Support Knowledge Base:
- Why is my lip sync not working or showing no mouth movement? — Face detection and input issues
- Why is my lip sync not working properly? — Quality and mismatch troubleshooting
- No lipsync on AI-generated characters — AI avatar limitations
- Speaker selection for multi-person videos — Targeting the correct face

