Improving Lip Sync Quality

Lip sync quality depends on the quality of your input video, the clarity of your audio, the model you choose, and how you configure sync mode settings. Following these guidelines will help you get the most natural-looking results from every generation.

Video Input Best Practices

The quality and composition of your input video have the biggest impact on lip sync output. Follow these guidelines for the best results.

Use a front-facing camera angle

Frontal or near-frontal face angles produce the best lip sync results. Extreme profile (side-view) shots make face detection unreliable and can cause distorted output. Aim for the speaker to be facing the camera directly.

Ensure good, even lighting

Well-lit faces with even lighting produce the cleanest results. Avoid harsh shadows across the face, backlighting that silhouettes the speaker, or flickering light sources that change frame to frame.

Minimize face obstructions

Keep the speaker’s mouth and lower face area clear of obstructions. Hands, microphones, hair, sunglasses, and other objects covering the mouth region degrade lip sync accuracy. If obstructions are unavoidable, enable the occlusion_detection_enabled option.
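As a minimal sketch, enabling occlusion handling could be wrapped in a small helper. The option name occlusion_detection_enabled comes from the guidance above; where it sits in the request body is an assumption here, so check the API reference for the exact schema.

```python
def build_generation_options(face_may_be_occluded: bool) -> dict:
    """Assemble generation options; enable occlusion handling only when the
    mouth region may be partially covered (hands, microphone, hair)."""
    options = {}
    if face_may_be_occluded:
        # Option name from the docs above; its placement in the request
        # body is an assumption for illustration.
        options["occlusion_detection_enabled"] = True
    return options
```

Keeping the flag off for clean footage avoids paying for occlusion handling you do not need.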

Use stable footage

Shaky or jittery footage makes face tracking less reliable. Use a tripod or stabilized camera when possible. If using handheld footage, apply video stabilization in post-production before submitting to Sync.

Keep one face in frame

For the best results, have a single speaker visible in the frame. If your video has multiple faces, use the Speaker Selection feature to target the correct person. Without speaker selection, the model may sync to the wrong face.

Meet resolution requirements

Use at least 480p resolution for reliable face detection. Higher resolutions up to 4K (4096x2160) are supported and can improve output quality. We recommend 1080p as the best balance of quality and processing speed. Use MP4 with H.264 codec for optimal compatibility.
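A quick pre-flight check can reject footage outside the supported range before upload. This is an illustrative helper, not an official validator; it treats "480p" as a shorter side of at least 480 px (an assumption that also covers portrait footage) and uses the 4096x2160 ceiling stated above.

```python
def resolution_ok(width: int, height: int) -> bool:
    """Return True if the frame size falls inside the supported range
    (at least 480p, at most 4K / 4096x2160)."""
    # Sorting makes the check orientation-agnostic (an assumption for
    # portrait videos; the docs above describe landscape dimensions).
    short, long = sorted((width, height))
    return short >= 480 and long <= 4096 and short <= 2160
```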

Audio Input Best Practices

Clear, well-recorded audio is essential for accurate lip-to-speech alignment.

Use clean speech without background noise

Background music, crowd noise, and overlapping conversations degrade lip sync accuracy. Isolate the speaker’s voice as much as possible. If your audio has background noise, use a noise reduction tool before submitting. For song lip sync, isolate and upload just the vocals track.

One speaker per audio track

Each audio input should contain a single speaker. For multi-speaker scenarios, use the Segments API to assign different audio tracks to different time ranges, each with its own speaker.
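The per-speaker assignment above can be sketched as a list of time-ranged segments, each carrying its own single-speaker track. The field names below (start, end, audio_url) are assumptions for illustration; consult the Segments API reference for the real schema.

```python
def build_segments(turns):
    """turns: iterable of (start_sec, end_sec, audio_url) tuples, one
    clean single-speaker audio track per time range."""
    segments = []
    for start, end, audio_url in turns:
        if end <= start:
            raise ValueError("segment end must come after its start")
        segments.append({"start": start, "end": end, "audio_url": audio_url})
    return segments
```

For a two-person interview, you would pass one tuple per speaking turn, alternating between the two speakers' isolated tracks.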

Match audio and video duration

For the most predictable results, keep audio and video durations close to each other. When durations differ, use the sync_mode parameter to control how the mismatch is handled — cut_off trims audio to match video length, bounce loops the video to match audio, and remap adjusts video speed.
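One way to act on this is a small pre-flight heuristic that picks a sync_mode from the two durations. The mode names are from the documentation above; the tolerance threshold and the decision rules are illustrative assumptions, not API behavior.

```python
def choose_sync_mode(video_sec: float, audio_sec: float,
                     tolerance: float = 0.5) -> str:
    """Suggest a sync_mode for a given duration mismatch.
    Thresholds here are illustrative assumptions, not API rules."""
    if abs(video_sec - audio_sec) <= tolerance:
        return "cut_off"   # near-identical durations: just trim the tail
    if audio_sec > video_sec:
        return "bounce"    # loop the video to cover the longer audio
    return "remap"         # retime the video to fit the shorter audio
```

Large mismatches handled with remap can look unnatural, so for big gaps it is usually better to trim the inputs to similar lengths first.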

Recommended audio formats: WAV or MP3. All major audio formats are supported — see Media Formats Support for the full list. Sync supports any language for lip sync.

Model Selection for Quality

Each model has different strengths. Choose the right model based on your quality and speed priorities.

| Priority | Recommended Model | Why |
| --- | --- | --- |
| Best quality | lipsync-2-pro | Enhanced detail for beards, teeth, and facial features using diffusion-based super resolution ($0.067-$0.083/sec) |
| Best balance | lipsync-2 | Natural lip movements that preserve the speaker’s unique speaking style ($0.04-$0.05/sec) |
| Fastest | lipsync-1.9.0-beta | Good for simple videos and high-volume workloads, fastest processing ($0.02-$0.025/sec) |
| Expressions | react-1 | Adds emotion, facial expressions, and head movement to match audio tone (max 15s clips) |

For most use cases, lipsync-2 provides the best balance of quality and speed. Use lipsync-2-pro when you need the highest possible quality, especially for content with beards, detailed teeth, or fine facial features.

Common Quality Issues and Fixes

Audio-video misalignment typically stems from one of three causes: sync mode configuration, audio quality, or duration mismatch. First, check your sync_mode setting — cut_off mode trims audio that extends beyond the video length, which works well for most cases. If audio and video durations are significantly different, the remap mode adjusts video playback speed to match, but large speed changes can look unnatural. Try cut_off if you are seeing drift. Second, ensure your audio is clean — background music, overlapping speakers, and heavy noise make it harder for the model to align lip movements to the correct speech patterns. Use a noise reduction tool on your audio before submitting. Third, check for mixed languages: Sync supports all languages, but audio that switches between languages within a single track can cause alignment issues.

Long videos are internally divided into 30-40 second chunks for processing. If chunk boundaries fall at points where the face is partially visible, moving rapidly, or absent, the output quality can degrade at those transitions. For the best results with long-form content, consider splitting your video into segments under 2 minutes using the Segments API and processing each segment separately. Use lipsync-2-pro for long-form content where quality is critical — its diffusion-based super resolution handles transitions between chunks more gracefully. Also ensure the speaker’s face is consistently visible and well-lit throughout the entire video. Rapid scene changes within chunks are a common cause of processing timeouts and quality drops.
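To follow the under-2-minutes guideline, you can compute segment boundaries up front and submit each range separately. This is a sketch that splits a duration into equal segments no longer than a chosen maximum; for best results you would still nudge boundaries away from scene changes or moments when the face is off-screen.

```python
import math

def plan_segments(total_sec: float, max_len: float = 120.0):
    """Split a long video into equal-length segments no longer than
    max_len seconds (default: the under-2-minute guideline).
    Returns a list of (start, end) pairs covering the full duration."""
    n = max(1, math.ceil(total_sec / max_len))
    step = total_sec / n
    return [(round(i * step, 3), round(min((i + 1) * step, total_sec), 3))
            for i in range(n)]
```

For example, a 5-minute (300 s) video splits into three 100-second segments rather than two 120-second segments plus a short remainder.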

Visual artifacts around the face region usually result from challenging input conditions. Start by ensuring the speaker’s face has good, even lighting without harsh shadows — uneven lighting causes the model to produce inconsistent skin tones across frames. Verify the face is front-facing or near-frontal, as extreme angles produce distortion. Remove any obstructions covering the mouth area, including hands, microphones, or hair. If obstructions are unavoidable, enable occlusion_detection_enabled in your generation request for better handling of partially hidden faces. For persistent artifacts, try lipsync-2-pro, which uses diffusion-based super resolution to produce cleaner output around teeth, beards, and fine facial features. Check that your input resolution is at least 480p; very low-resolution faces make detection and generation less reliable.

When a video contains multiple faces, Sync’s default behavior selects the most prominent face in the frame. To target a specific person, use the Speaker Selection feature. Speaker selection lets you identify the correct face using automatic detection (auto_detect) or by providing a bounding box or frame number reference. In the API, set the active_speaker_detection option in your generation request. In Sync Studio, use the speaker selection tool in the video player controls. Note that speaker selection is available for lipsync models only — react-1 does not support this feature and requires a single visible speaker. For videos with multiple speakers taking turns, use the Segments API to define time ranges and assign speaker selection per segment.
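The three targeting methods mentioned above can be wrapped in a helper that falls back to automatic detection when no explicit reference is given. The nesting under active_speaker_detection and the exact field names are assumptions for illustration; verify them against the API reference.

```python
def build_speaker_selection(bounding_box=None, frame_number=None) -> dict:
    """Build a speaker-selection option for a multi-face video.
    With no explicit reference, fall back to automatic detection.
    The payload shape below is an assumption, not the official schema."""
    if bounding_box is not None:
        # e.g. bounding_box = {"x": 120, "y": 80, "width": 200, "height": 240}
        return {"active_speaker_detection": {"bounding_box": bounding_box}}
    if frame_number is not None:
        return {"active_speaker_detection": {"frame_number": frame_number}}
    return {"active_speaker_detection": {"auto_detect": True}}
```

Remember that this applies to lipsync models only; react-1 does not support speaker selection and needs a single visible speaker.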

Watermarks appear on videos generated with free or Hobbyist accounts. To remove watermarks, upgrade to the Creator plan or higher — watermark removal is included on all Creator+ plans. Existing videos generated on a free or Hobbyist plan will retain their watermarks; you need to regenerate the video after upgrading to get unwatermarked output. See the Billing page for plan details, pricing, and upgrade instructions. Note that the watermark is applied during generation, not as a post-processing overlay, so it cannot be removed from already-generated videos without re-running the generation on a plan that includes watermark removal.

Support Knowledge Base

For additional lip sync troubleshooting, visit the Sync Support Knowledge Base: