Media Formats Support
Supported Media Formats
Video Formats
The Sync API accepts the following video file formats:
Audio Formats
Full Support
The Sync API fully supports the following audio file formats:
Limited Support
The following audio formats have partial support due to licensing, patent, or legal restrictions:
File Format Recommendation: While multiple formats are supported, we recommend using MP4 for video and WAV or MP3 for audio to ensure optimal compatibility and processing performance.
Output Quality
Video Processing Overview
The Sync video pipeline uses the H.264 codec for internal processing, and all videos are re-encoded. While we strive to preserve the input video’s quality, re-encoding may change properties such as the original codec, bitrate, and frame rate.
A Note on HDR Video: 10-bit color space (HDR) videos are not fully supported. HDR videos will be normalized to 8-bit color space (SDR), which may cause changes to the color grading in the output.
A Note on Alpha Transparency: Alpha channels are not preserved in the output. The Sync pipeline encodes with the H.264 codec and processes video in the RGB color space, neither of which carries alpha information. If your input video contains alpha transparency (such as WebP videos with transparency), the alpha channel will be removed and replaced with a solid background.
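As a rough pre-flight check for the two notes above, the sketch below inspects a pixel format name and flags inputs that are likely to lose color depth or transparency. It assumes you already have the name as a string in FFmpeg's naming style (for example, from a tool such as ffprobe); the function names and the prefix list are illustrative, are not part of the Sync API, and the name matching is a heuristic that does not cover every format.

```python
import re

# Assumption: pixel format names follow FFmpeg conventions, e.g. "yuv420p10le".
# These prefixes mark formats that carry an alpha plane. Not exhaustive.
ALPHA_PREFIXES = ("yuva", "gbrap", "rgba", "bgra", "argb", "abgr", "ya")


def bit_depth(pix_fmt: str) -> int:
    """Guess bits per component from names ending in 'p<depth>le' or 'p<depth>be'.

    Names without that suffix are assumed to be 8-bit, which is a heuristic
    and is wrong for some packed formats.
    """
    match = re.search(r"p(\d+)(?:le|be)$", pix_fmt)
    return int(match.group(1)) if match else 8


def has_alpha(pix_fmt: str) -> bool:
    """Return True if the name looks like a format with an alpha channel."""
    return pix_fmt.startswith(ALPHA_PREFIXES)


def preflight_notes(pix_fmt: str) -> list[str]:
    """Collect human-readable notes about what the output is likely to lose."""
    notes = []
    if bit_depth(pix_fmt) > 8:
        notes.append("more than 8 bits per component: expect normalization to 8-bit (SDR)")
    if has_alpha(pix_fmt):
        notes.append("alpha channel present: expect it to be replaced by a solid background")
    return notes


if __name__ == "__main__":
    for name in ("yuv420p", "yuv420p10le", "yuva444p10le"):
        print(name, preflight_notes(name))
```

A check like this only warns; it does not convert anything. Converting to 8-bit or flattening the alpha channel yourself before uploading gives you control over the resulting color grading and background.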
Recommended Input Properties
Video
Maximum Resolution Limit: Input videos above 4K (4096 x 2160 pixels) are not supported and will be rejected. If you need to process higher resolution content, downscale your video to 4K or below before uploading.
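The resolution limit can be checked locally before uploading. The following is a minimal sketch, assuming the limit applies to the longer and shorter sides independently of orientation (the text above only states 4096 x 2160, so the handling of portrait video is an assumption). The function names are hypothetical, and the rounding to even dimensions is a common encoder-friendly choice rather than a documented Sync requirement.

```python
MAX_LONG_SIDE = 4096   # from the stated 4K limit
MAX_SHORT_SIDE = 2160  # from the stated 4K limit


def exceeds_4k(width: int, height: int) -> bool:
    """True if the frame is larger than 4096 x 2160 in either orientation (assumed)."""
    long_side, short_side = max(width, height), min(width, height)
    return long_side > MAX_LONG_SIDE or short_side > MAX_SHORT_SIDE


def downscale_target(width: int, height: int) -> tuple[int, int]:
    """Return dimensions that fit within the limit while keeping the aspect ratio.

    Frames already within the limit are returned unchanged. Scaled dimensions
    are rounded down to even numbers.
    """
    if not exceeds_4k(width, height):
        return width, height
    long_side, short_side = max(width, height), min(width, height)
    scale = min(MAX_LONG_SIDE / long_side, MAX_SHORT_SIDE / short_side)
    return int(width * scale) // 2 * 2, int(height * scale) // 2 * 2


if __name__ == "__main__":
    print(downscale_target(7680, 4320))  # 8K landscape input
    print(downscale_target(1920, 1080))  # already within the limit
```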
Audio
For the best results, use a sampling rate of 44.1 kHz or 48 kHz. If you provide audio with a higher sampling rate, it will be downsampled to 48 kHz during lipsync, which can result in quality loss.
The Sync API supports audio with a bit depth of up to 32-bit float and up to 7.1 channels. Spatial audio formats are not supported.
If an input file contains multiple audio streams, only the first stream is processed. All other streams are discarded.
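The audio guidance above can be expressed as a small set of checks. This sketch assumes you have already read the sample rate, channel count, and number of audio streams from your file with a tool of your choice; the function name and warning strings are hypothetical, and treating 7.1 as eight channels is the usual interpretation rather than something the text states explicitly.

```python
RECOMMENDED_RATES_HZ = (44_100, 48_000)
MAX_CHANNELS = 8  # assumption: 7.1 audio counted as 8 discrete channels


def audio_warnings(sample_rate_hz: int, channels: int, stream_count: int = 1) -> list[str]:
    """Return warnings for audio properties that fall outside the recommendations."""
    warnings = []
    if sample_rate_hz > 48_000:
        warnings.append("sample rate above 48 kHz will be downsampled to 48 kHz")
    elif sample_rate_hz not in RECOMMENDED_RATES_HZ:
        warnings.append("sample rate is not one of the recommended 44.1 kHz or 48 kHz")
    if channels > MAX_CHANNELS:
        warnings.append("more than 7.1 channels is outside the supported range")
    if stream_count > 1:
        warnings.append("only the first audio stream is processed; others are discarded")
    return warnings


if __name__ == "__main__":
    # A 96 kHz stereo file that also carries a second audio stream.
    for warning in audio_warnings(96_000, channels=2, stream_count=2):
        print(warning)
```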
Input Video Codec Comparison
Processing speed is similar for all codecs because every input is transcoded to a standard format. However, some codecs experience greater quality loss during this process.
The following results are from our internal testing, where quality was measured using VMAF.
Frequently Asked Questions
What is the maximum file size for uploads?
Direct file uploads to the Sync API are limited to 20 MB. This applies when you use the create-with-files endpoint to upload video or audio files directly in the request body. If your file exceeds 20 MB, host it at a publicly accessible URL (such as an S3 bucket, CDN, or any web server) and pass the URL in the url field of your video or audio input instead. There is no file size limit when using URL-based inputs — Sync downloads the file from your URL during processing. For large batch workloads, the URL-based approach is recommended regardless of file size, as it avoids upload timeouts and is more reliable for production pipelines. Make sure your hosted files are publicly accessible without authentication headers, as Sync’s servers need to fetch the file directly.
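The choice between a direct upload and a URL-based input can be made from the file size alone. The sketch below is illustrative only: the function and constant names are hypothetical, and it treats 20 MB as 20,000,000 bytes, the more conservative reading, because the answer above does not say whether the limit is decimal or binary.

```python
import os

# Assumption: 20 MB interpreted as decimal megabytes (the stricter threshold).
DIRECT_UPLOAD_LIMIT_BYTES = 20 * 1000 * 1000


def choose_input_mode(size_bytes: int) -> str:
    """Return "direct" if the file fits the direct-upload limit, otherwise "url"."""
    if size_bytes < 0:
        raise ValueError("size_bytes must be non-negative")
    return "direct" if size_bytes <= DIRECT_UPLOAD_LIMIT_BYTES else "url"


def choose_input_mode_for_path(path: str) -> str:
    """Convenience wrapper that reads the size of a local file."""
    return choose_input_mode(os.path.getsize(path))


if __name__ == "__main__":
    print(choose_input_mode(5_000_000))   # small clip
    print(choose_input_mode(250_000_000)) # large file, host it and pass its URL
```

For batch workloads you may prefer to skip this check and always use hosted URLs, as recommended above.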
What is the maximum video duration?
The maximum video duration depends on your subscription plan. Free accounts can process videos up to 20 seconds long. Paid plans support significantly longer videos — the maximum duration increases with each tier, ranging from 1 minute on Hobbyist up to 30 minutes on Scale+ plans. Check the pricing page for your plan’s specific duration limit. Note that react-1 has a separate hard limit of 15 seconds regardless of your plan, as it is designed for short-form expressive content. For videos that exceed your plan’s maximum duration, consider splitting them into shorter segments using the Segments API and processing each segment separately. Longer videos also take proportionally longer to process — see our Generation Times guide for expected processing times by model and duration.
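To plan how a long video would be split, you can compute segment boundaries from the applicable duration limit. The sketch below is a hedged illustration: it only includes the limits named above (Free, Hobbyist, and Scale), the plan keys and function names are hypothetical, and it does not call the Segments API, whose request format is not described here. Check the pricing page for the limit on any other tier.

```python
REACT_1_LIMIT_S = 15.0  # hard limit for react-1, regardless of plan

# Only the tiers named in the text; keys are illustrative labels.
KNOWN_PLAN_LIMITS_S = {"free": 20.0, "hobbyist": 60.0, "scale": 1800.0}


def effective_limit(plan: str, model: str) -> float:
    """Return the duration limit in seconds for a plan and model combination."""
    limit = KNOWN_PLAN_LIMITS_S[plan]
    if model == "react-1":
        limit = min(limit, REACT_1_LIMIT_S)
    return limit


def split_segments(duration_s: float, limit_s: float) -> list[tuple[float, float]]:
    """Split [0, duration_s] into consecutive (start, end) spans no longer than limit_s."""
    if duration_s <= 0 or limit_s <= 0:
        raise ValueError("duration_s and limit_s must be positive")
    segments = []
    start = 0.0
    while start < duration_s:
        end = min(start + limit_s, duration_s)
        segments.append((start, end))
        start = end
    return segments


if __name__ == "__main__":
    limit = effective_limit("free", model="react-1")
    print(limit, split_segments(50.0, limit))
```

Cutting at fixed intervals can land in the middle of a word or shot, so in practice you may want to move each boundary to the nearest pause or scene change.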
Does Sync support vertical/portrait videos?
Yes, Sync processes videos in any aspect ratio including vertical (9:16), horizontal (16:9), square (1:1), and any custom aspect ratio. The lip sync pipeline extracts the face region from the video at 512x512 resolution for processing regardless of the input video’s dimensions or orientation. The processed face is then composited back into the original frame, preserving the original aspect ratio and resolution in the output. This means vertical videos from mobile phones, square videos for social media, and standard widescreen footage all work equally well. The output video maintains the same dimensions as your input. For the best face detection results, ensure the speaker’s face occupies a reasonable portion of the frame — in vertical videos where the face may be smaller relative to the full frame, make sure the face is still clearly visible and well-lit.
Related Resources
- Media Content Tips — best practices for preparing your video and audio content for optimal lip sync results
- Lipsync Model — learn about supported models and their input requirements
- Quickstart — get started with your first Sync generation

