The following formats have partial support due to licensing or legal restrictions:
For best compatibility, use MP4 for video and WAV or MP3 for audio.
All output is re-encoded to H.264 using libx264 (-crf 17 -preset slow), regardless of the input codec. Frames are processed in RGB color space internally during generation, so the original bitrate, frame rate, and color grading may differ in the output.
Don’t rely on bitrate for quality preservation: Outputs are re-encoded and bitrate may change.
HDR is not fully supported: HDR videos are normalized to SDR, which may affect color grading in the output.
Alpha channels are removed: The H.264/RGB pipeline does not support transparency. Alpha channels are replaced with a solid background.
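HDR and alpha inputs can be flagged before upload. The sketch below assumes you have already parsed the video stream entry from ffprobe's JSON output into a dict. The function name, the warning text, and the list of alpha pixel formats are illustrative and not part of the pipeline.

```python
# Stream dict source (one option):
#   ffprobe -v error -select_streams v:0 -show_streams -of json input.mov
# then take the first element of the "streams" array.

# Transfer characteristics that indicate HDR content (PQ and HLG).
HDR_TRANSFERS = {"smpte2084", "arib-std-b67"}

# Pixel formats that carry an alpha plane. Illustrative, not exhaustive.
ALPHA_PIX_FMTS = {
    "rgba", "bgra", "argb", "abgr", "gbrap",
    "yuva420p", "yuva422p", "yuva444p", "yuva444p10le",
}


def preflight_warnings(stream: dict) -> list[str]:
    """Return warnings for properties the pipeline will not preserve."""
    warnings = []
    if stream.get("color_transfer") in HDR_TRANSFERS:
        warnings.append("HDR input: will be normalized to SDR")
    if stream.get("pix_fmt") in ALPHA_PIX_FMTS:
        warnings.append("alpha channel: will be replaced with a solid background")
    return warnings
```

An empty list means neither condition was detected. It does not guarantee that colors will match the source.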
For color-sensitive H.264 workflows - especially when compositing generated output back onto source footage - use explicit SDR color metadata and 4:4:4 chroma sampling.
Set color_space, color_transfer, color_range, and color_primaries: Untagged or partially tagged files can be interpreted differently across decoders. The pipeline uses ffmpeg 7.1 for color metadata detection.
Use yuv444p when color accuracy matters: The pipeline operates in RGB. 4:2:0 and 4:2:2 inputs require chroma upsampling during YUV→RGB conversion, which can cause color shifts or compositing seams. 4:4:4 preserves full chroma resolution through that conversion.
Example FFmpeg command for a tagged SDR BT.709 H.264 export:
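A minimal sketch of such an export, built as an argument list in Python so it can be printed or passed to subprocess.run. The file names and the function name are placeholders, and the -crf 17 -preset slow values are borrowed from the pipeline's own re-encode settings as an assumption about what suits a source export.

```python
import shlex


def sdr_bt709_export_args(src: str, dst: str) -> list[str]:
    """Build an FFmpeg command for a tagged SDR BT.709 H.264 4:4:4 export."""
    return [
        "ffmpeg", "-i", src,
        "-c:v", "libx264",
        "-crf", "17", "-preset", "slow",  # assumed quality settings
        "-pix_fmt", "yuv444p",            # full chroma resolution
        "-colorspace", "bt709",           # matrix coefficients tag
        "-color_trc", "bt709",            # transfer characteristics tag
        "-color_primaries", "bt709",      # primaries tag
        "-color_range", "tv",             # limited (video) range tag
        dst,
    ]


# Print a shell-ready command line for inspection.
print(shlex.join(sdr_bt709_export_args("input.mov", "output.mp4")))
```

These flags tag the output and do not convert colors, so the sketch assumes the source is already SDR BT.709. Also note that 4:4:4 H.264 uses the High 4:4:4 Predictive profile, which some players and hardware decoders cannot play.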
4K maximum: Videos above 4096×2160 are rejected. Downscale to 4K or below before uploading.
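To pick downscale dimensions that stay within the limit, something like the following can help. It assumes the limit is applied per axis (width at most 4096, height at most 2160), which the text above does not spell out for vertical video. The function name is hypothetical.

```python
MAX_WIDTH, MAX_HEIGHT = 4096, 2160  # assumed to apply per axis


def fit_within_4k(width: int, height: int) -> tuple[int, int]:
    """Scale dimensions down to fit the limit, preserving aspect ratio."""
    if width <= MAX_WIDTH and height <= MAX_HEIGHT:
        return width, height  # already within the limit

    # Integer comparison of aspect ratios avoids floating-point rounding.
    if width * MAX_HEIGHT >= height * MAX_WIDTH:
        # Width is the binding constraint.
        new_w = MAX_WIDTH
        new_h = height * MAX_WIDTH // width
    else:
        # Height is the binding constraint.
        new_h = MAX_HEIGHT
        new_w = width * MAX_HEIGHT // height

    # Round down to even values, which 4:2:0 encoders such as
    # libx264 with yuv420p require.
    return max(2, new_w // 2 * 2), max(2, new_h // 2 * 2)
```

For example, an 8K UHD source (7680×4320) maps to 3840×2160, which can then be used as the target size in your downscale step.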
All input codecs are transcoded to a standard format, so processing speed is consistent. Quality loss varies by codec, measured using VMAF:
Direct file uploads are limited to 20 MB. If your file exceeds this limit:
Host the file and pass its address in the url field of your video or audio input instead of uploading directly.
There is no file size limit for URL-based inputs - the file is downloaded from your URL during processing. For production pipelines, URL-based inputs are recommended regardless of file size, as they avoid upload timeouts and are more reliable.
Hosted files must be publicly accessible without authentication headers.
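The choice between a direct upload and a URL-based input can be sketched as below. Only the url field name comes from the text above. The function names and the shape of the input object are hypothetical, and the threshold assumes "20 MB" means decimal megabytes, which is the more conservative reading.

```python
# Assumed: 20 MB as decimal megabytes (the stricter interpretation).
DIRECT_UPLOAD_LIMIT_BYTES = 20 * 1000 * 1000


def input_method(size_bytes: int) -> str:
    """Return 'upload' for files within the direct limit, else 'url'."""
    return "upload" if size_bytes <= DIRECT_UPLOAD_LIMIT_BYTES else "url"


def url_input(file_url: str) -> dict:
    """Build an input object that points at a hosted file via its url field."""
    if not file_url.startswith(("http://", "https://")):
        raise ValueError("expected a publicly reachable http(s) URL")
    return {"url": file_url}
```

The URL check only validates the scheme. It cannot confirm that the file is reachable without authentication headers, so test the link from an unauthenticated client before submitting.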
Maximum duration depends on your plan:
Check the pricing page for your plan’s specific limit.
Yes - any aspect ratio is supported, including vertical (9:16), horizontal (16:9), square (1:1), and custom dimensions.
The pipeline extracts the face region at 512×512 for processing, then composites it back into the original frame. The output always matches the input dimensions and orientation.
For best face detection in vertical videos, ensure the speaker’s face is clearly visible and well-lit.