lipsync 2.0
lipsync-2 is the world's first zero-shot lip sync model that preserves a speaker's unique style without any training or fine-tuning, across live-action, animation, and AI-generated video.
Quick overview
- lipsync-2, the most natural video-to-video lip sync model in the world
- Zero-shot. No actor, clone, or avatar to train before you can use it.
- Learns each speaker’s unique style and generates with it
- Works on live-action, animation, and AI-generated humans
- Use it for video translation, word-level dialogue editing, and character re-animation (including realistic AI UGC)
A whole new model
lipsync-2 is the first zero-shot lip sync model that preserves how a specific person speaks, without any extra training or fine-tuning. The model watches the input, builds a style representation of the speaker on the fly, and uses it for every frame it generates.
It’s a step forward across the things that actually matter: realism, expressiveness, control, quality, and speed. Live-action, animation, AI-generated humans: the same model handles all of it.
Features
Zero-shot style preservation. The model picks up speaker style from the input alone, with no separate training pass. Watch it hold Nicolas Cage’s mannerisms across languages; no other zero-shot model does this.
Temperature control. Dial how expressive the lipsync gets, from subtle to extreme.
Active speaker detection. For long videos with multiple people, we built ASD-1, a state-of-the-art active speaker detection pipeline that ties each voice to the right face and only applies lipsync when that person is actually speaking.
Animation that holds up. Pixar-grade animation, AI-generated characters, anything in between. Translation is one use case; the bigger one is editing dialogue freely in post and rethinking what video production looks like.
Record once, edit forever. The take no longer has to be final the moment you hit stop: lipsync-2 lets you rewrite a line later while keeping the original speaker’s style intact, with no pre-training.
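The features above map naturally onto a job-style API: you point the model at a video and new audio, and optionally dial expressiveness. A minimal sketch in Python, assuming a hypothetical request schema (the field names, the `temperature` range, and the payload shape are illustrative assumptions, not the documented sync. API):

```python
# Hedged sketch: assembling a lip sync job request with an expressiveness dial.
# Field names and the 0-1 temperature range are assumptions based on the
# features described above, not a confirmed API reference.
import json


def build_lipsync_request(video_url: str, audio_url: str,
                          temperature: float = 0.5) -> dict:
    """Assemble a request body for a lipsync-2 generation job (hypothetical schema)."""
    if not 0.0 <= temperature <= 1.0:
        raise ValueError("temperature should be in [0, 1]: 0 = subtle, 1 = extreme")
    return {
        "model": "lipsync-2",
        "input": [
            {"type": "video", "url": video_url},   # original take
            {"type": "audio", "url": audio_url},   # rewritten or translated line
        ],
        "options": {"temperature": temperature},   # how expressive the lipsync gets
    }


payload = build_lipsync_request(
    "https://example.com/take.mp4",
    "https://example.com/new_line.wav",
    temperature=0.8,
)
print(json.dumps(payload, indent=2))
```

The point of the sketch is the workflow, not the schema: the original take and the replacement audio are the only inputs, with no per-speaker training step in between.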
AI Video
When you can generate any video by typing a few sentences, the camera stops being a constraint.
We see AI lip sync as the first surface, not the final one.
This is a strange moment to be making things. A high schooler can shoot a masterpiece on an iPhone. A studio can deliver a film at a tenth of the cost and ten times faster than five years ago. A single video can land in every language on the same day. The goal at sync. is to make video as malleable as text.
Additional Resources