lipsync 2.0

lipsync-2 is the world's first zero-shot lip sync model that preserves a speaker's unique style without any training or fine-tuning, across live-action, animation, and AI-generated video.

prady · 2 min read

Quick overview

  • lipsync-2, the most natural video-to-video lip sync model in the world
  • Zero-shot. No actor, clone, or avatar to train before you can use it.
  • Learns each speaker’s unique style and generates with it
  • Works on live-action, animation, and AI-generated humans
  • Use it for video translation, word-level dialogue editing, and character re-animation (including realistic AI UGC)

A whole new model

lipsync-2 is the first zero-shot lipsyncing model that preserves how a specific person speaks, without any extra training or fine-tuning. The model watches the input, builds a style representation of the speaker on the fly, and uses it for every frame it generates.

It’s a step forward across the things that actually matter: realism, expressiveness, control, quality, and speed. Live-action, animation, AI-generated humans: the same model handles all of it.

Features

Zero-shot style preservation. The model picks up speaker style from the input alone, with no separate training pass. Watch it hold Nicolas Cage’s mannerisms across languages; no other zero-shot model does this.

Temperature control. Dial how expressive the lipsync gets, from subtle to extreme.
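One way to picture a temperature knob like this is as a blend weight between a neutral mouth pose and a fully expressive one. A minimal sketch, assuming poses are keypoint coordinates; the function and representation are illustrative, not sync.'s actual API:

```python
def apply_temperature(neutral_pose, expressive_pose, temperature):
    """Blend between a neutral and a fully expressive mouth pose.

    temperature=0.0 -> subtle (stays at neutral), 1.0 -> fully
    expressive, and values above 1.0 exaggerate the motion.
    Poses here are flat lists of keypoint coordinates; this linear
    blending scheme is a hypothetical illustration only.
    """
    return [n + temperature * (e - n)
            for n, e in zip(neutral_pose, expressive_pose)]

# temperature 0.5 lands halfway between the two poses
print(apply_temperature([0.0, 1.0], [2.0, 3.0], 0.5))  # → [1.0, 2.0]
```

The same scalar then controls every generated frame, which is why a single dial can move the whole performance from subtle to extreme.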

Active speaker detection. For long videos with multiple people, we built ASD-1, a state-of-the-art active speaker detection pipeline that ties each voice to the right face and only applies lipsync when that person is actually speaking.
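Conceptually, a pipeline like this maps each frame to the face that is speaking at that moment and applies lipsync only there. A toy sketch, assuming per-face speech intervals have already been detected; names and data shapes are illustrative, not the ASD-1 implementation:

```python
def frames_to_sync(frame_times, speech_segments):
    """Assign each frame to the face track that is actively speaking.

    speech_segments: {track_id: [(start, end), ...]} — per-face speech
    intervals in seconds, as an active-speaker detector might emit.
    Returns {frame_index: track_id}; frames where nobody is speaking
    get no entry and would be left untouched by lipsync.
    """
    assignment = {}
    for i, t in enumerate(frame_times):
        for track_id, segments in speech_segments.items():
            if any(start <= t < end for start, end in segments):
                assignment[i] = track_id
                break
    return assignment

segments = {"face_a": [(0.0, 1.0)], "face_b": [(1.0, 2.0)]}
print(frames_to_sync([0.5, 1.5, 2.5], segments))
# → {0: 'face_a', 1: 'face_b'}  (frame at 2.5s: nobody speaking)
```

The point of the tie-breaking is that lipsync never touches a silent face, which is what keeps long multi-speaker videos coherent.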

Animation that holds up. Pixar-grade animation, AI-generated characters, anything in between. Translation is one use case; the bigger one is editing dialogue freely in post and rethinking what video production looks like.

Record once, edit forever. Traditionally, the take is final the moment you hit stop. lipsync-2 lets you rewrite a line later while keeping the original speaker’s style intact, with no pre-training.

AI Video

When you can generate any video by typing a few sentences, the camera stops being a constraint.

We see AI lip sync as the first surface, not the final one.

This is a strange moment to be making things. A high schooler can shoot a masterpiece on an iPhone. A studio can deliver a film at a tenth of the cost and ten times faster than five years ago. A single video can land in every language on the same day. The goal at sync. is to make video as malleable as text.

Additional Resources

Docs