A lip sync expert is all you need

The problem we set out to solve in Wav2Lip was specific: take a talking-face video of someone the model has never seen, and re-sync their lips to new audio containing speech the original speaker never said. Earlier work could already do this on still images and on speakers seen during training. The moment you handed those models a real, unconstrained video of a stranger, the output collapsed. Sections of the video would drift completely out of sync, and the mouth shapes bore little relationship to the new phonemes.

We traced this back to one architectural choice. Most prior systems trained a lip-sync discriminator alongside the generator, which meant the discriminator was always one step behind and never quite strong enough to penalize bad lip motion. Our fix was to stop training it alongside the generator at all. We pretrained a strong, SyncNet-style lip-sync expert on a large corpus of real, synced video and then froze it. The generator was forced to satisfy a teacher that already knew what good sync looked like and could not be gamed by easier, blurrier outputs.
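
To make the frozen-expert idea concrete, here is a minimal sketch of a sync penalty of this kind, not the released training code. It assumes the expert has already mapped a window of mouth frames and the matching audio window to embedding vectors with non-negative entries, so their cosine similarity can be read as a probability of being in sync. The function names and the plain-list inputs are hypothetical, chosen for illustration.

```python
import math

def cosine_similarity(v, s, eps=1e-8):
    """Cosine similarity between a video embedding v and an audio embedding s."""
    dot = sum(a * b for a, b in zip(v, s))
    norm_v = math.sqrt(sum(a * a for a in v))
    norm_s = math.sqrt(sum(b * b for b in s))
    return dot / max(norm_v * norm_s, eps)

def sync_loss(video_embs, audio_embs, eps=1e-8):
    """Mean of -log(P_sync) over a batch of (video, audio) embedding pairs.

    Assumption: embeddings are non-negative, so the similarity lies in [0, 1].
    The value is clamped to [eps, 1] to keep the logarithm finite.
    """
    losses = []
    for v, s in zip(video_embs, audio_embs):
        p_sync = min(max(cosine_similarity(v, s, eps), eps), 1.0)
        losses.append(-math.log(p_sync))
    return sum(losses) / len(losses)
```

Because the expert's weights stay fixed, the only way for the generator to lower this loss is to produce mouth frames whose embedding lines up with the audio embedding. In a real training loop the same computation would run on tensors, with gradients flowing through the expert into the generator but never updating the expert.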

The second contribution was about measurement. The benchmarks people were using couldn't actually distinguish a well-synced model from a poorly synced one on real footage, which made progress hard to track. We introduced two metrics, LSE-D (Lip-Sync Error Distance) and LSE-C (Lip-Sync Error Confidence), both computed with a pretrained SyncNet model: LSE-D measures the distance between lip and audio representations (lower is better), and LSE-C measures how confident the model is that the two are in sync (higher is better). We also released ReSyncED, a benchmark of real synced videos that lets you compare a model's output to ground truth.
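
As a rough illustration of how two such scores can be derived, the sketch below assumes a SyncNet-style convention: for each clip, the audio-to-lip embedding distance is measured at several temporal offsets, the smallest distance is the clip's sync distance, and the gap between the median and the smallest distance is its confidence. The input format and function names are hypothetical, and the official evaluation scripts may differ in detail.

```python
from statistics import median

def sync_scores(offset_distances):
    """Return (distance, confidence) for one clip.

    offset_distances: mean audio-to-lip embedding distance at each
    candidate temporal offset (hypothetical input format).
    """
    best = min(offset_distances)
    return best, median(offset_distances) - best

def lse_metrics(per_clip_offset_distances):
    """Average the per-clip distance and confidence over a test set."""
    scores = [sync_scores(d) for d in per_clip_offset_distances]
    lse_d = sum(d for d, _ in scores) / len(scores)
    lse_c = sum(c for _, c in scores) / len(scores)
    return lse_d, lse_c
```

A well-synced clip shows one offset with a clearly lower distance than the rest, which gives a small distance and a large confidence. A poorly synced clip has a flat distance profile, so its confidence shrinks toward zero.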

On those benchmarks, Wav2Lip lands close to the accuracy of real synced footage on arbitrary identities. The demo video on the project page makes the gap with prior work obvious in seconds.

Link to Paper