
Audio-Visual Face Reenactment

An identity-aware talking-head generation method that combines a dense motion field over learnable keypoints with audio conditioning of the mouth region, achieving state-of-the-art results across unseen faces, languages, and voices.

rudrabha · 1 min read

Most prior work on talking head video treats the face as a single, monolithic prediction problem: drive a still image with a video and let the network figure out the rest. That works for head pose and expression, but the mouth almost always gives the result away. Audio is the cleanest signal for what the lips should be doing, and most systems were ignoring it.

The method in this paper splits the problem into pieces that can each be solved well. Head motion and expression are transferred from a driving video using a dense motion field built on learnable keypoints, in the spirit of the First Order Motion Model and its successors. The mouth region is conditioned separately on audio, which forces the network to attend to the lips rather than hallucinating shape from neighboring pixels. To prevent the warped motion field from distorting facial structure, we add priors from face segmentation and a face mesh, which keep the underlying geometry consistent.

The final stage is the generator. It takes the source image together with the warped motion features and produces the output frame through an identity-aware module, the component responsible for preserving the speaker-specific details that make a face recognizable: skin texture, contours, the small features that the brain uses to confirm identity at a glance.
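The "warped motion features" the generator consumes come from backward-warping the source by the dense flow. A minimal sketch of that sampling step, under assumptions: in practice the network warps intermediate feature maps rather than raw pixels, and the helper name is hypothetical, but the bilinear mechanics are the same.

```python
import numpy as np

def warp_image(img, flow):
    """Backward-warp img with a dense flow via bilinear sampling.

    img:  (H, W, C) source frame
    flow: (H, W, 2) for each output pixel, the normalized [-1, 1]
          (x, y) location in the source to sample from
    """
    H, W, _ = img.shape
    # Map normalized coordinates back to pixel indices
    x = (flow[..., 0] + 1) * (W - 1) / 2
    y = (flow[..., 1] + 1) * (H - 1) / 2
    x0 = np.clip(np.floor(x).astype(int), 0, W - 2)
    y0 = np.clip(np.floor(y).astype(int), 0, H - 2)
    wx, wy = x - x0, y - y0
    # Bilinear blend of the four neighbouring source pixels
    return ((1 - wy) * (1 - wx))[..., None] * img[y0, x0] \
         + ((1 - wy) * wx)[..., None] * img[y0, x0 + 1] \
         + (wy * (1 - wx))[..., None] * img[y0 + 1, x0] \
         + (wy * wx)[..., None] * img[y0 + 1, x0 + 1]
```

With an identity flow the warp returns the source unchanged, which makes the identity-aware design concrete: everything the flow does not move is passed through, and the generator only has to refine what the warp cannot express.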

The method generalizes to unseen faces, languages, and voices, and outperforms prior work across both quantitative and qualitative metrics. One application worth highlighting is low-bandwidth video calls: send the audio and a single keyframe, and reconstruct the rest at the receiver.
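A back-of-the-envelope calculation shows why the video-call application is attractive. All bitrates below are illustrative assumptions, not figures from the paper: a typical video-call stream versus a low-bitrate speech codec plus a small per-frame keypoint budget and one keyframe.

```python
# Illustrative bitrates (assumptions, not numbers from the paper)
video_kbps = 1000      # a typical 720p video-call stream
audio_kbps = 24        # a speech codec such as Opus at a low bitrate
keypoint_kbps = 10     # assumed budget for per-frame keypoints/pose
keyframe_kb = 50       # one compressed still image, sent once

call_seconds = 60
conventional_kb = video_kbps / 8 * call_seconds
reenactment_kb = (audio_kbps + keypoint_kbps) / 8 * call_seconds + keyframe_kb
print(f"conventional: {conventional_kb:.0f} kB, "
      f"reenactment: {reenactment_kb:.0f} kB, "
      f"saving: {conventional_kb / reenactment_kb:.1f}x")
```

Under these assumed rates the receiver-side reconstruction cuts the transmitted data by roughly an order of magnitude, with the keyframe cost amortized over the whole call.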

Demo and additional information are available here.

Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), 2023, pp. 5178-5187

Link to Paper