Speaking styles for lip-to-speech synthesis
We release Lip2Wav, a 120-hour dataset for single-speaker lip-to-speech synthesis, and a new architecture that learns individual speaking styles to generate natural speech from lip movements. The generated speech is roughly four times more intelligible than that of prior work.
When the audio in a conversation cuts out, you don’t go silent in your head. You read the lips and fill in the gap. We do this without thinking about it, and we do it surprisingly well, partly because we already know how the speaker in front of us tends to articulate.
This paper takes that observation seriously. Earlier work on lip-to-speech synthesis tried to learn a generic, anyone-to-anyone mapping from lip motion to speech, a hard problem made harder by the fact that lip articulation varies enormously across people. We argue that the right framing is the opposite: learn lip-to-speech accurately for individual speakers in unconstrained, large-vocabulary settings, the way a regular listener would.
The data to do that didn’t exist, so we built it. We collected and released Lip2Wav, the first large-scale benchmark for single-speaker lip-to-speech synthesis: roughly 120 hours of natural talking-head video across five speakers, with diverse vocabulary and unconstrained settings. We then trained a sequence-to-sequence architecture that takes a sequence of lip frames and outputs spectrograms. A few of its design choices (the encoder-decoder structure, the visual front-end, and the handling of long sequences) were made specifically for the single-speaker setting.
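To make the input/output contract concrete, here is a minimal sketch of how a window of lip frames lines up in time with the spectrogram frames a model of this kind would predict. The function name and every default value (25 fps video, 16 kHz audio, a hop of 200 samples) are illustrative assumptions, not settings taken from the paper.

```python
def mel_frames_for_clip(num_video_frames: int,
                        fps: float = 25.0,
                        sample_rate: int = 16000,
                        hop_length: int = 200) -> int:
    """Count the spectrogram frames that span the same duration as a lip clip.

    All defaults are assumed values chosen for illustration only.
    """
    if num_video_frames < 0:
        raise ValueError("num_video_frames must be non-negative")
    if fps <= 0 or sample_rate <= 0 or hop_length <= 0:
        raise ValueError("fps, sample_rate and hop_length must be positive")

    # Duration of the clip in seconds, from the video side.
    duration_s = num_video_frames / fps
    # One spectrogram frame is produced every `hop_length` audio samples.
    return round(duration_s * sample_rate / hop_length)


# Under these assumptions, a 3-second window of 75 lip frames
# corresponds to a target of 240 spectrogram frames.
frames = mel_frames_for_clip(75)
```

Under these assumed settings the decoder has to emit several spectrogram frames per video frame, which is one reason a sequence-to-sequence formulation is a natural fit: the input and output sequences have different lengths.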
Across quantitative metrics, qualitative evaluation, and human ratings, the method is roughly four times more intelligible than prior work, and the dataset itself is now used as the standard benchmark for follow-up research in this space.