Physics-Driven Diffusion Models for Impact Sound Synthesis from Videos

1 University of Washington 2 MIT-IBM Watson AI Lab 3 MIT 4 UMass Amherst
CVPR 2023

Our approach synthesizes high-fidelity impact sounds from physics priors and a silent video.


Abstract

Modeling the sounds emitted by physical object interactions is critical for immersive perceptual experiences in real and virtual worlds. Traditional methods of impact sound synthesis use physics simulation to obtain a set of physics parameters that can represent and synthesize the sound. However, they require fine details of both the object geometries and the impact locations, which are rarely available in the real world and cannot be applied to synthesize impact sounds from common videos. On the other hand, existing video-driven deep learning approaches capture only a weak correspondence between visual content and impact sounds because they lack physics knowledge. In this work, we propose a physics-driven diffusion model that can synthesize high-fidelity impact sounds for a silent video clip. In addition to the video content, we propose to use physics priors to guide the impact sound synthesis procedure. The physics priors include both physics parameters that are directly estimated from noisy real-world impact sound examples without a sophisticated setup and learned residual parameters that interpret the sound environment via neural networks. We further implement a novel diffusion model with specific training and inference strategies to combine physics priors and visual information for impact sound synthesis. Experimental results show that our model outperforms several existing systems in generating realistic impact sounds. More importantly, the physics-based representations are fully interpretable and transparent, enabling flexible sound editing. We encourage readers to watch the supplementary video with the audio turned on to experience the results.


Reconstruct Physics Priors From Sound

We reconstruct physics priors from sound with two components: 1) we estimate a set of physics parameters (frequency, power, and decay rate) from the audio waveform via signal processing, and 2) we predict residual parameters that encode the environment information with a transformer encoder. A reconstruction loss is applied to optimize all trainable modules.
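As a rough illustration of the signal-processing step, the sketch below estimates per-mode frequency, power, and decay rate from a recorded impact waveform using an STFT, energy-based peak picking, and a log-linear fit of each band's amplitude envelope. This is a minimal sketch, not the exact pipeline from the paper; the function name, the mode count, and the window length are hypothetical choices.

```python
# Minimal sketch (assumed, not the authors' exact pipeline): estimate
# per-mode frequency, power, and decay rate from a mono waveform `y`
# sampled at `sr`. The mode count is a hypothetical hyperparameter.
import numpy as np
from scipy.signal import stft

def estimate_physics_params(y, sr, num_modes=16, nperseg=1024):
    freqs, times, Z = stft(y, fs=sr, nperseg=nperseg)
    mag = np.abs(Z)                                  # (freq_bins, frames)

    # Pick the K frequency bins with the largest total energy as "modes".
    energy_per_bin = (mag ** 2).sum(axis=1)
    mode_bins = np.argsort(energy_per_bin)[-num_modes:]

    params = []
    for b in mode_bins:
        env = mag[b] + 1e-8                          # amplitude envelope of this band
        # Fit the log-amplitude to a line: log a(t) ~ log a0 - d * t,
        # so the negated slope gives the exponential decay rate d.
        slope, _ = np.polyfit(times, np.log(env), 1)
        params.append({
            "frequency": float(freqs[b]),
            "power": float(energy_per_bin[b]),
            "decay_rate": float(-slope),
        })
    return params
```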


Physics-Driven Diffusion Models

Overview of the physics-driven diffusion model for impact sound synthesis from videos. (left) During training, we reconstruct physics priors from audio samples and encode them into a physics latent. We also use a visual encoder to extract a visual latent from the video input. These two latents serve as conditional inputs to the U-Net spectrogram denoiser. (right) During testing, we extract the visual latent from the test video and use it to query a physics latent from the key-value pairs of visual and physics latents in the training set. Finally, the physics and visual latents are passed to the denoiser as conditional inputs, and the denoiser iteratively generates the impact sound spectrogram.
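The sketch below illustrates the test-time procedure described above under several assumptions: the names `visual_encoder` and `denoiser` are hypothetical, retrieval of the physics latent is shown as a simple cosine nearest-neighbor lookup over the training-set key-value pairs (the paper's exact query mechanism may differ), and the sampling loop follows a diffusers-style scheduler API (`scheduler.timesteps`, `scheduler.step(...).prev_sample`).

```python
# Minimal sketch of test-time conditioning (assumptions noted in the text):
# precomputed training-set visual latents (keys) and physics latents (values),
# plus a trained visual encoder and spectrogram denoiser.
import torch
import torch.nn.functional as F

@torch.no_grad()
def synthesize_spectrogram(video, visual_encoder, denoiser, scheduler,
                           train_visual_latents, train_physics_latents,
                           spec_shape=(1, 128, 256)):
    # The visual latent of the test clip acts as the query.
    v = visual_encoder(video)                                    # (1, D)

    # Query the key-value memory: keys are training visual latents,
    # values are the paired physics latents (cosine nearest neighbor).
    sims = F.cosine_similarity(v, train_visual_latents, dim=-1)  # (N,)
    p = train_physics_latents[sims.argmax()].unsqueeze(0)        # (1, D)

    # Start from Gaussian noise and iteratively denoise the spectrogram,
    # conditioning every step on the visual and physics latents.
    x = torch.randn(1, *spec_shape)
    for t in scheduler.timesteps:
        eps = denoiser(x, t, visual_cond=v, physics_cond=p)
        x = scheduler.step(eps, t, x).prev_sample
    return x
```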

Results & Applications


Impact Sound Generation from In-the-Wild Videos

Applying the Same Physics Priors to Various Videos

Material: Cloth

Material: Glass

Material: Metal

Editing Impact Sounds by Removing Low-Frequency Physics Priors (see the sketch below)
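Because the physics priors are explicit, interpretable mode parameters, this kind of edit can be illustrated as simply dropping modes below a cutoff before re-synthesis. The helper below is hypothetical and reuses the parameter dictionaries from the earlier estimation sketch; the cutoff value is arbitrary.

```python
# Hypothetical illustration of the editing idea above: remove low-frequency
# content by discarding modes whose frequency falls below a cutoff.
def remove_low_frequency_modes(physics_params, cutoff_hz=200.0):
    """Keep only modes whose frequency is at or above `cutoff_hz`."""
    return [p for p in physics_params if p["frequency"] >= cutoff_hz]
```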


Citation


@inproceedings{kun2023physicsdiff,
  title     = {Physics-Driven Diffusion Models for Impact Sound Synthesis from Videos},
  author    = {Kun Su and Kaizhi Qian and Eli Shlizerman and Antonio Torralba and Chuang Gan},
  booktitle = {CVPR},
  year      = {2023}
}

Acknowledgements

We would like to thank the authors of the Greatest Hits dataset for making their dataset available. We thank Vinayak Agarwal for his suggestions on physics mode parameter estimation from raw audio. We also thank the authors of DiffImpact for inspiring us to use physics-based sound synthesis to design physics priors as a conditional signal that guides the deep generative model to synthesize impact sounds from videos.