SyncVP: Joint Diffusion for Synchronous Multi-Modal Video Prediction
Predicting future video frames is essential for decision-making systems, yet RGB frames alone often lack the information needed to fully capture the underlying complexities of the real world. To address this limitation, we propose a multi-modal framework for Synchronous Video Prediction (SyncVP) that incorporates complementary data modalities, enhancing the richness and accuracy of future predictions. SyncVP builds on pre-trained modality-specific diffusion models and introduces an efficient spatio-temporal cross-attention module to enable effective information sharing across modalities. We evaluate SyncVP on standard benchmark datasets, such as Cityscapes and BAIR, using depth as an additional modality. We furthermore demonstrate its generalization to other modalities on SYNTHIA with semantic information and ERA5-Land with climate data. Notably, SyncVP achieves state-of-the-art performance, even in scenarios where only one modality is present, demonstrating its robustness and potential for a wide range of applications.
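The abstract mentions a spatio-temporal cross-attention module for sharing information across modalities. The paper's actual module is not reproduced here; as a minimal, generic sketch of the underlying cross-attention mechanism (plain scaled dot-product attention in NumPy, with hypothetical toy shapes), one modality's tokens can attend to another's like this:

```python
import numpy as np

def cross_attention(queries, keys, values):
    """Scaled dot-product cross-attention: tokens of one modality
    (queries) attend to tokens of another modality (keys/values)."""
    d = queries.shape[-1]
    # Attention scores, shape (batch, n_query_tokens, n_key_tokens)
    scores = queries @ keys.transpose(0, 2, 1) / np.sqrt(d)
    # Numerically stable softmax over the key dimension
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    # Weighted sum of the other modality's values
    return weights @ values

# Toy example: 4 RGB tokens attend to 6 depth tokens, feature dim 8
rng = np.random.default_rng(0)
rgb_tokens = rng.normal(size=(1, 4, 8))
depth_tokens = rng.normal(size=(1, 6, 8))
fused = cross_attention(rgb_tokens, depth_tokens, depth_tokens)
print(fused.shape)  # (1, 4, 8)
```

This is only an illustration of the general mechanism; SyncVP's module additionally operates jointly over spatial and temporal dimensions and is integrated into pre-trained diffusion backbones.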
- Published in: arXiv
- Type: Inproceedings
- Authors: Pallotta, Enrico; Azar, Sina Mokhtarzadeh; Li, Shuai; Zatsarynna, Olga; Gall, Juergen
- Year: 2025
- Source: http://arxiv.org/abs/2503.18933
Citation information
Pallotta, Enrico; Azar, Sina Mokhtarzadeh; Li, Shuai; Zatsarynna, Olga; Gall, Juergen: SyncVP: Joint Diffusion for Synchronous Multi-Modal Video Prediction, arXiv:2503.18933, March 2025, http://arxiv.org/abs/2503.18933
@Inproceedings{Pallotta.etal.2025a,
author={Pallotta, Enrico and Azar, Sina Mokhtarzadeh and Li, Shuai and Zatsarynna, Olga and Gall, Juergen},
title={{SyncVP}: Joint Diffusion for Synchronous Multi-Modal Video Prediction},
booktitle={arXiv},
number={arXiv:2503.18933},
month={March},
publisher={arXiv},
url={http://arxiv.org/abs/2503.18933},
year={2025},
abstract={Predicting future video frames is essential for decision-making systems, yet RGB frames alone often lack the information needed to fully capture the underlying complexities of the real world. To address this limitation, we propose a multi-modal framework for Synchronous Video Prediction (SyncVP) that incorporates complementary data modalities, enhancing the richness and accuracy of future...}}