Video Panels for Long Video Understanding
Recent Video-Language Models (VLMs) achieve promising results on long-video understanding, but their performance still lags behind that achieved on tasks involving images or short videos. This has led to great interest in improving the long-context modeling of VLMs by introducing novel modules and additional complexity. In this paper, we take a different approach: rather than fine-tuning VLMs with the limited data available, we attempt to maximize the performance of existing models. To this end, we propose a novel visual prompting strategy specifically designed for long-video understanding. By combining multiple frames as panels into one image, we effectively trade off spatial detail for temporal resolution. Our approach is training-free, parameter-free, and model-agnostic, and can be seamlessly integrated into existing VLMs. Extensive experiments on five established benchmarks across a wide range of model architectures, sizes, and context windows confirm the consistency of our approach. On the TimeScope (Long) dataset, which has the longest videos, video question-answering accuracy improves by up to 19.4%. Overall, our method raises the bar for long-video understanding models. We will make our code available upon acceptance.
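The core idea described in the abstract — trading spatial detail for temporal coverage by tiling sampled frames into a single panel image — can be sketched in a few lines. The snippet below is a minimal illustration using Pillow, not the authors' released code; the grid size, uniform frame sampling, and output resolution are assumptions chosen for demonstration.

```python
# Minimal sketch of frame-panel construction (illustrative only; not the
# authors' implementation). Assumes frames are already decoded as PIL images.
from PIL import Image

def make_panel(frames, rows=3, cols=3, panel_size=(1024, 1024)):
    """Tile rows*cols uniformly sampled frames into one panel image.

    Each frame is downscaled to a grid cell, trading spatial detail for
    temporal coverage within a single image passed to a VLM.
    """
    n = rows * cols
    assert len(frames) >= n, "need at least rows*cols frames"
    # Uniformly sample n frames across the whole clip.
    step = len(frames) / n
    sampled = [frames[int(i * step)] for i in range(n)]

    cell_w, cell_h = panel_size[0] // cols, panel_size[1] // rows
    panel = Image.new("RGB", panel_size)
    for idx, frame in enumerate(sampled):
        r, c = divmod(idx, cols)
        panel.paste(frame.resize((cell_w, cell_h)), (c * cell_w, r * cell_h))
    return panel
```

A long video could then be fed to an off-the-shelf VLM as a short sequence of such panels, so a fixed frame budget covers many more timestamps than one frame per image would.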
- Published in: arXiv
- Type: Article
- Authors: Doorenbos, Lars; Spurio, Federico; Gall, Juergen
- Year: 2025
- Source: http://arxiv.org/abs/2509.23724
Citation information: Doorenbos, Lars; Spurio, Federico; Gall, Juergen. Video Panels for Long Video Understanding. arXiv, September 2025. arXiv:2509.23724. http://arxiv.org/abs/2509.23724
@Article{Doorenbos.etal.2025a,
author={Doorenbos, Lars and Spurio, Federico and Gall, Juergen},
title={Video Panels for Long Video Understanding},
journal={arXiv},
number={{arXiv}:2509.23724},
month={September},
publisher={{arXiv}},
url={http://arxiv.org/abs/2509.23724},
year={2025},
abstract={Recent Video-Language Models (VLMs) achieve promising results on long-video understanding, but their performance still lags behind that achieved on tasks involving images or short videos. This has led to great interest in improving the long-context modeling of VLMs by introducing novel modules and additional complexity. In this paper, we take a different approach:...}}