Video Panels for Long Video Understanding

Recent Video-Language Models (VLMs) achieve promising results on long-video understanding, but their performance still lags behind that achieved on tasks involving images or short videos. This has led to great interest in improving the long-context modeling of VLMs by introducing novel modules and additional complexity. In this paper, we take a different approach: rather than fine-tuning VLMs with the limited data available, we attempt to maximize the performance of existing models. To this end, we propose a novel visual prompting strategy specifically designed for long-video understanding. By combining multiple frames as panels into one image, we effectively trade off spatial details for temporal resolution. Our approach is training-free, parameter-free, and model-agnostic, and can be seamlessly integrated into existing VLMs. Extensive experiments on five established benchmarks across a wide range of model architectures, sizes, and context windows confirm the consistency of our approach. For the TimeScope (Long) dataset, which has the longest videos, the accuracy for video question answering is improved by up to 19.4%. Overall, our method raises the bar for long-video understanding models. We will make our code available upon acceptance.
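The core idea of trading spatial detail for temporal resolution can be sketched as follows. This is a minimal illustration, not the authors' implementation: the grid layout, the naive strided downsampling, and the function name `make_panel` are all assumptions for the sake of the example.

```python
import numpy as np

def make_panel(frames, rows, cols):
    """Combine rows*cols video frames into a single panel image.

    Each frame (an H x W x C uint8 array) is subsampled by the grid
    factor, so the panel has roughly the resolution of one frame while
    covering rows*cols time steps: spatial detail is exchanged for
    temporal coverage.
    """
    assert len(frames) == rows * cols, "need exactly rows*cols frames"
    # Naive strided downsample; a real pipeline would use proper resizing.
    small = [f[::rows, ::cols] for f in frames]
    grid = [np.concatenate(small[r * cols:(r + 1) * cols], axis=1)
            for r in range(rows)]
    return np.concatenate(grid, axis=0)
```

The resulting panel can then be fed to a VLM as a single image, so that one image token budget covers many time steps of the video.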

Citation information

Doorenbos, Lars; Spurio, Federico; Gall, Juergen: Video Panels for Long Video Understanding. arXiv:2509.23724, September 2025. http://arxiv.org/abs/2509.23724