Efficient CNNs and Transformers for Video Understanding and Image Synthesis
In this talk, I will first discuss approaches that reduce the GFLOPs during inference for 3D convolutional neural networks (CNNs) and vision transformers. Although state-of-the-art 3D CNNs and vision transformers achieve very good results on action recognition datasets, they are computationally very expensive and require many GFLOPs. The GFLOPs of a 3D CNN or vision transformer can be decreased by reducing the temporal feature resolution or the number of tokens, but there is no setting that is optimal for all input clips. I will therefore discuss two differentiable sampling approaches that can be plugged into any existing 3D CNN or vision transformer architecture. The sampling approaches adapt the computational resources to the input video such that as many resources as needed, but no more than necessary, are used to classify a video. The approaches substantially reduce the computational cost (GFLOPs) of state-of-the-art networks while preserving accuracy. In the second part, I will discuss an approach that generates annotated training samples of very rare classes. It is based on a generative adversarial network (GAN) that jointly synthesizes images and the corresponding segmentation mask for each image. The generated data can then be used for one-shot video object segmentation.
- Published in: ACM International Conference on Multimedia Retrieval
- Type: Inproceedings
- Authors: Gall, Jürgen
- Year: 2023
Citation information
Gall, Jürgen: Efficient CNNs and Transformers for Video Understanding and Image Synthesis, ACM International Conference on Multimedia Retrieval, 2023, https://dl.acm.org/doi/10.1145/3591106.3592300, Gall.2023a
@Inproceedings{Gall.2023a,
author={Gall, Jürgen},
title={Efficient CNNs and Transformers for Video Understanding and Image Synthesis},
booktitle={ACM International Conference on Multimedia Retrieval},
url={https://dl.acm.org/doi/10.1145/3591106.3592300},
year={2023},
abstract={In this talk, I will first discuss approaches that reduce the GFLOPs during inference for 3D convolutional neural networks (CNN) and vision transformers. While state-of-the-art 3D CNNs and vision transformers achieve very good results on action recognition datasets, they are computationally very expensive and require many GFLOPs. While the GFLOPs of a 3D CNN or vision transformer can be decreased...}}