HyenaPixel: Global Image Context with Convolutions
In computer vision, a larger effective receptive field (ERF) is associated with better performance. While attention natively supports global context, its quadratic complexity limits its applicability to tasks that benefit from high-resolution input. In this work, we extend Hyena, a convolution-based attention replacement, from causal sequences to bidirectional data and two-dimensional image space. We scale Hyena’s convolution kernels beyond the feature map size, up to 191×191, to maximize ERF while maintaining sub-quadratic complexity in the number of pixels. We integrate our two-dimensional Hyena, HyenaPixel, and bidirectional Hyena into the MetaFormer framework. For image categorization, HyenaPixel and bidirectional Hyena achieve a competitive ImageNet-1k top-1 accuracy of 84.9% and 85.2%, respectively, with no additional training data, while outperforming other convolutional and large-kernel networks. Combining HyenaPixel with attention further improves accuracy. We attribute the success of bidirectional Hyena to learning the data-dependent geometric arrangement of pixels without a fixed neighborhood definition. Experimental results on downstream tasks suggest that HyenaPixel with large filters and a fixed neighborhood leads to better localization performance.
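Note: applying a kernel larger than the feature map at sub-quadratic cost is what one would expect from an FFT-based long convolution, the mechanism behind the original Hyena operator. The sketch below is a minimal, illustrative PyTorch example of such a global 2D convolution under that assumption; the function name fft_conv2d and all shapes are hypothetical and not the authors' implementation.

# Minimal sketch (assumption: FFT-evaluated global 2D convolution, as in Hyena-style operators).
import torch

def fft_conv2d(x: torch.Tensor, kernel: torch.Tensor) -> torch.Tensor:
    """Global 2D convolution via FFT.

    x:      (B, C, H, W) feature map
    kernel: (C, Kh, Kw) per-channel filter, possibly with Kh > H and Kw > W
    """
    B, C, H, W = x.shape
    _, Kh, Kw = kernel.shape
    # Zero-pad both signals to the full linear-convolution size to avoid circular wrap-around.
    fft_h, fft_w = H + Kh - 1, W + Kw - 1
    Xf = torch.fft.rfft2(x, s=(fft_h, fft_w))
    Kf = torch.fft.rfft2(kernel, s=(fft_h, fft_w))
    # Pointwise multiplication in the frequency domain equals convolution in the spatial domain.
    y = torch.fft.irfft2(Xf * Kf.unsqueeze(0), s=(fft_h, fft_w))
    # Center crop back to the input resolution ("same"-sized output).
    top, left = (Kh - 1) // 2, (Kw - 1) // 2
    return y[..., top:top + H, left:left + W]

# Example: a 191x191 filter applied to a 56x56 feature map.
x = torch.randn(2, 64, 56, 56)
k = torch.randn(64, 191, 191)
print(fft_conv2d(x, k).shape)  # torch.Size([2, 64, 56, 56])

Because the FFT evaluation costs O(N log N) in the number of pixels regardless of kernel size, a 191×191 filter remains affordable even on small feature maps; this is the property the abstract's sub-quadratic complexity claim refers to.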
- Published in: arXiv
- Type: Article
- Year: 2024
- Source: https://arxiv.org/abs/2402.19305
Citation information: HyenaPixel: Global Image Context with Convolutions, arXiv, February 2024, https://arxiv.org/abs/2402.19305 (BibTeX key: Spravil.etal.2024a)
@Article{Spravil.etal.2024a,
author={Spravil, Julian and Houben, Sebastian and Behnke, Sven},
title={HyenaPixel: Global Image Context with Convolutions},
journal={arXiv},
month={February},
url={https://arxiv.org/abs/2402.19305},
year={2024},
abstract={In computer vision, a larger effective receptive field (ERF) is associated with better performance. While attention natively supports global context, its quadratic complexity limits its applicability to tasks that benefit from high-resolution input. In this work, we extend Hyena, a convolution-based attention replacement, from causal sequences to bidirectional data and two-dimensional image...}}