Efficient Multi-Object Pose Estimation using Multi-Resolution Deformable Attention and Query Aggregation
Object pose estimation is a long-standing problem in computer vision. Recently, attention-based vision transformer models have achieved state-of-the-art results in many computer vision applications. Exploiting the permutation-invariant nature of the attention mechanism, a family of vision transformer models formulate multi-object pose estimation as a set prediction problem. However, existing vision transformer models for multi-object pose estimation rely exclusively on the attention mechanism. Convolutional neural networks, on the other hand, hard-wire various inductive biases into their architecture. In this paper, we investigate incorporating inductive biases in vision transformer models for multi-object pose estimation, which facilitates learning long-range dependencies while circumventing the costly global attention. In particular, we use multi-resolution deformable attention, where the attention operation is performed only between a few deformed reference points. Furthermore, we propose a query aggregation mechanism that enables increasing the number of object queries without increasing the computational complexity. We evaluate the proposed model on the challenging YCB-Video dataset and report state-of-the-art results.
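The abstract's central idea, attending to only a few sampled points around a reference location instead of to every pixel, can be illustrated with a minimal NumPy sketch. This is a generic single-head deformable-attention example and not the paper's implementation. The function names, the weight matrices `W_off`, `W_attn`, and `W_val`, and the choice to express offsets directly in normalized image coordinates are all assumptions made for illustration.

```python
import numpy as np


def bilinear_sample(feat, xy):
    """Bilinearly sample feat (H, W, C) at normalized (x, y) in [0, 1]^2."""
    H, W, _ = feat.shape
    # Clamp to the map so out-of-range offsets fall back to border values.
    x = float(np.clip(xy[0], 0.0, 1.0)) * (W - 1)
    y = float(np.clip(xy[1], 0.0, 1.0)) * (H - 1)
    x0, y0 = int(np.floor(x)), int(np.floor(y))
    x1, y1 = min(x0 + 1, W - 1), min(y0 + 1, H - 1)
    wx, wy = x - x0, y - y0
    top = (1 - wx) * feat[y0, x0] + wx * feat[y0, x1]
    bottom = (1 - wx) * feat[y1, x0] + wx * feat[y1, x1]
    return (1 - wy) * top + wy * bottom


def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()


def deformable_attention(query, ref_point, feature_maps,
                         W_off, W_attn, W_val, n_points):
    """Hypothetical single-query, single-head multi-resolution sketch.

    query:        (C,) query embedding
    ref_point:    (2,) normalized (x, y) reference location
    feature_maps: list of L arrays, each (H_l, W_l, C)
    W_off:        (C, L * n_points * 2) predicts sampling offsets
    W_attn:       (C, L * n_points) predicts attention logits
    W_val:        (C, C_out) value projection
    """
    n_levels = len(feature_maps)
    # Offsets and weights depend only on the query, not on the image size.
    offsets = (query @ W_off).reshape(n_levels, n_points, 2)
    weights = softmax(query @ W_attn).reshape(n_levels, n_points)

    out = np.zeros(W_val.shape[1])
    for level, feat in enumerate(feature_maps):
        for k in range(n_points):
            loc = ref_point + offsets[level, k]
            out += weights[level, k] * (bilinear_sample(feat, loc) @ W_val)
    return out
```

In this sketch each query touches `len(feature_maps) * n_points` sampled locations, so its cost does not grow with the resolution of the feature maps, whereas global attention scales with the number of pixels.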
- Published in: IEEE International Conference on Robotic Computing
- Type: Inproceedings
- Authors: Periyasamy, Arul Selvam; Tsaturyan, Vladimir; Behnke, Sven
- Year: 2023
Citation information
Periyasamy, Arul Selvam; Tsaturyan, Vladimir; Behnke, Sven: Efficient Multi-Object Pose Estimation using Multi-Resolution Deformable Attention and Query Aggregation, IEEE International Conference on Robotic Computing, November 2023, https://ais.uni-bonn.de/papers/IRC_2023_Periyasamy.pdf, Periyasamy.etal.2023b
@Inproceedings{Periyasamy.etal.2023b,
author={Periyasamy, Arul Selvam and Tsaturyan, Vladimir and Behnke, Sven},
title={Efficient Multi-Object Pose Estimation using Multi-Resolution Deformable Attention and Query Aggregation},
booktitle={IEEE International Conference on Robotic Computing},
month={November},
url={https://ais.uni-bonn.de/papers/IRC_2023_Periyasamy.pdf},
year={2023},
abstract={Object pose estimation is a long-standing problem in computer vision. Recently, attention-based vision transformer models have achieved state-of-the-art results in many computer vision applications. Exploiting the permutation-invariant nature of the attention mechanism, a family of vision transformer models formulate multi-object pose estimation as a set prediction problem. However, existing...}}