T6D-Direct: Transformers for Multi-Object 6D Pose Direct Regression
6D pose estimation is the task of predicting the translation and orientation of objects in a given input image, and it is a crucial prerequisite for many robotics and augmented reality applications. Recently, the Transformer network architecture, equipped with a multi-head self-attention mechanism, has begun to achieve state-of-the-art results in many computer vision tasks. DETR, a Transformer-based model, formulated object detection as a set prediction problem and achieved impressive results without standard components such as region-of-interest pooling, non-maximum suppression, and bounding box proposals. In this work, we propose T6D-Direct, a real-time, single-stage direct method with a Transformer-based architecture built on DETR that performs multi-object 6D pose estimation by direct regression. We evaluate our method on the YCB-Video dataset. Our method achieves the fastest inference time, and its pose estimation accuracy is comparable to state-of-the-art methods.
- Published in: DAGM GCPR 2021: Pattern Recognition, German Conference on Pattern Recognition (DAGM-GCPR)
- Type: Inproceedings
- Authors: A. Amini, A. S. Periyasamy, S. Behnke
- Year: 2021
Citation information
A. Amini, A. S. Periyasamy, S. Behnke: T6D-Direct: Transformers for Multi-Object 6D Pose Direct Regression, German Conference on Pattern Recognition (DAGM-GCPR), DAGM GCPR 2021: Pattern Recognition, 2021, https://doi.org/10.1007/978-3-030-92659-5_34
@Inproceedings{Amini.etal.2021,
author={A. Amini and A. S. Periyasamy and S. Behnke},
title={T6D-Direct: Transformers for Multi-Object 6D Pose Direct Regression},
booktitle={German Conference on Pattern Recognition (DAGM-GCPR)},
journal={DAGM GCPR 2021: Pattern Recognition},
url={https://doi.org/10.1007/978-3-030-92659-5_34},
year={2021},
abstract={6D pose estimation is the task of predicting the translation and orientation of objects in a given input image, and it is a crucial prerequisite for many robotics and augmented reality applications. Recently, the Transformer network architecture, equipped with a multi-head self-attention mechanism, has begun to achieve state-of-the-art results in many computer vision tasks. DETR, a Transformer-based...}}