We propose Co-op, a novel method for accurately and robustly estimating the 6DoF pose of objects unseen during training from a single RGB image. Our method requires only the CAD model of the target object and can precisely estimate its pose without any additional fine-tuning. Whereas existing model-based methods are inefficient because they rely on a large number of templates, our method enables fast and accurate estimation with only a small number of templates. This improvement comes from finding semi-dense correspondences between the input image and the pre-rendered templates. Our method achieves strong generalization by leveraging a hybrid representation that combines patch-level classification and offset regression. Additionally, our pose refinement model estimates probabilistic flow between the input image and the rendered image, refining the initial estimate to an accurate pose using a differentiable PnP layer. We demonstrate that our method not only estimates object poses rapidly but also outperforms existing methods by a large margin on the seven core datasets of the BOP Challenge, achieving state-of-the-art accuracy.
We estimate object pose through two main stages. In the Coarse Pose Estimation stage, we estimate semi-dense correspondences between the query image and templates and compute the initial pose using PnP. In the Pose Refinement stage, we refine the initial pose by estimating dense flow between the query and rendered images. Both stages use transformer encoders and decoders with identical structures, with the Pose Refinement stage additionally incorporating a DPT module after the decoder for dense prediction.
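The hybrid representation in the coarse stage can be illustrated with a small sketch: patch-level classification picks the best-matching template patch for each query patch, and a regressed 2D offset refines the match to sub-patch accuracy. The function below is a hypothetical NumPy illustration of that decoding step, not the paper's implementation; the array shapes, the `patch_size` default, and the function name are all assumptions.

```python
import numpy as np

def semi_dense_correspondences(match_logits, offsets, patch_size=14):
    """Decode semi-dense correspondences from a hybrid representation
    (illustrative sketch, shapes and naming are assumptions).

    match_logits: (N_query, N_template) patch-level matching scores
    offsets:      (N_query, 2) regressed (dx, dy) within the matched patch
    Returns (N_query, 2) pixel coordinates in the template image.
    """
    n_query, n_template = match_logits.shape
    side = int(np.sqrt(n_template))              # template patch grid is side x side
    best = match_logits.argmax(axis=1)           # patch-level classification
    ty, tx = np.divmod(best, side)               # grid coordinates of matched patch
    centers = np.stack([tx, ty], axis=1) * patch_size + patch_size / 2.0
    return centers + offsets                     # offset regression refines the match
```

The resulting 2D-2D matches, lifted to 2D-3D via the template's known depth and pose, can then be fed to a standard PnP solver to obtain the initial pose.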
Qualitative Results of Coarse Estimation. Semi-dense correspondences between the query image and the template.
Qualitative Results of Pose Refinement. From left to right: query image, initial pose rendering, flow, confidence, flow probability, certainty, and sensitivity (color scale from 0.0 to 1.0). The flow probability and certainty reduce confidence in ambiguous or occluded areas, while sensitivity increases confidence in textured regions and object edges to improve pose refinement.
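The caption above describes how the per-pixel cues interact: flow probability and certainty suppress unreliable pixels, while sensitivity boosts informative ones. One plausible way to realize this, shown below purely as an assumption and not as the paper's exact formula, is a multiplicative combination of the three maps followed by normalization, yielding per-pixel weights for the differentiable PnP step.

```python
import numpy as np

def combine_confidence(flow_prob, certainty, sensitivity, eps=1e-8):
    """Illustrative combination of per-pixel cues (an assumption, not the
    paper's exact formulation). All inputs are maps in [0, 1]:
    - flow_prob / certainty suppress ambiguous or occluded pixels,
    - sensitivity up-weights textured regions and object edges.
    Returns weights normalized to sum to 1 for a weighted PnP objective.
    """
    w = flow_prob * certainty * sensitivity
    return w / (w.sum() + eps)
```

With such a map, occluded pixels (certainty near 0) contribute nothing to the pose update, while high-sensitivity edge pixels dominate it.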
@inproceedings{moon2025co,
title={Co-op: Correspondence-based Novel Object Pose Estimation},
author={Moon, Sungphill and Son, Hyeontae and Hur, Dongcheol and Kim, Sangwook},
booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
year={2025}
}