Transformer guided geometry model for flow-based unsupervised visual odometry

Publication Name

Neural Computing and Applications


Existing unsupervised visual odometry (VO) methods either match pairwise images or integrate the temporal information using recurrent neural networks over a long sequence of images. They are either not accurate, time-consuming in training or error accumulative. In this paper, we propose a method consisting of two camera pose estimators that deal with the information from pairwise images and a short sequence of images, respectively. For image sequences, a transformer-like structure is adopted to build a geometry model over a local temporal window, referred to as transformer-based auxiliary pose estimator (TAPE). Meanwhile, a flow-to-flow pose estimator (F2FPE) is proposed to exploit the relationship between pairwise images. The two estimators are constrained through a simple yet effective consistency loss in training. Empirical evaluation has shown that the proposed method outperforms the state-of-the-art unsupervised learning-based methods by a large margin and performs comparably to supervised and traditional ones on the KITTI and Malaga dataset.

Open Access Status

This publication is not available as open access

Funding Number


Funding Sponsor

National Natural Science Foundation of China



Link to publisher version (DOI)