Abstract

Transformers have recently gained increasing attention in computer vision. However, existing studies mostly use Transformers for feature representation learning, e.g. for image classification and dense predictions, and the generalizability of Transformers is unknown. In this work, we further investigate the possibility of applying Transformers for image matching and metric learning given pairs of images. We find that the Vision Transformer (ViT) and the vanilla Transformer with decoders are not adequate for image matching due to their lack of image-to-image attention. Thus, we further design two naive solutions, i.e. query-gallery concatenation in ViT, and query-gallery cross-attention in the vanilla Transformer. The latter improves the performance, but it is still limited. This implies that the attention mechanism in Transformers is primarily designed for global feature aggregation, which is not naturally suitable for image matching. Accordingly, we propose a new simplified decoder, which drops the full attention implementation with the softmax weighting, keeping only the query-key similarity computation. Additionally, global max pooling and a multilayer perceptron (MLP) head are applied to decode the matching result. This way, the simplified decoder is computationally more efficient, while at the same time more effective for image matching. The proposed method, called TransMatcher, achieves state-of-the-art performance in generalizable person re-identification, with up to 6.1% and 5.7% performance gains in Rank-1 and mAP, respectively, on several popular datasets. Code is available at https://github.com/ShengcaiLiao/QAConv.
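
The simplified decoder described above can be illustrated with a minimal NumPy sketch. It is not the released TransMatcher code: the function name `simplified_decoder_score`, the tensor shapes, the single-hidden-layer MLP, and the random weights are all assumptions made for illustration. It only shows the three stated ingredients: query-key similarity without softmax weighting, global max pooling, and an MLP head that decodes a matching score.

```python
import numpy as np

def simplified_decoder_score(query_feat, gallery_feat, w1, b1, w2, b2):
    """Hypothetical sketch: similarity -> global max pooling -> MLP head.

    query_feat:   (n_q, d) local features of the query image
    gallery_feat: (n_g, d) local features of the gallery image
    w1, b1, w2, b2: weights of an assumed two-layer MLP head
    """
    # Query-key similarity only: no softmax, no weighted sum over values.
    sim = query_feat @ gallery_feat.T            # shape (n_q, n_g)
    # Global max pooling: best gallery match for each query location.
    pooled = sim.max(axis=1)                     # shape (n_q,)
    # MLP head (ReLU hidden layer) decodes one matching score.
    hidden = np.maximum(pooled @ w1 + b1, 0.0)   # shape (hidden_dim,)
    return float(hidden @ w2 + b2)

# Toy inputs; sizes and random weights are illustrative assumptions.
rng = np.random.default_rng(0)
n_loc, dim, hidden_dim = 6, 8, 4
query = rng.standard_normal((n_loc, dim))
gallery = rng.standard_normal((n_loc, dim))
w1 = rng.standard_normal((n_loc, hidden_dim))
b1 = np.zeros(hidden_dim)
w2 = rng.standard_normal(hidden_dim)
b2 = 0.0

score = simplified_decoder_score(query, gallery, w1, b1, w2, b2)
```

Because the pooling takes a maximum over gallery locations, the score in this sketch does not depend on the order of the gallery features, which is one way to see that the decoder performs explicit image-to-image matching rather than global feature aggregation.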
