ViT-CoMer: Vision Transformer with Convolutional Multi-scale Feature Interaction for Dense Predictions
Yifeng Shi†, Xin Hao∗, Feng Lv∗, Xinliang Wang∗, Chunlong Xia∗

Abstract
Although Vision Transformer (ViT) has achieved significant success in computer vision, it does not perform well in dense prediction tasks due to the lack of inner-patch information interaction and the limited diversity of feature scale. Most existing studies are devoted to designing vision-specific transformers to solve the above problems, which introduce additional pre-training costs. Therefore, we present a plain, pre-training-free, and feature-enhanced ViT backbone with Convolutional Multi-scale feature interaction, named ViT-CoMer, which facilitates bidirectional interaction between CNN and transformer. Compared to the state-of-the-art, ViT-CoMer has the following advantages: (1) We inject spatial pyramid multi-receptive field convolutional features into the ViT architecture, which effectively alleviates the problems of limited local information interaction and single-feature representation in ViT. (2) We propose a simple and efficient CNN-Transformer bidirectional fusion interaction module that performs multi-scale fusion across hierarchical features, which is beneficial for handling dense prediction tasks. (3) We evaluate the performance of ViT-CoMer across various dense prediction tasks, different frameworks, and multiple advanced pre-training. Notably, our ViT-CoMer-L achieves 64.3% AP on COCO val2017 without extra training data, and 62.1% mIoU on ADE20K val, both of which are comparable to state-of-the-art methods. We hope ViT-CoMer can serve as a new backbone for dense prediction tasks to facilitate future research. The code will be released at https://github.com/Traffic-X/ViT-CoMer.
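The two ideas the abstract leans on, a single-scale ViT feature map versus a multi-scale convolutional pyramid, and a two-way exchange between them, can be made concrete with a toy sketch. This is not the authors' module. The strides (16 for the ViT tokens; 8, 16 and 32 for the CNN pyramid), the 1-D feature rows, the nearest-neighbour resampling, and the use of plain addition as the fusion operator are all assumptions chosen for brevity, and the function names are hypothetical.

```python
from typing import Dict, List, Tuple


def grid_shape(height: int, width: int, stride: int) -> Tuple[int, int]:
    """Spatial size of a feature map produced at the given stride."""
    return height // stride, width // stride


def resize_nearest(row: List[float], new_len: int) -> List[float]:
    """Nearest-neighbour resampling of a 1-D feature row.

    Toy stand-in for 2-D interpolation between feature-map resolutions.
    """
    old_len = len(row)
    return [row[i * old_len // new_len] for i in range(new_len)]


def bidirectional_exchange(
    vit_row: List[float], cnn_rows: Dict[int, List[float]]
) -> Tuple[List[float], Dict[int, List[float]]]:
    """One toy round of CNN -> ViT followed by ViT -> CNN fusion.

    Addition is an assumed fusion operator, used only for illustration.
    """
    # CNN -> ViT: bring every pyramid level to the ViT resolution and add it.
    fused_vit = list(vit_row)
    for row in cnn_rows.values():
        resized = resize_nearest(row, len(fused_vit))
        fused_vit = [a + b for a, b in zip(fused_vit, resized)]

    # ViT -> CNN: send the enriched ViT row back to every pyramid level.
    fused_cnn = {
        stride: [a + b for a, b in zip(row, resize_nearest(fused_vit, len(row)))]
        for stride, row in cnn_rows.items()
    }
    return fused_vit, fused_cnn


if __name__ == "__main__":
    # A 512x512 input: one ViT grid versus three assumed pyramid levels.
    print("ViT:", grid_shape(512, 512, 16))
    for s in (8, 16, 32):
        print("CNN stride", s, grid_shape(512, 512, s))

    # A 64-pixel-wide row, represented at each assumed stride.
    vit = [1.0] * 4
    cnn = {8: [1.0] * 8, 16: [2.0] * 4, 32: [3.0] * 2}
    print(bidirectional_exchange(vit, cnn))
```

The point of the sketch is only the data flow: the ViT branch receives information from several resolutions, and each pyramid level in turn receives the updated ViT features, so both branches end up carrying multi-scale information.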
Benchmarks
| Benchmark | Methodology | Metrics |
|---|---|---|
| instance-segmentation-on-coco-minival | ViT-CoMer-L (Mask RCNN, DINOv2) | mask AP: 55.9 |
| object-detection-on-coco-minival | ViT-CoMer | Params (M): 363; box AP: 64.3 |
| semantic-segmentation-on-ade20k-val | ViT-CoMer | mIoU: 62.1 |