Zhe Chen, Yuchen Duan, Wenhai Wang, Junjun He, Tong Lu, Jifeng Dai, Yu Qiao

Abstract
This work investigates a simple yet powerful dense prediction task adapter for Vision Transformer (ViT). Unlike recent variants that incorporate vision-specific inductive biases into their architectures, the plain ViT suffers from inferior performance on dense prediction tasks due to its weak prior assumptions. To address this issue, we propose the ViT-Adapter, which allows a plain ViT to achieve performance comparable to that of vision-specific transformers. Specifically, the backbone in our framework is a plain ViT that can learn powerful representations from large-scale multi-modal data. When transferring to downstream tasks, a pre-training-free adapter is used to introduce image-related inductive biases into the model, making it suitable for these tasks. We evaluate ViT-Adapter on multiple dense prediction tasks, including object detection, instance segmentation, and semantic segmentation. Notably, without using extra detection data, our ViT-Adapter-L yields state-of-the-art 60.9 box AP and 53.0 mask AP on COCO test-dev. We hope that ViT-Adapter can serve as an alternative to vision-specific transformers and facilitate future research. The code and models will be released at https://github.com/czczup/ViT-Adapter.
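To make the high-level idea more concrete, the sketch below illustrates one way a pre-training-free adapter can inject convolutional spatial priors into an unmodified, plain ViT via cross-attention. This is a minimal PyTorch sketch under stated assumptions, not the released implementation: module names such as `SpatialPriorModule` and `InjectorBlock`, the convolutional stem, and the use of standard multi-head cross-attention are illustrative choices; refer to the repository linked above for the actual code.

```python
# Conceptual sketch (NOT the released ViT-Adapter implementation) of injecting
# convolutional spatial priors into a plain ViT via cross-attention.
# Module names and hyper-parameters here are illustrative assumptions.
import torch
import torch.nn as nn


class SpatialPriorModule(nn.Module):
    """Lightweight convolutional stem that supplies image-related inductive biases."""

    def __init__(self, embed_dim: int = 768):
        super().__init__()
        self.stem = nn.Sequential(
            nn.Conv2d(3, embed_dim // 4, kernel_size=3, stride=2, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(embed_dim // 4, embed_dim, kernel_size=3, stride=2, padding=1),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # (B, 3, H, W) -> (B, N_prior, C): flatten spatial features into tokens
        feat = self.stem(x)
        return feat.flatten(2).transpose(1, 2)


class InjectorBlock(nn.Module):
    """Cross-attention that injects spatial-prior tokens into the ViT tokens."""

    def __init__(self, embed_dim: int = 768, num_heads: int = 8):
        super().__init__()
        self.norm_q = nn.LayerNorm(embed_dim)
        self.norm_kv = nn.LayerNorm(embed_dim)
        self.attn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)
        # Zero-initialized gate: the adapter starts as an identity mapping.
        self.gamma = nn.Parameter(torch.zeros(embed_dim))

    def forward(self, vit_tokens: torch.Tensor, prior_tokens: torch.Tensor) -> torch.Tensor:
        q = self.norm_q(vit_tokens)
        kv = self.norm_kv(prior_tokens)
        out, _ = self.attn(q, kv, kv, need_weights=False)
        return vit_tokens + self.gamma * out  # residual injection


class ViTAdapterSketch(nn.Module):
    """Wraps plain ViT blocks; the adapter is randomly initialized (pre-training-free)."""

    def __init__(self, vit_blocks: nn.ModuleList, embed_dim: int = 768):
        super().__init__()
        self.vit_blocks = vit_blocks            # unmodified ViT transformer blocks
        self.spm = SpatialPriorModule(embed_dim)
        self.injectors = nn.ModuleList(
            InjectorBlock(embed_dim) for _ in range(len(vit_blocks))
        )

    def forward(self, image: torch.Tensor, vit_tokens: torch.Tensor) -> torch.Tensor:
        # vit_tokens are the patch embeddings produced by the plain ViT's own stem.
        prior_tokens = self.spm(image)
        for block, injector in zip(self.vit_blocks, self.injectors):
            vit_tokens = injector(vit_tokens, prior_tokens)
            vit_tokens = block(vit_tokens)
        return vit_tokens
```

One design point worth noting in this sketch: the zero-initialized gating parameter makes each injector an identity mapping at the start of fine-tuning, so the pre-trained ViT's representations are preserved and the adapter's influence is learned gradually from the downstream data.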
Benchmarks
| Benchmark | Methodology | Metrics |
|---|---|---|
| instance-segmentation-on-coco | ViT-Adapter-L (HTC++, BEiTv2 pretrain, multi-scale) | mask AP: 53.0 |
| instance-segmentation-on-coco | ViT-Adapter-L (HTC++, BEiT pretrain, multi-scale) | mask AP: 52.5 |
| instance-segmentation-on-coco | ViT-Adapter-L (HTC++, BEiTv2, O365, multi-scale) | mask AP: 54.5 |
| instance-segmentation-on-coco-minival | ViT-Adapter-L (HTC++, BEiT pretrain, multi-scale) | mask AP: 52.2 |
| instance-segmentation-on-coco-minival | ViT-Adapter-L (HTC++, BEiTv2, O365, multi-scale) | mask AP: 54.2 |
| instance-segmentation-on-coco-minival | ViT-Adapter-L (HTC++, BEiTv2 pretrain, multi-scale) | mask AP: 52.5 |
| object-detection-on-coco | ViT-Adapter-L (HTC++, BEiT pretrain, multi-scale) | box mAP: 60.4 |
| object-detection-on-coco | ViT-Adapter-L (HTC++, BEiTv2 pretrain, multi-scale) | box mAP: 60.9 |
| object-detection-on-coco-minival | ViT-Adapter-L (HTC++, BEiT pretrain, multi-scale) | box AP: 60.2 |
| object-detection-on-coco-minival | ViT-Adapter-L (HTC++, BEiTv2 pretrain, multi-scale) | box AP: 60.5 |
| object-detection-on-coco-o | ViT-Adapter (BEiTv2-L) | Average mAP: 34.25; Effective Robustness: 7.79 |
| panoptic-segmentation-on-coco-minival | ViT-Adapter-L (single-scale, BEiTv2 pretrain, Mask2Former) | AP: 48.9; PQ: 58.4; PQst: 48.4; PQth: 65.0 |
| semantic-segmentation-on-ade20k | ViT-Adapter-L (UperNet, BEiT pretrain) | Params (M): 451; Validation mIoU: 58.4 |
| semantic-segmentation-on-ade20k | ViT-Adapter-L (Mask2Former, BEiT pretrain) | Params (M): 571; Validation mIoU: 60.5 |
| semantic-segmentation-on-ade20k | ViT-Adapter-L (Mask2Former, BEiTv2 pretrain) | Params (M): 571; Validation mIoU: 61.5 |
| semantic-segmentation-on-ade20k-val | ViT-Adapter-L (UperNet, BEiT pretrain) | mIoU: 58.4 |
| semantic-segmentation-on-ade20k-val | ViT-Adapter-L (Mask2Former, BEiT pretrain) | mIoU: 60.5 |
| semantic-segmentation-on-cityscapes | ViT-Adapter-L (Mask2Former, BEiT pretrain) | Mean IoU (class): 85.2% |
| semantic-segmentation-on-cityscapes-val | ViT-Adapter-L | mIoU: 85.8 |
| semantic-segmentation-on-pascal-context | ViT-Adapter-L (Mask2Former, BEiT pretrain) | mIoU: 68.2 |
| semantic-segmentation-on-pascal-context | ViT-Adapter-L (UperNet, BEiT pretrain) | mIoU: 67.5 |