Exploring Plain Vision Transformer Backbones for Object Detection

Yanghao Li, Hanzi Mao, Ross Girshick, Kaiming He

Abstract

We explore the plain, non-hierarchical Vision Transformer (ViT) as a backbone network for object detection. This design enables the original ViT architecture to be fine-tuned for object detection without needing to redesign a hierarchical backbone for pre-training. With minimal adaptations for fine-tuning, our plain-backbone detector can achieve competitive results. Surprisingly, we observe: (i) it is sufficient to build a simple feature pyramid from a single-scale feature map (without the common FPN design) and (ii) it is sufficient to use window attention (without shifting) aided with very few cross-window propagation blocks. With plain ViT backbones pre-trained as Masked Autoencoders (MAE), our detector, named ViTDet, can compete with the previous leading methods that were all based on hierarchical backbones, reaching up to 61.3 AP_box on the COCO dataset using only ImageNet-1K pre-training. We hope our study will draw attention to research on plain-backbone detectors. Code for ViTDet is available in Detectron2.
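To make observation (ii) concrete, the following is a minimal pure-Python sketch of non-shifted window partitioning, the step that restricts attention to local windows. It is an illustration only, not the Detectron2 implementation: the function names `window_partition` and `window_unpartition`, the nested-list representation, and the toy 4x4 grid are all assumptions made for this example. Attention itself and the cross-window propagation blocks are not shown.

```python
from typing import List

Grid = List[List[int]]


def window_partition(grid: Grid, window: int) -> List[Grid]:
    """Split an H x W grid into non-overlapping window x window tiles (row-major)."""
    height, width = len(grid), len(grid[0])
    if height % window or width % window:
        # Assumption of this sketch: no padding, so sizes must divide evenly.
        raise ValueError("H and W must be multiples of the window size")
    tiles = []
    for top in range(0, height, window):
        for left in range(0, width, window):
            tiles.append([row[left:left + window] for row in grid[top:top + window]])
    return tiles


def window_unpartition(tiles: List[Grid], height: int, width: int) -> Grid:
    """Inverse of window_partition: stitch the tiles back into an H x W grid."""
    window = len(tiles[0])
    tiles_per_row = width // window
    grid = [[0] * width for _ in range(height)]
    for index, tile in enumerate(tiles):
        top = (index // tiles_per_row) * window
        left = (index % tiles_per_row) * window
        for dy, row in enumerate(tile):
            grid[top + dy][left:left + window] = row
    return grid


# Toy example: a 4 x 4 map of token ids split into four 2 x 2 windows.
tokens = [[r * 4 + c for c in range(4)] for r in range(4)]
windows = window_partition(tokens, 2)
restored = window_unpartition(windows, 4, 4)
```

In a window-attention block, self-attention would be computed independently over the tokens of each tile, and the tiles would then be stitched back into the full map. Because the windows are never shifted, tokens in different tiles exchange information only in the few cross-window propagation blocks the abstract mentions.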

Benchmarks

| Benchmark | Methodology | Metrics |
| --- | --- | --- |
| cross-domain-few-shot-object-detection-on | ViTDeT-FT | mAP: 23.4 |
| cross-domain-few-shot-object-detection-on-1 | ViTDeT-FT | mAP: 25.6 |
| cross-domain-few-shot-object-detection-on-2 | ViTDeT-FT | mAP: 29.4 |
| cross-domain-few-shot-object-detection-on-3 | ViTDeT-FT | mAP: 6.5 |
| cross-domain-few-shot-object-detection-on-4 | ViTDeT-FT | mAP: 15.8 |
| cross-domain-few-shot-object-detection-on-neu | ViTDeT-FT | mAP: 15.8 |
| instance-segmentation-on-coco-minival | ViTDet, ViT-H Cascade | mask AP: 52 |
| instance-segmentation-on-coco-minival | ViTDet, ViT-H Cascade (multiscale) | mask AP: 53.1 |
| instance-segmentation-on-lvis-v1-0-val | ViTDet-H | mask AP: 48.1; mask APr: 36.9 |
| instance-segmentation-on-lvis-v1-0-val | ViTDet-L | mask AP: 46.0; mask APr: 34.3 |
| object-detection-on-coco-minival | ViTDet, ViT-H Cascade | box AP: 60.4 |
| object-detection-on-coco-minival | ViTDet, ViT-H Cascade (multiscale) | box AP: 61.3 |
| object-detection-on-coco-o | ViTDet (ViT-H) | Effective Robustness: 7.89 |
| object-detection-on-coco-o | ViTDet (ViT-H) | Average mAP: 34.3 |
| object-detection-on-lvis-v1-0-val | ViTDet-L | box AP: 51.2 |
| object-detection-on-lvis-v1-0-val | ViTDet-H | box AP: 53.4 |
