CoDet: Co-Occurrence Guided Region-Word Alignment for Open-Vocabulary Object Detection
Chuofan Ma, Yi Jiang, Xin Wen, Zehuan Yuan, Xiaojuan Qi

Abstract
Deriving reliable region-word alignment from image-text pairs is critical to learn object-level vision-language representations for open-vocabulary object detection. Existing methods typically rely on pre-trained or self-trained vision-language models for alignment, which are prone to limitations in localization accuracy or generalization capabilities. In this paper, we propose CoDet, a novel approach that overcomes the reliance on a pre-aligned vision-language space by reformulating region-word alignment as a co-occurring object discovery problem. Intuitively, by grouping images that mention a shared concept in their captions, objects corresponding to the shared concept should exhibit high co-occurrence among the group. CoDet then leverages visual similarities to discover the co-occurring objects and align them with the shared concept. Extensive experiments demonstrate that CoDet has superior performance and compelling scalability in open-vocabulary detection, e.g., by scaling up the visual backbone, CoDet achieves 37.0 $\text{AP}^m_{novel}$ and 44.7 $\text{AP}^m_{all}$ on OV-LVIS, surpassing the previous SoTA by 4.2 $\text{AP}^m_{novel}$ and 9.8 $\text{AP}^m_{all}$. Code is available at https://github.com/CVMI-Lab/CoDet.
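The core idea in the abstract — grouping images whose captions mention a shared concept, then using cross-image visual similarity to pick out the co-occurring object region — can be illustrated with a minimal sketch. This is a hypothetical simplification, not the paper's actual implementation: `select_co_occurring_regions` and its scoring rule are illustrative assumptions, standing in for whatever grouping and matching CoDet actually uses.

```python
import numpy as np

def select_co_occurring_regions(region_feats):
    """Hypothetical sketch of co-occurring object discovery.

    region_feats: list of (R_i, D) arrays, one per image in a concept
    group, holding L2-normalized region-proposal embeddings from some
    visual backbone. For each image, pick the region whose feature is
    most similar, on average, to the region features of the *other*
    images in the group; high cross-image similarity suggests it
    depicts the concept shared by all captions.
    Returns the selected region index per image.
    """
    selected = []
    for i, feats in enumerate(region_feats):
        # Stack all region features from the other images in the group.
        others = np.concatenate(
            [f for j, f in enumerate(region_feats) if j != i]
        )
        # Cosine similarity (features are unit-norm) of each region
        # in image i to every region elsewhere in the group.
        sim = feats @ others.T          # shape (R_i, sum of other R_j)
        score = sim.mean(axis=1)        # average cross-image similarity
        selected.append(int(score.argmax()))
    return selected
```

The selected region in each image would then be aligned with the shared concept word, yielding region-word pairs without any pre-aligned vision-language space.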
Code Repositories
https://github.com/CVMI-Lab/CoDet
Benchmarks
| Benchmark | Method | Metric |
|---|---|---|
| open-vocabulary-object-detection-on-lvis-v1-0 | CoDet (EVA02-L) | AP novel (LVIS base training): 37.0 |