CoDet: Co-Occurrence Guided Region-Word Alignment for Open-Vocabulary Object Detection
Chuofan Ma, Yi Jiang, Xin Wen, Zehuan Yuan, Xiaojuan Qi

Abstract
Deriving reliable region-word alignment from image-text pairs is critical to learn object-level vision-language representations for open-vocabulary object detection. Existing methods typically rely on pre-trained or self-trained vision-language models for alignment, which are prone to limitations in localization accuracy or generalization capabilities. In this paper, we propose CoDet, a novel approach that overcomes the reliance on a pre-aligned vision-language space by reformulating region-word alignment as a co-occurring object discovery problem. Intuitively, by grouping images that mention a shared concept in their captions, objects corresponding to the shared concept should exhibit high co-occurrence among the group. CoDet then leverages visual similarities to discover the co-occurring objects and align them with the shared concept. Extensive experiments demonstrate that CoDet has superior performance and compelling scalability in open-vocabulary detection, e.g., by scaling up the visual backbone, CoDet achieves 37.0 $\text{AP}^m_{novel}$ and 44.7 $\text{AP}^m_{all}$ on OV-LVIS, surpassing the previous SoTA by 4.2 $\text{AP}^m_{novel}$ and 9.8 $\text{AP}^m_{all}$. Code is available at https://github.com/CVMI-Lab/CoDet.
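The core idea in the abstract — grouping images whose captions mention a shared concept, then using cross-image visual similarity to pick out the co-occurring object region — can be illustrated with a minimal sketch. This is a hypothetical simplification, not the paper's actual implementation: `select_co_occurring_regions` and its scoring rule are illustrative assumptions, standing in for whatever grouping and matching CoDet actually uses.

```python
import numpy as np

def select_co_occurring_regions(region_feats):
    """Hypothetical sketch of co-occurring object discovery.

    region_feats: list of (R_i, D) arrays, one per image in a concept
    group, holding L2-normalized region-proposal embeddings from some
    visual backbone. For each image, pick the region whose feature is
    most similar, on average, to the region features of the *other*
    images in the group; high cross-image similarity suggests it
    depicts the concept shared by all captions.
    Returns the selected region index per image.
    """
    selected = []
    for i, feats in enumerate(region_feats):
        # Stack all region features from the other images in the group.
        others = np.concatenate(
            [f for j, f in enumerate(region_feats) if j != i]
        )
        # Cosine similarity (features are unit-norm) of each region
        # in image i to every region elsewhere in the group.
        sim = feats @ others.T          # shape (R_i, sum of other R_j)
        score = sim.mean(axis=1)        # average cross-image similarity
        selected.append(int(score.argmax()))
    return selected
```

The selected region in each image would then be aligned with the shared concept word, yielding region-word pairs without any pre-aligned vision-language space.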
Code Repositories
https://github.com/CVMI-Lab/CoDet
Benchmarks
| Benchmark | Method | Metric |
|---|---|---|
| open-vocabulary-object-detection-on-lvis-v1-0 | CoDet (EVA02-L) | AP novel (LVIS base training): 37.0 |