5 months ago

OV-DINO: Unified Open-Vocabulary Detection with Language-Aware Selective Fusion

Hao Wang; Pengzhen Ren; Zequn Jie; Xiao Dong; Chengjian Feng; Yinlong Qian; Lin Ma; Dongmei Jiang; Yaowei Wang; Xiangyuan Lan; Xiaodan Liang

Abstract

Open-vocabulary detection is challenging because it requires detecting objects from class names, including names not encountered during training. Existing methods have shown strong zero-shot detection capabilities through pre-training and pseudo-labeling on diverse large-scale datasets. However, these approaches face two main challenges: (i) how to effectively eliminate data noise from pseudo-labeling, and (ii) how to efficiently leverage language-aware capability for region-level cross-modality fusion and alignment. To address these challenges, we propose a novel unified open-vocabulary detection method called OV-DINO, which is pre-trained on diverse large-scale datasets with language-aware selective fusion in a unified framework. Specifically, we introduce a Unified Data Integration (UniDI) pipeline that enables end-to-end training and eliminates noise from pseudo-label generation by unifying different data sources into a detection-centric data format. In addition, we propose a Language-Aware Selective Fusion (LASF) module that enhances cross-modality alignment through a language-aware query selection and fusion process. We evaluate the proposed OV-DINO on popular open-vocabulary detection benchmarks, where it achieves state-of-the-art zero-shot results with an AP of 50.6% on the COCO benchmark and 40.1% on the LVIS benchmark, demonstrating strong generalization ability. Furthermore, OV-DINO fine-tuned on COCO achieves 58.4% AP, outperforming many existing methods with the same backbone. The code for OV-DINO is available at https://github.com/wanghao9610/OV-DINO.
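
To make the idea of a detection-centric data format concrete, here is a minimal sketch of how three kinds of data (detection, grounding, and image-text) might be mapped into one record type. This is an illustration under stated assumptions, not the OV-DINO implementation: the `DetectionSample` type and the `from_detection`, `from_grounding`, and `from_caption` helpers are hypothetical names, and the rule of treating a caption as a single category with an image-sized box is an assumption about how image-text pairs could be unified without pseudo-labels.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple

# (x_min, y_min, x_max, y_max) in pixels
Box = Tuple[float, float, float, float]


@dataclass
class DetectionSample:
    """Hypothetical unified record: every source ends up in this shape."""
    image_id: str
    width: int
    height: int
    categories: List[str]  # text prompts that play the role of class names
    boxes: List[Box]       # one box per annotated instance
    labels: List[int]      # index into `categories` for each box


def from_detection(image_id: str, width: int, height: int,
                   class_names: Sequence[str],
                   boxes: Sequence[Box],
                   labels: Sequence[int]) -> DetectionSample:
    # Detection data already has the target shape; class names are the prompts.
    return DetectionSample(image_id, width, height,
                           list(class_names), list(boxes), list(labels))


def from_grounding(image_id: str, width: int, height: int,
                   phrases_with_boxes: Sequence[Tuple[str, Box]]) -> DetectionSample:
    # Assumption: each distinct grounded phrase becomes its own category.
    categories: List[str] = []
    boxes: List[Box] = []
    labels: List[int] = []
    for phrase, box in phrases_with_boxes:
        if phrase not in categories:
            categories.append(phrase)
        boxes.append(box)
        labels.append(categories.index(phrase))
    return DetectionSample(image_id, width, height, categories, boxes, labels)


def from_caption(image_id: str, width: int, height: int,
                 caption: str) -> DetectionSample:
    # Assumption: the whole caption is one category and the whole image is
    # its box, so no pseudo-label generator is needed for image-text pairs.
    full_image: Box = (0.0, 0.0, float(width), float(height))
    return DetectionSample(image_id, width, height, [caption], [full_image], [0])


if __name__ == "__main__":
    sample = from_caption("img-0001", 640, 480, "a dog running on a beach")
    print(sample.categories, sample.boxes, sample.labels)
```

Because every source produces the same record type in this sketch, a single detection loss could be applied to all of them during end-to-end training, which is the property the abstract attributes to UniDI.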

Code Repositories

wanghao9610/ov-dino (official, PyTorch)

Benchmarks

Benchmark | Methodology | Metrics
zero-shot-object-detection-on-lvis-v1-0 | OV-DINO-T (without LVIS data, swin tiny) | AP: 40.1
zero-shot-object-detection-on-lvis-v1-0-val | OV-DINO-T (without LVIS data, swin tiny) | AP: 32.9
zero-shot-object-detection-on-mscoco | OV-DINO-T (without COCO data) | AP: 50.6
