Yifan Xu; Mengdan Zhang; Chaoyou Fu; Peixian Chen; Xiaoshan Yang; Ke Li; Changsheng Xu

Abstract
We introduce MQ-Det, an efficient architecture and pre-training strategy that uses both textual descriptions, which offer open-set generalization, and visual exemplars, which offer rich description granularity, as category queries. We call this setting Multi-modal Queried object Detection; it targets real-world detection with both open-vocabulary categories and varying granularity. MQ-Det incorporates vision queries into existing, well-established language-queried-only detectors. A plug-and-play gated class-scalable perceiver module on top of the frozen detector augments category text with class-wise visual information. To address the learning inertia problem caused by the frozen detector, we propose a vision-conditioned masked language prediction strategy. MQ-Det's simple yet effective architecture and training strategy are compatible with most language-queried object detectors, enabling versatile applications. Experimental results demonstrate that multi-modal queries substantially boost open-world detection. For instance, MQ-Det improves the state-of-the-art open-set detector GLIP by +7.8% AP on the LVIS benchmark via multi-modal queries without any downstream finetuning, and by +6.3% AP on average across 13 few-shot downstream tasks, at the cost of only an additional 3% of the modulating time required by GLIP. Code is available at https://github.com/YifanXu74/MQ-Det.
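The idea of augmenting category text with class-wise visual information through a gate can be illustrated with a small sketch. This is not the MQ-Det implementation: the function names (`gated_augment`, `augment_categories`), the single-head dot-product attention, and the scalar `tanh` gate are all assumptions chosen for illustration. Each category's text embedding attends only to that category's own visual exemplars, and the result is added back through the gate, so a gate of zero leaves the text embedding, and therefore the frozen detector's input, unchanged.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def gated_augment(text_emb, vision_queries, gate):
    """Hypothetical gated residual cross-attention for one category.

    Returns text_emb + tanh(gate) * attention(text_emb, vision_queries).
    With gate == 0.0 the output equals text_emb.
    """
    if not vision_queries:
        # No exemplars for this category: fall back to text only.
        return list(text_emb)
    scale = 1.0 / math.sqrt(len(text_emb))
    weights = softmax([dot(text_emb, v) * scale for v in vision_queries])
    attended = [
        sum(w * v[i] for w, v in zip(weights, vision_queries))
        for i in range(len(text_emb))
    ]
    g = math.tanh(gate)
    return [t + g * a for t, a in zip(text_emb, attended)]


def augment_categories(text_embs, exemplars, gate):
    """Apply the gated augmentation class by class.

    text_embs: dict mapping category name -> text embedding (list of floats)
    exemplars: dict mapping category name -> list of vision query vectors
    Each category sees only its own exemplars, so the number of
    categories can vary freely.
    """
    return {
        name: gated_augment(emb, exemplars.get(name, []), gate)
        for name, emb in text_embs.items()
    }
```

Starting the gate at zero is one common way to attach a new module to a frozen model without disturbing its initial predictions; whether MQ-Det initializes its gate this way is not stated in the abstract.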
Benchmarks
| Benchmark | Methodology | Metrics |
|---|---|---|
| few-shot-object-detection-on-odinw-13 | MQ-GLIP-T | Average Score: 57 |
| few-shot-object-detection-on-odinw-35 | MQ-GLIP-T | Average Score: 43 |
| object-detection-on-odinw-full-shot-13-tasks | MQ-GLIP-L | AP: 71.3 |
| zero-shot-object-detection-on-lvis-v1-0 | MQ-GLIP-L | AP: 43.4 |
| zero-shot-object-detection-on-lvis-v1-0 | MQ-GLIP-T | AP: 30.4 |
| zero-shot-object-detection-on-lvis-v1-0 | MQ-GroundingDINO-T | AP: 30.2 |
| zero-shot-object-detection-on-lvis-v1-0-val | MQ-GLIP-L | AP: 34.7 |
| zero-shot-object-detection-on-lvis-v1-0-val | MQ-GroundingDINO-T | AP: 22.1 |
| zero-shot-object-detection-on-lvis-v1-0-val | MQ-GLIP-T | AP: 22.6 |
| zero-shot-object-detection-on-odinw | MQ-GLIP-L | Average Score: 23.9 |