GLIPv2: Unifying Localization and Vision-Language Understanding
Haotian Zhang, Pengchuan Zhang, Xiaowei Hu, Yen-Chun Chen, Liunian Harold Li, Xiyang Dai, Lijuan Wang, Lu Yuan, Jenq-Neng Hwang, Jianfeng Gao

Abstract
We present GLIPv2, a grounded VL understanding model that serves both localization tasks (e.g., object detection, instance segmentation) and Vision-Language (VL) understanding tasks (e.g., VQA, image captioning). GLIPv2 elegantly unifies localization pre-training and Vision-Language Pre-training (VLP) with three pre-training tasks: phrase grounding as a VL reformulation of the detection task, region-word contrastive learning as a novel region-word level contrastive learning task, and masked language modeling. This unification not only simplifies the previous multi-stage VLP procedure but also achieves mutual benefits between localization and understanding tasks. Experimental results show that a single GLIPv2 model (all model weights are shared) achieves near-SoTA performance on various localization and understanding tasks. The model also shows (1) strong zero-shot and few-shot adaptation performance on open-vocabulary object detection tasks and (2) superior grounding capability on VL understanding tasks. Code will be released at https://github.com/microsoft/GLIP.
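To make the region-word level contrastive idea concrete, below is a minimal sketch, not the authors' implementation: it assumes region features from a detector and word features from a text encoder are already computed, scores every region against every word with a cosine-similarity logit, and applies a cross-entropy loss over the matching words. The paper's actual loss operates across images in a batch and uses its own encoders and matching rules; the function name, shapes, and temperature here are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def region_word_contrastive_loss(region_feats, word_feats, labels, temperature=0.07):
    """Toy region-word contrastive loss (sketch, not the GLIPv2 loss).

    region_feats: (R, D) visual features for R candidate regions.
    word_feats:   (W, D) text features for W tokens/phrases in the caption.
    labels:       (R,) index of the matching word for each region,
                  or -1 for regions with no grounded phrase.
    """
    # L2-normalize so the dot product is a cosine similarity.
    region_feats = F.normalize(region_feats, dim=-1)
    word_feats = F.normalize(word_feats, dim=-1)

    # (R, W) region-to-word similarity logits.
    logits = region_feats @ word_feats.t() / temperature

    # Only regions grounded to some word contribute to the loss.
    matched = labels >= 0
    if matched.sum() == 0:
        return logits.new_zeros(())
    return F.cross_entropy(logits[matched], labels[matched])


# Example usage with random features (8 regions, 12 words, 256-dim).
regions = torch.randn(8, 256)
words = torch.randn(12, 256)
labels = torch.randint(-1, 12, (8,))
print(region_word_contrastive_loss(regions, words, labels))
```

Framing detection this way is what lets the same model do phrase grounding: class names are just words in a text prompt, so detection and grounding share one region-word alignment head.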
Benchmarks
| Benchmark | Model | Metric |
|---|---|---|
| Object Detection on COCO | GLIPv2 (CoSwin-H, multi-scale) | box mAP: 62.4 |
| Object Detection on LVIS v1.0 (minival) | GLIPv2 | box AP: 59.8 |
| Object Detection on ODinW (full-shot, 13 tasks) | GLIPv2 | AP: 70.4 |
| Phrase Grounding on Flickr30k Entities (test) | GLIPv2 | R@1: 87.7 |
| Referring Expression Segmentation | GLIPv2 | Mean IoU: 61.3 |