Abstract
This paper presents a grounded language-image pre-training (GLIP) model for learning object-level, language-aware, and semantic-rich visual representations. GLIP unifies object detection and phrase grounding for pre-training. The unification brings two benefits: 1) it allows GLIP to learn from both detection and grounding data to improve both tasks and bootstrap a good grounding model; 2) GLIP can leverage massive image-text pairs by generating grounding boxes in a self-training fashion, making the learned representation semantic-rich. In our experiments, we pre-train GLIP on 27M grounding data, including 3M human-annotated and 24M web-crawled image-text pairs. The learned representations demonstrate strong zero-shot and few-shot transferability to various object-level recognition tasks. 1) When directly evaluated on COCO and LVIS (without seeing any images in COCO during pre-training), GLIP achieves 49.8 AP and 26.9 AP, respectively, surpassing many supervised baselines. 2) After fine-tuning on COCO, GLIP achieves 60.8 AP on val and 61.5 AP on test-dev, surpassing prior SoTA. 3) When transferred to 13 downstream object detection tasks, a 1-shot GLIP rivals a fully supervised Dynamic Head. Code is released at https://github.com/microsoft/GLIP.
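The abstract says GLIP unifies object detection and phrase grounding. One way to picture that unification is to treat detection as grounding against a text prompt built from the class names, and to score each image region against each prompt token. The sketch below is a toy illustration under stated assumptions, not the released implementation: the function names, the whitespace tokenizer, and the hand-written 3-dimensional feature vectors are all hypothetical stand-ins for a real text encoder and region features.

```python
import math

def build_prompt(class_names):
    """Join class names into one prompt; return it with each name's character span."""
    spans, cursor = [], 0
    for name in class_names:
        spans.append((cursor, cursor + len(name)))
        cursor += len(name) + 2  # skip the ". " separator
    return ". ".join(class_names) + ".", spans

def tokenize(class_names):
    """Whitespace-tokenize each class name and remember which class owns each token."""
    tokens, owner = [], []
    for idx, name in enumerate(class_names):
        for word in name.split():
            tokens.append(word)
            owner.append(idx)
    return tokens, owner

def alignment_scores(region_feats, token_feats):
    """Dot product between every region feature and every token feature."""
    return [[sum(r * t for r, t in zip(region, token)) for token in token_feats]
            for region in region_feats]

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def classify_regions(scores, owner, num_classes):
    """Average per-token probabilities within each class phrase, then pick the best class."""
    labels = []
    for row in scores:
        sums = [0.0] * num_classes
        counts = [0] * num_classes
        for score, cls in zip(row, owner):
            sums[cls] += sigmoid(score)
            counts[cls] += 1
        means = [s / c for s, c in zip(sums, counts)]
        labels.append(max(range(num_classes), key=means.__getitem__))
    return labels

# Toy inputs (assumed values, chosen only for illustration).
class_names = ["person", "bicycle", "hair drier"]
prompt, spans = build_prompt(class_names)
tokens, owner = tokenize(class_names)
token_feats = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0], [0.0, 0.0, 1.0]]
region_feats = [[0.9, 0.1, 0.0], [0.0, 0.2, 0.8]]

scores = alignment_scores(region_feats, token_feats)
labels = [class_names[i] for i in classify_regions(scores, owner, len(class_names))]
print(prompt)
print(labels)
```

With these toy vectors the first region is labelled "person" and the second "hair drier". The point of the design is that the label set lives in the text prompt rather than in a fixed classifier head, so the same scoring path can serve both detection categories and free-form grounding phrases.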
Benchmarks
| Benchmark | Model | Metrics |
|---|---|---|
| 2d-object-detection-on-rf100 | GLIP | Average mAP: 0.112 |
| described-object-detection-on-description | GLIP-T | Intra-scenario ABS mAP: 21.5; Intra-scenario FULL mAP: 19.1; Intra-scenario PRES mAP: 18.3 |
| few-shot-object-detection-on-odinw-13 | GLIP-T | Average Score: 50.7 |
| few-shot-object-detection-on-odinw-35 | GLIP-T | Average Score: 38.9 |
| object-detection-on-coco | GLIP (Swin-L, multi-scale) | box mAP: 61.5; AP50: 79.5; AP75: 67.7; APS: 45.3; APM: 64.9; APL: 75.0 |
| object-detection-on-coco-minival | GLIP (Swin-L, multi-scale) | box AP: 60.8 |
| object-detection-on-coco-o | GLIP-L (Swin-L) | Average mAP: 48.0; Effective Robustness: 24.89 |
| object-detection-on-coco-o | GLIP-T (Swin-T) | Average mAP: 29.1; Effective Robustness: 8.11 |
| object-detection-on-odinw-full-shot-13-tasks | GLIP | AP: 68.9 |
| phrase-grounding-on-flickr30k-entities-test | GLIP | R@1: 87.1; R@5: 96.9; R@10: 98.1 |
| zero-shot-object-detection-on-lvis-v1-0 | GLIP-L | AP: 37.3 |
| zero-shot-object-detection-on-lvis-v1-0-val | GLIP-L | AP: 26.9 |