Grounded Language-Image Pre-training

Abstract

This paper presents a grounded language-image pre-training (GLIP) model for learning object-level, language-aware, and semantic-rich visual representations. GLIP unifies object detection and phrase grounding for pre-training. The unification brings two benefits: 1) it allows GLIP to learn from both detection and grounding data to improve both tasks and bootstrap a good grounding model; 2) GLIP can leverage massive image-text pairs by generating grounding boxes in a self-training fashion, making the learned representation semantic-rich. In our experiments, we pre-train GLIP on 27M grounding data, including 3M human-annotated and 24M web-crawled image-text pairs. The learned representations demonstrate strong zero-shot and few-shot transferability to various object-level recognition tasks. 1) When directly evaluated on COCO and LVIS (without seeing any images in COCO during pre-training), GLIP achieves 49.8 AP and 26.9 AP, respectively, surpassing many supervised baselines. 2) After fine-tuned on COCO, GLIP achieves 60.8 AP on val and 61.5 AP on test-dev, surpassing prior SoTA. 3) When transferred to 13 downstream object detection tasks, a 1-shot GLIP rivals with a fully-supervised Dynamic Head. Code is released at https://github.com/microsoft/GLIP.
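To make the unification concrete: the GLIP paper recasts detection as grounding by writing the category names into a single text prompt and scoring each candidate region against the tokens of that prompt. The sketch below illustrates the idea with toy vectors only. The function names (`build_prompt`, `alignment_scores`, `classify_regions`), the word-level tokenization, and the mean pooling over a phrase's tokens are assumptions made for illustration and are not taken from the released code.

```python
from typing import Dict, List, Tuple

Vector = List[float]


def build_prompt(class_names: List[str]) -> Tuple[str, Dict[str, Tuple[int, int]]]:
    """Join class names into one prompt and record each name's word span.

    Assumption: one token per whitespace-separated word (real models use
    sub-word tokenizers).
    """
    spans: Dict[str, Tuple[int, int]] = {}
    n_words = 0
    for name in class_names:
        width = len(name.split())
        spans[name] = (n_words, n_words + width)  # half-open [start, end)
        n_words += width
    return ". ".join(class_names), spans


def alignment_scores(regions: List[Vector], tokens: List[Vector]) -> List[List[float]]:
    """Dot product between every region feature and every token feature."""
    return [[sum(r * t for r, t in zip(region, token)) for token in tokens]
            for region in regions]


def classify_regions(scores: List[List[float]],
                     spans: Dict[str, Tuple[int, int]]) -> List[str]:
    """Label each region with the class whose tokens align best on average."""
    labels = []
    for row in scores:
        def phrase_score(name: str) -> float:
            start, end = spans[name]
            return sum(row[start:end]) / (end - start)
        labels.append(max(spans, key=phrase_score))
    return labels


# Toy usage: two regions, prompt "person. hair dryer" (three word tokens).
prompt, spans = build_prompt(["person", "hair dryer"])
token_feats = [[1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]   # hypothetical text features
region_feats = [[0.9, 0.1], [0.2, 0.8]]              # hypothetical visual features
print(prompt)
print(classify_regions(alignment_scores(region_feats, token_feats), spans))
```

Because classification becomes region-to-word matching, the same scoring applies whether the text is a list of class names (detection) or a free-form caption (grounding), which is what lets both kinds of data feed one pre-training objective.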

Code Repositories

brown-palm/ObjectPrompt (PyTorch; mentioned in GitHub)
microsoft/GLIP (official; PyTorch; mentioned in GitHub)
rsCPSyEu/ovd_cod (PyTorch; mentioned in GitHub)

Benchmarks

| Benchmark | Methodology | Metrics |
|---|---|---|
| 2d-object-detection-on-rf100 | GLIP | Average mAP: 0.112 |
| described-object-detection-on-description | GLIP-T | Intra-scenario ABS mAP: 21.5; Intra-scenario FULL mAP: 19.1; Intra-scenario PRES mAP: 18.3 |
| few-shot-object-detection-on-odinw-13 | GLIP-T | Average Score: 50.7 |
| few-shot-object-detection-on-odinw-35 | GLIP-T | Average Score: 38.9 |
| object-detection-on-coco | GLIP (Swin-L, multi-scale) | AP50: 79.5; AP75: 67.7; APL: 75.0; APM: 64.9; APS: 45.3; box mAP: 61.5 |
| object-detection-on-coco-minival | GLIP (Swin-L, multi-scale) | box AP: 60.8 |
| object-detection-on-coco-o | GLIP-L (Swin-L) | Average mAP: 48.0; Effective Robustness: 24.89 |
| object-detection-on-coco-o | GLIP-T (Swin-T) | Average mAP: 29.1; Effective Robustness: 8.11 |
| object-detection-on-odinw-full-shot-13-tasks | GLIP | AP: 68.9 |
| phrase-grounding-on-flickr30k-entities-test | GLIP | R@1: 87.1; R@5: 96.9; R@10: 98.1 |
| zero-shot-object-detection-on-lvis-v1-0 | GLIP-L | AP: 37.3 |
| zero-shot-object-detection-on-lvis-v1-0-val | GLIP-L | AP: 26.9 |
