Grounded Language-Image Pre-training

Li Liunian Harold ; Zhang Pengchuan ; Zhang Haotian ; Yang Jianwei ; Li Chunyuan ; Zhong Yiwu ; Wang Lijuan ; Yuan Lu ; Zhang Lei ; Hwang

Abstract

This paper presents a grounded language-image pre-training (GLIP) model for learning object-level, language-aware, and semantic-rich visual representations. GLIP unifies object detection and phrase grounding for pre-training. The unification brings two benefits: 1) it allows GLIP to learn from both detection and grounding data to improve both tasks and bootstrap a good grounding model; 2) GLIP can leverage massive image-text pairs by generating grounding boxes in a self-training fashion, making the learned representation semantic-rich. In our experiments, we pre-train GLIP on 27M grounding data, including 3M human-annotated and 24M web-crawled image-text pairs. The learned representations demonstrate strong zero-shot and few-shot transferability to various object-level recognition tasks. 1) When directly evaluated on COCO and LVIS (without seeing any images in COCO during pre-training), GLIP achieves 49.8 AP and 26.9 AP, respectively, surpassing many supervised baselines. 2) After fine-tuning on COCO, GLIP achieves 60.8 AP on val and 61.5 AP on test-dev, surpassing prior SoTA. 3) When transferred to 13 downstream object detection tasks, a 1-shot GLIP rivals a fully-supervised Dynamic Head. Code is released at https://github.com/microsoft/GLIP.
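The abstract does not spell out how detection and phrase grounding are unified. As a rough illustration only, one way to cast detection as grounding is to join the class names into a single text prompt and score each image region against each phrase by a dot product, instead of using a fixed classifier layer. The sketch below is a toy in pure Python, not the released GLIP code: the helper names `build_prompt` and `alignment_scores`, the hand-written 3-dimensional features, and the simplification of one feature vector per class phrase (rather than per sub-word token) are all assumptions made for the example.

```python
def build_prompt(class_names):
    """Join class names into one prompt; return it with each name's character span."""
    spans, cursor = [], 0
    for name in class_names:
        spans.append((cursor, cursor + len(name)))
        cursor += len(name) + 2  # skip over the ". " separator
    return ". ".join(class_names), spans


def alignment_scores(region_feats, phrase_feats):
    """Dot product between every region feature and every phrase feature."""
    return [
        [sum(r * p for r, p in zip(region, phrase)) for phrase in phrase_feats]
        for region in region_feats
    ]


classes = ["person", "bicycle", "hair drier"]
prompt, spans = build_prompt(classes)

# Hypothetical features: in a real model these would come from the text and
# image encoders; here they are fixed by hand so the result is easy to follow.
phrase_feats = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
region_feats = [[0.9, 0.1, 0.0], [0.1, 0.2, 0.8]]

scores = alignment_scores(region_feats, phrase_feats)

# Each region is labeled with the phrase it aligns with most strongly.
predicted = [classes[max(range(len(row)), key=row.__getitem__)] for row in scores]
print(prompt)
print(predicted)
```

Because the label set lives in the prompt rather than in the model's output layer, the same scoring step applies to a new set of class names without changing the scoring code, which is the property that zero-shot evaluation relies on.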

Code Repositories

Repository	Framework	Notes
brown-palm/ObjectPrompt	PyTorch	Mentioned in GitHub
microsoft/GLIP	PyTorch	Official; mentioned in GitHub
rsCPSyEu/ovd_cod	PyTorch	Mentioned in GitHub

Benchmarks

Benchmark	Methodology	Metrics
2d-object-detection-on-rf100	GLIP	Average mAP: 0.112
described-object-detection-on-description	GLIP-T	Intra-scenario ABS mAP: 21.5; Intra-scenario FULL mAP: 19.1; Intra-scenario PRES mAP: 18.3
few-shot-object-detection-on-odinw-13	GLIP-T	Average Score: 50.7
few-shot-object-detection-on-odinw-35	GLIP-T	Average Score: 38.9
object-detection-on-coco	GLIP (Swin-L, multi-scale)	box mAP: 61.5; AP50: 79.5; AP75: 67.7; APS: 45.3; APM: 64.9; APL: 75.0
object-detection-on-coco-minival	GLIP (Swin-L, multi-scale)	box AP: 60.8
object-detection-on-coco-o	GLIP-L (Swin-L)	Average mAP: 48.0; Effective Robustness: 24.89
object-detection-on-coco-o	GLIP-T (Swin-T)	Average mAP: 29.1; Effective Robustness: 8.11
object-detection-on-odinw-full-shot-13-tasks	GLIP	AP: 68.9
phrase-grounding-on-flickr30k-entities-test	GLIP	R@1: 87.1; R@5: 96.9; R@10: 98.1
zero-shot-object-detection-on-lvis-v1-0	GLIP-L	AP: 37.3
zero-shot-object-detection-on-lvis-v1-0-val	GLIP-L	AP: 26.9
