Grounded Language-Image Pre-training

Abstract

This paper presents a grounded language-image pre-training (GLIP) model for learning object-level, language-aware, and semantic-rich visual representations. GLIP unifies object detection and phrase grounding for pre-training. The unification brings two benefits: 1) it allows GLIP to learn from both detection and grounding data to improve both tasks and bootstrap a good grounding model; 2) GLIP can leverage massive image-text pairs by generating grounding boxes in a self-training fashion, making the learned representation semantic-rich. In our experiments, we pre-train GLIP on 27M grounding data, including 3M human-annotated and 24M web-crawled image-text pairs. The learned representations demonstrate strong zero-shot and few-shot transferability to various object-level recognition tasks. 1) When directly evaluated on COCO and LVIS (without seeing any images in COCO during pre-training), GLIP achieves 49.8 AP and 26.9 AP, respectively, surpassing many supervised baselines. 2) After fine-tuning on COCO, GLIP achieves 60.8 AP on val and 61.5 AP on test-dev, surpassing prior SoTA. 3) When transferred to 13 downstream object detection tasks, a 1-shot GLIP rivals a fully-supervised Dynamic Head. Code is released at https://github.com/microsoft/GLIP.
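
To make the detection-grounding unification concrete, the sketch below illustrates the general idea of recasting detection as phrase grounding: class names are concatenated into a text prompt, and the fixed classification head is replaced by alignment scores between region features and prompt-token features. This is a minimal, hypothetical sketch; the class names, dimensions, and prompt format are illustrative assumptions and not taken from the released GLIP code.

```python
# Illustrative sketch (not the official GLIP implementation): detection recast as
# phrase grounding by scoring region features against token features of a prompt.
import torch
import torch.nn as nn

class GroundingHead(nn.Module):
    """Computes region-word alignment logits in place of a fixed-class classifier."""
    def __init__(self, region_dim: int, text_dim: int, joint_dim: int = 256):
        super().__init__()
        self.region_proj = nn.Linear(region_dim, joint_dim)  # project visual region features
        self.text_proj = nn.Linear(text_dim, joint_dim)      # project language token features

    def forward(self, region_feats: torch.Tensor, token_feats: torch.Tensor) -> torch.Tensor:
        # region_feats: (num_regions, region_dim) from the detector's visual backbone
        # token_feats:  (num_tokens, text_dim) from a language encoder over a prompt such as
        #               "person. bicycle. car.", where each class name becomes a phrase
        r = self.region_proj(region_feats)
        t = self.text_proj(token_feats)
        # Alignment logits: each candidate region is scored against every prompt token,
        # so "classification" becomes matching regions to words/phrases.
        return r @ t.t()  # (num_regions, num_tokens)

# Hypothetical usage with random tensors standing in for real encoder outputs.
head = GroundingHead(region_dim=1024, text_dim=768)
regions = torch.randn(100, 1024)   # e.g. 100 candidate boxes
tokens = torch.randn(12, 768)      # e.g. 12 tokens from the tokenized prompt
logits = head(regions, tokens)     # supervised with region-word alignment targets
```

Because the "label space" is now just text, the same head can consume human-annotated grounding data, standard detection data, and web image-text pairs with self-generated boxes, which is what makes the learned representation language-aware.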

