HyperAI

SimVLM: Simple Visual Language Model Pretraining with Weak Supervision

Zirui Wang, Jiahui Yu, Adams Wei Yu, Zihang Dai, Yulia Tsvetkov, Yuan Cao

Abstract

With recent progress in joint modeling of visual and textual representations, Vision-Language Pretraining (VLP) has achieved impressive performance on many multimodal downstream tasks. However, the requirement for expensive annotations including clean image captions and regional labels limits the scalability of existing approaches, and complicates the pretraining procedure with the introduction of multiple dataset-specific objectives. In this work, we relax these constraints and present a minimalist pretraining framework, named Simple Visual Language Model (SimVLM). Unlike prior work, SimVLM reduces the training complexity by exploiting large-scale weak supervision, and is trained end-to-end with a single prefix language modeling objective. Without utilizing extra data or task-specific customization, the resulting model significantly outperforms previous pretraining methods and achieves new state-of-the-art results on a wide range of discriminative and generative vision-language benchmarks, including VQA (+3.74% vqa-score), NLVR2 (+1.17% accuracy), SNLI-VE (+1.37% accuracy) and image captioning tasks (+10.1% average CIDEr score). Furthermore, we demonstrate that SimVLM acquires strong generalization and transfer ability, enabling zero-shot behavior including open-ended visual question answering and cross-modality transfer.
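The single pretraining objective mentioned above is prefix language modeling: the model attends bidirectionally over a prefix (the image features plus the leading text tokens) and predicts the remaining text tokens autoregressively. A minimal NumPy sketch of the corresponding attention mask — the function name and shapes are illustrative, not taken from the paper:

```python
import numpy as np

def prefix_lm_mask(seq_len: int, prefix_len: int) -> np.ndarray:
    """Attention mask for prefix language modeling.

    Positions in the prefix (e.g. image patches plus the first
    `prefix_len` text tokens) attend to each other bidirectionally;
    the remaining positions attend causally. 1 = may attend, 0 = masked.
    """
    mask = np.tril(np.ones((seq_len, seq_len), dtype=int))  # causal base
    mask[:prefix_len, :prefix_len] = 1  # bidirectional within the prefix
    return mask

# A sequence of 5 tokens with a 2-token prefix: the first two rows are
# fully connected to each other, the rest follow the causal pattern.
mask = prefix_lm_mask(seq_len=5, prefix_len=2)
```

During training, the cross-entropy loss would be computed only on the suffix tokens, which is what lets one objective cover both understanding and generation tasks.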

Code Repositories

- yulong-XJTU/SimVLM (PyTorch)
- FerryHuang/SimVLM (PyTorch)

Benchmarks

| Benchmark | Methodology | Metrics |
| --- | --- | --- |
| image-captioning-on-coco-captions | SimVLM | BLEU-4: 40.6; CIDEr: 143.3; METEOR: 33.4; SPICE: 25.4 |
| image-captioning-on-nocaps-entire | Single Model | B1: 83.78; B2: 68.86; B3: 51.06; B4: 32.2; CIDEr: 110.31; METEOR: 30.55; ROUGE-L: 59.86; SPICE: 14.49 |
| image-captioning-on-nocaps-in-domain | Single Model | B1: 84.64; B2: 70.0; B3: 52.96; B4: 34.66; CIDEr: 108.98; METEOR: 31.97; ROUGE-L: 61.01; SPICE: 14.6 |
| image-captioning-on-nocaps-near-domain | Single Model | B1: 84.36; B2: 69.83; B3: 52.42; B4: 33.74; CIDEr: 110.76; METEOR: 30.97; ROUGE-L: 60.46; SPICE: 14.61 |
| image-captioning-on-nocaps-out-of-domain | Single Model | B1: 80.89; B2: 64.21; B3: 44.38; B4: 24.47; CIDEr: 109.49; METEOR: 27.91; ROUGE-L: 56.69; SPICE: 13.89 |
| image-captioning-on-nocaps-val-in-domain | SimVLM | CIDEr: 113.7; Pretrain (#images): 1.8B; SPICE: - |
| image-captioning-on-nocaps-val-near-domain | SimVLM | CIDEr: 110.9; Pretrain (#images): 1.8B; SPICE: - |
| image-captioning-on-nocaps-val-out-domain | SimVLM | CIDEr: 115.2; Pretrain (#images): 1.8B; SPICE: - |
| image-captioning-on-nocaps-val-overall | SimVLM | CIDEr: 112.2; Pretrain (#images): 1.8B; SPICE: - |
| visual-entailment-on-snli-ve-test | SimVLM | Accuracy: 86.32 |
| visual-entailment-on-snli-ve-val | SimVLM | Accuracy: 86.21 |
| visual-question-answering-on-vqa-v2-test-dev | SimVLM | Accuracy: 80.03 |
| visual-question-answering-on-vqa-v2-test-std | SimVLM | Overall: 80.34 |
| visual-reasoning-on-nlvr2-dev | SimVLM | Accuracy: 84.53 |
| visual-reasoning-on-nlvr2-test | SimVLM | Accuracy: 85.15 |
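The VQA v2 entries above report the standard vqa-score, under which a predicted answer counts as fully correct if at least 3 of the 10 human annotators gave it, and as partially correct otherwise. A minimal sketch of the simplified form of the metric (the official evaluation also normalizes answer strings and averages over annotator subsets; `vqa_score` is an illustrative helper name):

```python
def vqa_score(predicted: str, human_answers: list[str]) -> float:
    """Simplified VQA accuracy for one question.

    An answer matching >= 3 of the 10 human annotations scores 1.0;
    fewer matches score proportionally (matches / 3).
    """
    matches = sum(answer == predicted for answer in human_answers)
    return min(matches / 3.0, 1.0)
```

The per-question scores are averaged over the test set to produce the benchmark numbers reported above.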
