Enabling Multimodal Generation on CLIP via Vision-Language Knowledge Distillation

Anonymous


Abstract

Recent large-scale vision-language pre-training (VLP) of dual-stream architectures (e.g., CLIP) on a tremendous amount of image-text pair data has shown its superiority on various multimodal alignment tasks. Despite this success, the resulting models are not capable of performing generative multimodal tasks due to the weak text encoder. To tackle this problem, we propose to augment the dual-stream VLP model with a textual pre-trained language model (PLM) via vision-language knowledge distillation (VLKD), enabling the capability for multimodal generation. VLKD is highly data- and computation-efficient compared with pre-training from scratch. Experimental results show that the resulting model has strong zero-shot performance on multimodal generation tasks, such as open-ended visual question answering and image captioning. For example, it achieves 39.7% zero-shot accuracy on the VQA 2.0 dataset, surpassing the previous state-of-the-art zero-shot model with 14x fewer parameters. Furthermore, the original text processing ability of the PLM is maintained after VLKD, which makes our model versatile for both multimodal and unimodal tasks.
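The abstract does not spell out the distillation objective, but the core idea of aligning a trainable student encoder (the PLM) to a frozen teacher encoder (e.g., CLIP's text tower) can be sketched as an embedding-alignment loss. The function names, the cosine-distance loss, and the toy data below are illustrative assumptions, not the paper's actual method:

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-8):
    """Normalize embeddings to unit length along the last axis."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def distillation_loss(student_emb, teacher_emb):
    """Mean (1 - cosine similarity) between student and frozen-teacher
    sentence embeddings -- a generic alignment loss assumed here for
    illustration; the paper's exact objective may differ."""
    s = l2_normalize(student_emb)
    t = l2_normalize(teacher_emb)
    return float(np.mean(1.0 - np.sum(s * t, axis=-1)))

# Toy batch: 4 sentences embedded into an 8-dim space.
rng = np.random.default_rng(0)
teacher = rng.normal(size=(4, 8))                   # frozen teacher outputs
student = teacher + 0.1 * rng.normal(size=(4, 8))   # nearly aligned student

print(distillation_loss(student, teacher))  # small positive value
print(distillation_loss(teacher, teacher))  # approximately 0 when aligned
```

Minimizing such a loss over paired text batches pulls the student's embedding space toward the teacher's, which is the general shape of cross-encoder distillation; the actual VLKD training recipe should be taken from the paper itself.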

Benchmarks

| Benchmark | Methodology | Metrics |
| --- | --- | --- |
| image-captioning-on-coco-captions | VLKD (ViT-B/16) | BLEU-4: 16.7, CIDEr: 58.3, METEOR: 19.7, SPICE: 13.4 |
| visual-question-answering-on-ok-vqa | VLKD (ViT-B/16) | Accuracy: 10.5 |
| visual-question-answering-on-vqa-v2-test-dev | VLKD | Accuracy: 44.5 |
| visual-question-answering-on-vqa-v2-val | VLKD (ViT-B/16) | Accuracy: 38.6 |
