OmniVL: One Foundation Model for Image-Language and Video-Language Tasks

Junke Wang, Dongdong Chen, Zuxuan Wu, Chong Luo, Luowei Zhou, Yucheng Zhao, Yujia Xie, Ce Liu, Yu-Gang Jiang, Lu Yuan

Abstract

This paper presents OmniVL, a new foundation model that supports both image-language and video-language tasks with one universal architecture. It adopts a unified transformer-based visual encoder for both image and video inputs, and thus can perform joint image-language and video-language pretraining. We demonstrate, for the first time, that such a paradigm benefits both image and video tasks, as opposed to the conventional one-directional transfer (e.g., using image-language to help video-language). To this end, we propose a decoupled joint pretraining of image-language and video-language to effectively decompose vision-language modeling into spatial and temporal dimensions and obtain a performance boost on both image and video tasks. Moreover, we introduce a novel unified vision-language contrastive (UniVLC) loss to leverage image-text, video-text, image-label (e.g., image classification), and video-label (e.g., video action recognition) data together, so that both supervised and noisily supervised pretraining data are utilized as much as possible. Without incurring extra task-specific adaptors, OmniVL can simultaneously support visual-only tasks (e.g., image classification, video action recognition), cross-modal alignment tasks (e.g., image/video-text retrieval), and multi-modal understanding and generation tasks (e.g., image/video question answering, captioning). We evaluate OmniVL on a wide range of downstream tasks and achieve state-of-the-art or competitive results with similar model size and data scale.
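The abstract names the UniVLC objective but does not give its formulation, so the following is only a minimal, hypothetical PyTorch sketch of the general idea: a single symmetric contrastive loss over a mixed batch in which pooled image or video embeddings are contrasted against text embeddings, and label supervision (class names turned into text prompts) is handled by allowing several visual samples to share the same positive text. The function name `univlc_loss`, the soft-target construction, and the temperature value are assumptions for illustration, not the paper's definition.

```python
# Hypothetical sketch of a unified vision-language contrastive loss.
# Assumption: labels (e.g. "swimming") are converted into text prompts
# so that image-text, video-text, image-label and video-label data can
# share one objective; samples sharing a text id are mutual positives.
import torch
import torch.nn.functional as F

def univlc_loss(visual_emb, text_emb, text_ids, temperature=0.07):
    """Symmetric contrastive loss over a mixed image/video batch.

    visual_emb : (B, D) pooled image or video embeddings
    text_emb   : (B, D) embeddings of captions or prompted class names
    text_ids   : (B,) integer ids; equal ids mark identical texts
    """
    v = F.normalize(visual_emb, dim=-1)
    t = F.normalize(text_emb, dim=-1)
    logits = v @ t.t() / temperature                      # (B, B) similarities

    # Soft targets: every pair with the same text id is a positive, so
    # label data (many visuals per class name) is handled naturally.
    pos = (text_ids.unsqueeze(1) == text_ids.unsqueeze(0)).float()
    targets = pos / pos.sum(dim=1, keepdim=True)          # rows sum to 1

    loss_v2t = -(targets * F.log_softmax(logits, dim=1)).sum(dim=1).mean()
    loss_t2v = -(targets * F.log_softmax(logits.t(), dim=1)).sum(dim=1).mean()
    return 0.5 * (loss_v2t + loss_t2v)

if __name__ == "__main__":
    # Toy mixed batch of 4 samples; samples 1 and 3 share a label prompt.
    visual = torch.randn(4, 256)
    text = torch.randn(4, 256)
    ids = torch.tensor([0, 1, 2, 1])
    print(univlc_loss(visual, text, ids))
```

In this sketch the only difference from a standard image-text contrastive loss is the soft-target matrix, which is what lets supervised label data and noisy web captions be mixed in one batch.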

Benchmarks

| Benchmark | Methodology | Metrics |
| --- | --- | --- |
| action-classification-on-kinetics-400 | OmniVL | Acc@1: 79.1, Acc@5: 94.5 |
| action-recognition-in-videos-on-something | OmniVL | Top-1 Accuracy: 62.5, Top-5 Accuracy: 86.2 |
| cross-modal-retrieval-on-coco-2014 | OmniVL (14M) | Image-to-text R@1: 82.1, R@5: 95.9, R@10: 98.1; Text-to-image R@1: 64.8, R@5: 86.1, R@10: 91.6 |
| cross-modal-retrieval-on-flickr30k | OmniVL (14M) | Image-to-text R@1: 97.3, R@5: 99.9, R@10: 100; Text-to-image R@1: 87.9, R@5: 97.8, R@10: 99.1 |
| image-captioning-on-nocaps-val-in-domain | OmniVL | CIDEr: 104.6, SPICE: 15.0, Pre-train (#images): 14M |
| image-captioning-on-nocaps-val-near-domain | OmniVL | CIDEr: 108.3, SPICE: 14.9, Pre-train (#images): 14M |
| image-captioning-on-nocaps-val-out-domain | OmniVL | CIDEr: 106.3, SPICE: 14.2, Pre-train (#images): 14M |
| image-captioning-on-nocaps-val-overall | OmniVL | CIDEr: 107.5, SPICE: 14.7, Pre-train (#images): 14M |
| video-captioning-on-youcook2 | OmniVL | BLEU-3: 12.87, BLEU-4: 8.72, CIDEr: 1.16, METEOR: 14.83, ROUGE-L: 36.09 |
| video-retrieval-on-didemo | OmniVL | text-to-video R@1: 52.4, R@5: 79.5, R@10: 85.4 |
| video-retrieval-on-msr-vtt | OmniVL | text-to-video R@1: 47.8, R@5: 74.2, R@10: 83.8 |
| visual-question-answering-on-msrvtt-qa-1 | OmniVL | Accuracy: 0.441 |
| visual-question-answering-on-msvd-qa-1 | OmniVL | Accuracy: 0.510 |
| zero-shot-video-retrieval-on-didemo | OmniVL | text-to-video R@1: 33.3, R@5: 58.7, R@10: 68.5 |
| zero-shot-video-retrieval-on-msr-vtt | OmniVL | text-to-video R@1: 34.6, R@5: 58.4, R@10: 66.6 |
