OmniVL: One Foundation Model for Image-Language and Video-Language Tasks
Junke Wang, Dongdong Chen, Zuxuan Wu, Chong Luo, Luowei Zhou, Yucheng Zhao, Yujia Xie, Ce Liu, Yu-Gang Jiang, Lu Yuan

Abstract
This paper presents OmniVL, a new foundation model that supports both image-language and video-language tasks with one universal architecture. It adopts a unified transformer-based visual encoder for both image and video inputs, and can thus perform joint image-language and video-language pretraining. We demonstrate, for the first time, that such a paradigm benefits both image and video tasks, as opposed to the conventional one-directional transfer (e.g., using image-language pretraining to help video-language tasks). To this end, we propose decoupled joint pretraining of image-language and video-language, which effectively decomposes vision-language modeling into spatial and temporal dimensions and boosts performance on both image and video tasks. Moreover, we introduce a novel unified vision-language contrastive (UniVLC) loss that leverages image-text, video-text, image-label (e.g., image classification), and video-label (e.g., video action recognition) data together, so that both supervised and noisily supervised pretraining data are exploited as much as possible. Without extra task-specific adaptors, OmniVL can simultaneously support visual-only tasks (e.g., image classification, video action recognition), cross-modal alignment tasks (e.g., image/video-text retrieval), and multi-modal understanding and generation tasks (e.g., image/video question answering, captioning). We evaluate OmniVL on a wide range of downstream tasks and achieve state-of-the-art or competitive results compared with methods of similar model size and data scale.
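The UniVLC loss is the piece of the abstract most amenable to a compact illustration: manually labeled data (image classification, video action recognition) can join web image-text and video-text pairs in a single contrastive objective by rendering each label as a prompt text and treating every sample that shares a label as a positive. Below is a minimal PyTorch sketch of this idea under stated assumptions, not the authors' implementation; the function name, tensor shapes, and the dummy-label convention for web pairs are illustrative choices.

```python
# Minimal sketch of a unified vision-language contrastive objective in the
# spirit of UniVLC (not the authors' code). Assumptions: the visual/text
# encoders already produced (B, D) features; class labels were verbalized
# into prompt texts (e.g., "a photo of a golden retriever"); each web
# image/video-text pair carries a unique dummy label so it is only its own
# positive, while supervised samples share a class id and are mutual positives.
import torch
import torch.nn.functional as F

def univlc_loss(visual_emb: torch.Tensor,
                text_emb: torch.Tensor,
                labels: torch.Tensor,
                temperature: float = 0.07) -> torch.Tensor:
    v = F.normalize(visual_emb, dim=-1)           # (B, D) visual features
    t = F.normalize(text_emb, dim=-1)             # (B, D) text features
    logits = v @ t.t() / temperature              # (B, B) pairwise similarity

    # Positive-pair matrix: 1 wherever two samples share a label. For web
    # pairs (unique dummy labels) this reduces to the usual InfoNCE diagonal.
    pos = (labels[:, None] == labels[None, :]).float()
    targets = pos / pos.sum(dim=1, keepdim=True)  # spread mass over positives

    # Symmetric (visual-to-text and text-to-visual) cross-entropy against the
    # soft targets; pos is symmetric, so the same targets serve both ways.
    loss_v2t = (-targets * F.log_softmax(logits, dim=1)).sum(dim=1).mean()
    loss_t2v = (-targets * F.log_softmax(logits.t(), dim=1)).sum(dim=1).mean()
    return 0.5 * (loss_v2t + loss_t2v)

# Toy usage: 2 web pairs (dummy labels 1000, 1001) and 2 samples of class 3.
if __name__ == "__main__":
    emb_v, emb_t = torch.randn(4, 256), torch.randn(4, 256)
    labels = torch.tensor([1000, 1001, 3, 3])
    print(univlc_loss(emb_v, emb_t, labels))
```

Row-normalizing the target matrix is what lets label data contribute more than one positive per anchor; with only web pairs the matrix collapses to the identity and the loss reduces to standard CLIP-style image/video-text contrastive learning.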
Benchmarks
| Benchmark | Model | Metrics |
|---|---|---|
| Action Classification on Kinetics-400 | OmniVL | Acc@1: 79.1, Acc@5: 94.5 |
| Action Recognition on Something-Something V2 | OmniVL | Top-1 Accuracy: 62.5, Top-5 Accuracy: 86.2 |
| Cross-Modal Retrieval on COCO 2014 | OmniVL (14M) | Image-to-text R@1: 82.1, R@5: 95.9, R@10: 98.1; Text-to-image R@1: 64.8, R@5: 86.1, R@10: 91.6 |
| Cross-Modal Retrieval on Flickr30k | OmniVL (14M) | Image-to-text R@1: 97.3, R@5: 99.9, R@10: 100; Text-to-image R@1: 87.9, R@5: 97.8, R@10: 99.1 |
| Image Captioning on NoCaps val (in-domain) | OmniVL | CIDEr: 104.6, SPICE: 15.0, Pre-train (#images): 14M |
| Image Captioning on NoCaps val (near-domain) | OmniVL | CIDEr: 108.3, SPICE: 14.9, Pre-train (#images): 14M |
| Image Captioning on NoCaps val (out-of-domain) | OmniVL | CIDEr: 106.3, SPICE: 14.2, Pre-train (#images): 14M |
| Image Captioning on NoCaps val (overall) | OmniVL | CIDEr: 107.5, SPICE: 14.7, Pre-train (#images): 14M |
| Video Captioning on YouCook2 | OmniVL | BLEU-3: 12.87, BLEU-4: 8.72, CIDEr: 1.16, METEOR: 14.83, ROUGE-L: 36.09 |
| Video Retrieval on DiDeMo | OmniVL | Text-to-video R@1: 52.4, R@5: 79.5, R@10: 85.4 |
| Video Retrieval on MSR-VTT | OmniVL | Text-to-video R@1: 47.8, R@5: 74.2, R@10: 83.8 |
| Video Question Answering on MSRVTT-QA | OmniVL | Accuracy: 44.1 |
| Video Question Answering on MSVD-QA | OmniVL | Accuracy: 51.0 |
| Zero-Shot Video Retrieval on DiDeMo | OmniVL | Text-to-video R@1: 33.3, R@5: 58.7, R@10: 68.5 |
| Zero-Shot Video Retrieval on MSR-VTT | OmniVL | Text-to-video R@1: 34.6, R@5: 58.4, R@10: 66.6 |