VAST: A Vision-Audio-Subtitle-Text Omni-Modality Foundation Model and Dataset
Sihan Chen; Handong Li; Qunbo Wang; Zijia Zhao; Mingzhen Sun; Xinxin Zhu; Jing Liu

Abstract
Vision and text have been explored extensively in contemporary video-text foundation models, while other modalities in videos, such as audio and subtitles, have not received sufficient attention. In this paper, we establish connections between multi-modality video tracks, including Vision, Audio, and Subtitle, and Text by exploring an automatically generated large-scale omni-modality video caption dataset called VAST-27M. Specifically, we first collect 27 million open-domain video clips and separately train a vision captioner and an audio captioner to generate vision and audio captions. Then, we employ an off-the-shelf Large Language Model (LLM) to integrate the generated captions, together with subtitles and instructional prompts, into omni-modality captions. Based on the proposed VAST-27M dataset, we train an omni-modality video-text foundation model named VAST, which can perceive and process the vision, audio, and subtitle modalities of a video, and better supports various tasks including vision-text, audio-text, and multi-modal video-text tasks (retrieval, captioning, and QA). Extensive experiments demonstrate the effectiveness of our proposed VAST-27M corpus and VAST foundation model. VAST achieves 22 new state-of-the-art results on various cross-modality benchmarks. Code, model, and dataset will be released at https://github.com/TXH-mercury/VAST.
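The abstract's caption-integration step can be sketched as follows. This is a hypothetical illustration, not the paper's actual implementation: the function name, prompt wording, and LLM interface are assumptions. The idea is that the trained captioners produce a vision caption and an audio caption for each clip, which are packed together with the subtitle into an instructional prompt for an off-the-shelf LLM that writes the final omni-modality caption.

```python
# Illustrative sketch of the VAST-27M caption-integration step.
# All names and prompt text here are hypothetical, not from the paper.

def build_omni_caption_prompt(vision_caption: str,
                              audio_caption: str,
                              subtitle: str) -> str:
    """Assemble an instructional prompt that asks an LLM to fuse the
    per-modality descriptions into one omni-modality caption."""
    return (
        "Describe the video clip in one sentence, combining all cues below.\n"
        f"Vision caption: {vision_caption}\n"
        f"Audio caption: {audio_caption}\n"
        f"Subtitle: {subtitle}\n"
        "Omni-modality caption:"
    )

# Example inputs, as a captioner/ASR pipeline might produce them:
prompt = build_omni_caption_prompt(
    vision_caption="a chef slices vegetables on a wooden board",
    audio_caption="a knife taps rhythmically against a cutting board",
    subtitle="today we are making a simple stir-fry",
)
print(prompt)
```

In the paper's pipeline, the resulting prompt would be sent to the LLM once per clip, yielding the 27M omni-modality captions that VAST is then trained on.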
Benchmarks
| Benchmark | Methodology | Metrics |
|---|---|---|
| audio-captioning-on-audiocaps | VAST | BLEU-4: 0.295 CIDEr: 0.781 METEOR: 0.247 ROUGE-L: 0.509 |
| audio-captioning-on-clotho | VAST | BLEU-4: 19 CIDEr: 0.519 METEOR: 19.3 ROUGE-L: 40.8 |
| cross-modal-retrieval-on-coco-2014 | VAST | Text-to-image R@1: 68.0 Text-to-image R@10: 92.8 Text-to-image R@5: 87.7 |
| cross-modal-retrieval-on-flickr30k | VAST | Text-to-image R@1: 91.0 Text-to-image R@10: 99.5 Text-to-image R@5: 98.5 |
| image-captioning-on-coco-captions | VAST | CIDEr: 149.0 SPICE: 27.0 |
| video-captioning-on-msr-vtt-1 | VAST | BLEU-4: 56.7 CIDEr: 78.0 |
| video-captioning-on-tvc | VAST | BLEU-4: 19.9 CIDEr: 74.1 |
| video-captioning-on-vatex-1 | VAST | BLEU-4: 45.0 CIDEr: 99.5 |
| video-captioning-on-youcook2 | VAST | BLEU-4: 18.2 CIDEr: 1.99 |
| video-question-answering-on-activitynet-qa | VAST | Accuracy: 50.4 |
| video-question-answering-on-msrvtt-qa | VAST | Accuracy: 50.1 |
| video-retrieval-on-activitynet | VAST | text-to-video R@1: 70.5 text-to-video R@10: 95.5 text-to-video R@5: 90.9 |
| video-retrieval-on-didemo | VAST | text-to-video R@1: 72.0 text-to-video R@10: 91.4 text-to-video R@5: 89.0 |
| video-retrieval-on-msr-vtt | VAST | text-to-video R@1: 63.9 text-to-video R@10: 89.6 text-to-video R@5: 84.3 |
| video-retrieval-on-vatex | VAST | text-to-video R@1: 83.0 text-to-video R@10: 99.2 text-to-video R@5: 98.2 |
| video-retrieval-on-youcook2 | VAST | text-to-video R@1: 50.4 text-to-video R@10: 80.8 text-to-video R@5: 74.3 |
| visual-question-answering-on-msvd-qa-1 | VAST | Accuracy: 0.60 |
| zero-shot-cross-modal-retrieval-on-flickr30k | VAST | Text-to-image R@1: 90.4 |
| zero-shot-video-retrieval-on-didemo | VAST | text-to-video R@1: 55.5 text-to-video R@10: 79.6 text-to-video R@5: 74.3 |
| zero-shot-video-retrieval-on-msr-vtt | VAST | text-to-video R@1: 49.3 text-to-video R@10: 73.9 text-to-video R@5: 68.3 |