VAST: A Vision-Audio-Subtitle-Text Omni-Modality Foundation Model and Dataset
Sihan Chen; Handong Li; Qunbo Wang; Zijia Zhao; Mingzhen Sun; Xinxin Zhu; Jing Liu

Abstract
Vision and text have been explored extensively in contemporary video-text foundation models, while other modalities in videos, such as audio and subtitles, have not received sufficient attention. In this paper, we establish connections between multi-modality video tracks, including Vision, Audio, and Subtitle, and Text by exploring an automatically generated large-scale omni-modality video caption dataset called VAST-27M. Specifically, we first collect 27 million open-domain video clips and separately train a vision captioner and an audio captioner to generate vision and audio captions. Then, we employ an off-the-shelf Large Language Model (LLM) to integrate the generated captions, together with subtitles and instructional prompts, into omni-modality captions. Based on the proposed VAST-27M dataset, we train an omni-modality video-text foundation model named VAST, which can perceive and process the vision, audio, and subtitle modalities of a video, and better supports various tasks including vision-text, audio-text, and multi-modal video-text tasks (retrieval, captioning, and QA). Extensive experiments demonstrate the effectiveness of our proposed VAST-27M corpus and VAST foundation model. VAST achieves 22 new state-of-the-art results on various cross-modality benchmarks. Code, model, and dataset will be released at https://github.com/TXH-mercury/VAST.
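The abstract's caption-integration step can be sketched as follows. This is a hypothetical illustration, not the paper's actual implementation: the function name, prompt wording, and LLM interface are assumptions. The idea is that the trained captioners produce a vision caption and an audio caption for each clip, which are packed together with the subtitle into an instructional prompt for an off-the-shelf LLM that writes the final omni-modality caption.

```python
# Illustrative sketch of the VAST-27M caption-integration step.
# All names and prompt text here are hypothetical, not from the paper.

def build_omni_caption_prompt(vision_caption: str,
                              audio_caption: str,
                              subtitle: str) -> str:
    """Assemble an instructional prompt that asks an LLM to fuse the
    per-modality descriptions into one omni-modality caption."""
    return (
        "Describe the video clip in one sentence, combining all cues below.\n"
        f"Vision caption: {vision_caption}\n"
        f"Audio caption: {audio_caption}\n"
        f"Subtitle: {subtitle}\n"
        "Omni-modality caption:"
    )

# Example inputs, as a captioner/ASR pipeline might produce them:
prompt = build_omni_caption_prompt(
    vision_caption="a chef slices vegetables on a wooden board",
    audio_caption="a knife taps rhythmically against a cutting board",
    subtitle="today we are making a simple stir-fry",
)
print(prompt)
```

In the paper's pipeline, the resulting prompt would be sent to the LLM once per clip, yielding the 27M omni-modality captions that VAST is then trained on.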
Benchmarks
| Benchmark | Methodology | Metrics |
|---|---|---|
| audio-captioning-on-audiocaps | VAST | BLEU-4: 0.295 CIDEr: 0.781 METEOR: 0.247 ROUGE-L: 0.509 |
| audio-captioning-on-clotho | VAST | BLEU-4: 19 CIDEr: 0.519 METEOR: 19.3 ROUGE-L: 40.8 |
| cross-modal-retrieval-on-coco-2014 | VAST | Text-to-image R@1: 68.0 Text-to-image R@10: 92.8 Text-to-image R@5: 87.7 |
| cross-modal-retrieval-on-flickr30k | VAST | Text-to-image R@1: 91.0 Text-to-image R@10: 99.5 Text-to-image R@5: 98.5 |
| image-captioning-on-coco-captions | VAST | CIDEr: 149.0 SPICE: 27.0 |
| video-captioning-on-msr-vtt-1 | VAST | BLEU-4: 56.7 CIDEr: 78.0 |
| video-captioning-on-tvc | VAST | BLEU-4: 19.9 CIDEr: 74.1 |
| video-captioning-on-vatex-1 | VAST | BLEU-4: 45.0 CIDEr: 99.5 |
| video-captioning-on-youcook2 | VAST | BLEU-4: 18.2 CIDEr: 1.99 |
| video-question-answering-on-activitynet-qa | VAST | Accuracy: 50.4 |
| video-question-answering-on-msrvtt-qa | VAST | Accuracy: 50.1 |
| video-retrieval-on-activitynet | VAST | text-to-video R@1: 70.5 text-to-video R@10: 95.5 text-to-video R@5: 90.9 |
| video-retrieval-on-didemo | VAST | text-to-video R@1: 72.0 text-to-video R@10: 91.4 text-to-video R@5: 89.0 |
| video-retrieval-on-msr-vtt | VAST | text-to-video R@1: 63.9 text-to-video R@10: 89.6 text-to-video R@5: 84.3 |
| video-retrieval-on-vatex | VAST | text-to-video R@1: 83.0 text-to-video R@10: 99.2 text-to-video R@5: 98.2 |
| video-retrieval-on-youcook2 | VAST | text-to-video R@1: 50.4 text-to-video R@10: 80.8 text-to-video R@5: 74.3 |
| visual-question-answering-on-msvd-qa-1 | VAST | Accuracy: 0.60 |
| zero-shot-cross-modal-retrieval-on-flickr30k | VAST | Text-to-image R@1: 90.4 |
| zero-shot-video-retrieval-on-didemo | VAST | text-to-video R@1: 55.5 text-to-video R@10: 79.6 text-to-video R@5: 74.3 |
| zero-shot-video-retrieval-on-msr-vtt | VAST | text-to-video R@1: 49.3 text-to-video R@10: 73.9 text-to-video R@5: 68.3 |