VAST: A Vision-Audio-Subtitle-Text Omni-Modality Foundation Model and Dataset

Sihan Chen, Handong Li, Qunbo Wang, Zijia Zhao, Mingzhen Sun, Xinxin Zhu, Jing Liu

Abstract

Vision and text have been fully explored in contemporary video-text foundation models, while other modalities in videos, such as audio and subtitles, have not received sufficient attention. In this paper, we seek to establish connections between multi-modality video tracks, including Vision, Audio, and Subtitle, and Text by exploring an automatically generated large-scale omni-modality video caption dataset called VAST-27M. Specifically, we first collect 27 million open-domain video clips and separately train a vision captioner and an audio captioner to generate vision and audio captions. We then employ an off-the-shelf Large Language Model (LLM) to integrate the generated captions, together with subtitles and instructional prompts, into omni-modality captions. Based on the proposed VAST-27M dataset, we train an omni-modality video-text foundation model named VAST, which can perceive and process the vision, audio, and subtitle modalities of a video and better support a range of tasks, including vision-text, audio-text, and multi-modal video-text tasks (retrieval, captioning, and QA). Extensive experiments demonstrate the effectiveness of the proposed VAST-27M corpus and the VAST foundation model: VAST achieves 22 new state-of-the-art results on various cross-modality benchmarks. Code, model, and dataset will be released at https://github.com/TXH-mercury/VAST.
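
To make the caption-integration step concrete, below is a minimal sketch of how the generated vision and audio captions, together with a clip's subtitle, might be merged into one omni-modality caption by an off-the-shelf LLM. The prompt wording, the `build_prompt`/`omni_caption` helpers, and the `llm` callable are illustrative assumptions, not the authors' exact pipeline.

```python
# Minimal sketch of a VAST-27M-style omni-modality caption synthesis step.
# The prompt text and the `llm` callable are illustrative assumptions,
# not the paper's exact implementation.

from typing import Callable

def build_prompt(vision_caption: str, audio_caption: str, subtitle: str) -> str:
    """Combine the three single-modality descriptions into one
    instructional prompt for an off-the-shelf LLM."""
    return (
        "Write a single natural caption describing the video clip, "
        "integrating all of the following information.\n"
        f"Visual content: {vision_caption}\n"
        f"Audio content: {audio_caption}\n"
        f"Spoken subtitle: {subtitle}\n"
        "Omni-modality caption:"
    )

def omni_caption(llm: Callable[[str], str],
                 vision_caption: str, audio_caption: str, subtitle: str) -> str:
    """Run one clip through the integration step; `llm` is any
    text-in/text-out model client."""
    return llm(build_prompt(vision_caption, audio_caption, subtitle))

if __name__ == "__main__":
    # Stub LLM so the sketch runs without an API key.
    echo = lambda prompt: "(LLM output for) " + prompt.splitlines()[-1]
    print(omni_caption(echo,
                       vision_caption="a man chops onions in a kitchen",
                       audio_caption="rhythmic knife taps over soft music",
                       subtitle="today we're making French onion soup"))
```

In practice the stub would be replaced by a real LLM client, and the function applied once per clip across the 27 million videos.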

Code Repositories

TXH-mercury/VALOR (PyTorch; mentioned on GitHub)
txh-mercury/vast (official implementation; PyTorch)

Benchmarks

All results below are reported for the VAST model.

Audio Captioning on AudioCaps: BLEU-4 0.295, CIDEr 0.781, METEOR 0.247, ROUGE-L 0.509
Audio Captioning on Clotho: BLEU-4 19, CIDEr 0.519, METEOR 19.3, ROUGE-L 40.8
Cross-Modal Retrieval on COCO 2014: text-to-image R@1 68.0, R@5 87.7, R@10 92.8
Cross-Modal Retrieval on Flickr30k: text-to-image R@1 91.0, R@5 98.5, R@10 99.5
Image Captioning on COCO Captions: CIDEr 149.0, SPICE 27.0
Video Captioning on MSR-VTT: BLEU-4 56.7, CIDEr 78.0
Video Captioning on TVC: BLEU-4 19.9, CIDEr 74.1
Video Captioning on VATEX: BLEU-4 45.0, CIDEr 99.5
Video Captioning on YouCook2: BLEU-4 18.2, CIDEr 1.99
Video Question Answering on ActivityNet-QA: Accuracy 50.4
Video Question Answering on MSRVTT-QA: Accuracy 50.1
Video Retrieval on ActivityNet: text-to-video R@1 70.5, R@5 90.9, R@10 95.5
Video Retrieval on DiDeMo: text-to-video R@1 72.0, R@5 89.0, R@10 91.4
Video Retrieval on MSR-VTT: text-to-video R@1 63.9, R@5 84.3, R@10 89.6
Video Retrieval on VATEX: text-to-video R@1 83.0, R@5 98.2, R@10 99.2
Video Retrieval on YouCook2: text-to-video R@1 50.4, R@5 74.3, R@10 80.8
Visual Question Answering on MSVD-QA: Accuracy 0.60
Zero-Shot Cross-Modal Retrieval on Flickr30k: text-to-image R@1 90.4
Zero-Shot Video Retrieval on DiDeMo: text-to-video R@1 55.5, R@5 74.3, R@10 79.6
Zero-Shot Video Retrieval on MSR-VTT: text-to-video R@1 49.3, R@5 68.3, R@10 73.9
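
For reference, the R@K numbers above follow the standard Recall@K definition for retrieval: the fraction of queries whose ground-truth match ranks within the top K retrieved candidates. Below is a minimal sketch, assuming a precomputed query-by-candidate similarity matrix with each query's ground-truth candidate on the diagonal; the `recall_at_k` helper is illustrative, not code from the VAST repository.

```python
import numpy as np

def recall_at_k(sim: np.ndarray, k: int) -> float:
    """Recall@K for a [num_queries, num_candidates] similarity matrix
    where query i's ground-truth candidate is index i (the diagonal)."""
    # Rank of the ground-truth candidate = number of candidates
    # scored strictly higher than it (ties ignored in this sketch).
    ranks = (sim > np.diag(sim)[:, None]).sum(axis=1)
    return float((ranks < k).mean())

if __name__ == "__main__":
    # Toy similarity matrix: boost the diagonal so true pairs score higher.
    rng = np.random.default_rng(0)
    sim = rng.standard_normal((100, 100))
    sim[np.arange(100), np.arange(100)] += 2.0
    for k in (1, 5, 10):
        print(f"R@{k}: {100 * recall_at_k(sim, k):.1f}")
```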
