VALOR: Vision-Audio-Language Omni-Perception Pretraining Model and Dataset

Jing Liu, Sihan Chen, Xingjian He, Longteng Guo, Xinxin Zhu, Weining Wang, Jinhui Tang

Abstract

In this paper, we propose a Vision-Audio-Language Omni-peRception pretraining model (VALOR) for multimodal understanding and generation. Unlike widely studied vision-language pretraining models, VALOR jointly models the relationships among vision, audio, and language in an end-to-end manner. It contains three separate encoders for single-modality representations and a decoder for multimodal conditional text generation. We design two pretext tasks to pretrain VALOR: Multimodal Grouping Alignment (MGA) and Multimodal Grouping Captioning (MGC). MGA projects vision, language, and audio into the same common space, building vision-language, audio-language, and audiovisual-language alignment simultaneously. MGC learns to generate text tokens conditioned on vision, audio, or both. To promote vision-audio-language pretraining research, we construct a large-scale, high-quality tri-modality dataset named VALOR-1M, which contains one million audible videos with human-annotated audiovisual captions. Extensive experiments show that VALOR learns strong multimodal correlations and generalizes to various downstream tasks (e.g., retrieval, captioning, and question answering) with different input modalities (e.g., vision-language, audio-language, and audiovisual-language). VALOR achieves new state-of-the-art performance on a series of public cross-modality benchmarks. Code and data are available at the project page: https://casia-iva-group.github.io/projects/VALOR.
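To make the MGA objective concrete, here is a minimal PyTorch sketch. It assumes each encoder already produces one pooled embedding per clip; the symmetric InfoNCE loss, the mean fusion of the audiovisual group, and names like `info_nce` and `mga_loss` are illustrative assumptions, not the official VALOR implementation.

```python
# Minimal sketch of Multimodal Grouping Alignment (MGA), assuming pooled
# per-clip embeddings from each encoder. Illustrative only; not from the
# official TXH-mercury/VALOR codebase.
import torch
import torch.nn.functional as F

def info_nce(x, y, temperature=0.07):
    """Symmetric InfoNCE between two batches of embeddings (positives on the diagonal)."""
    x = F.normalize(x, dim=-1)
    y = F.normalize(y, dim=-1)
    logits = x @ y.t() / temperature                   # (B, B) similarity matrix
    targets = torch.arange(x.size(0), device=x.device)
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.t(), targets)) / 2

def mga_loss(v_emb, a_emb, t_emb):
    """Align three modality groups with text: V-T, A-T, and (V+A)-T."""
    av_emb = (v_emb + a_emb) / 2                       # fused audiovisual group (assumed mean fusion)
    return (info_nce(v_emb, t_emb) +
            info_nce(a_emb, t_emb) +
            info_nce(av_emb, t_emb)) / 3
```

Grouping all three pairings under one loss is what lets a single model serve vision-language, audio-language, and audiovisual-language retrieval at inference time.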

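MGC can be sketched in the same spirit: a text decoder cross-attends to whichever modality group is sampled for the step. The causal-decoding formulation and all module names below are our assumptions for illustration; the paper's exact masking scheme may differ.

```python
# Minimal sketch of Multimodal Grouping Captioning (MGC): generate text
# conditioned on vision, audio, or both. Illustrative only.
import random
import torch
import torch.nn as nn

class MGCHead(nn.Module):
    def __init__(self, vocab_size, d_model=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=4)
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, text_ids, vision_tokens, audio_tokens):
        # Sample the conditioning group for this step: vision, audio, or both.
        group = random.choice(["v", "a", "va"])
        if group == "v":
            memory = vision_tokens
        elif group == "a":
            memory = audio_tokens
        else:
            memory = torch.cat([vision_tokens, audio_tokens], dim=1)
        tgt = self.embed(text_ids)
        mask = nn.Transformer.generate_square_subsequent_mask(tgt.size(1)).to(tgt.device)
        out = self.decoder(tgt, memory, tgt_mask=mask)  # decode with causal mask
        return self.lm_head(out)                        # next-token logits
```
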
Code Repositories

TXH-mercury/VALOR (official, PyTorch)

Benchmarks

Benchmark | Methodology | Metrics
audio-captioning-on-audiocaps | VALOR | BLEU-4: 0.270, CIDEr: 0.741, METEOR: 0.231, ROUGE-L: 0.494
audio-captioning-on-clotho | VALOR | BLEU-4: 16.2, CIDEr: 0.423, METEOR: 17.4, ROUGE-L: 38.2
cross-modal-retrieval-on-coco-2014 | VALOR | Text-to-image R@1: 61.4, R@5: 84.4, R@10: 90.9
image-captioning-on-coco-captions | VALOR | CIDEr: 152.5, SPICE: 25.7
video-captioning-on-msr-vtt-1 | VALOR | BLEU-4: 54.4, CIDEr: 74.0, METEOR: 32.9, ROUGE-L: 68.0
video-captioning-on-msvd-1 | VALOR | BLEU-4: 80.7, CIDEr: 178.5, METEOR: 51.0, ROUGE-L: 87.9
video-captioning-on-vatex-1 | VALOR | BLEU-4: 45.6, CIDEr: 95.8, METEOR: 29.4, ROUGE-L: 57.4
video-question-answering-on-activitynet-qa | VALOR | Accuracy: 48.6
video-question-answering-on-msrvtt-qa | VALOR | Accuracy: 49.2
video-retrieval-on-activitynet | VALOR | Text-to-video R@1: 70.1, R@5: 90.8, R@10: 95.3
video-retrieval-on-didemo | VALOR | Text-to-video R@1: 61.5, R@5: 85.3, R@10: 90.4
video-retrieval-on-lsmdc | VALOR | Text-to-video R@1: 34.2, R@5: 56.0, R@10: 64.1
video-retrieval-on-msr-vtt | VALOR | Text-to-video R@1: 59.9, R@5: 83.5, R@10: 89.6
video-retrieval-on-vatex | VALOR | Text-to-video R@1: 78.5, R@5: 97.1, R@10: 98.7
visual-question-answering-on-msvd-qa-1 | VALOR | Accuracy: 0.60
visual-question-answering-on-vqa-v2-test-dev | VALOR | Accuracy: 78.46
visual-question-answering-on-vqa-v2-test-std | VALOR | Overall: 78.62
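The Recall@K numbers in the retrieval rows measure the fraction of queries whose ground-truth item appears in the top K ranked results. A short, self-contained sketch of that computation (assuming ground truth on the diagonal of the similarity matrix; not from the VALOR codebase):

```python
# Illustrative Recall@K over a query-by-item similarity matrix.
import torch

def recall_at_k(sim, k):
    """sim: (num_queries, num_items); ground-truth match is on the diagonal."""
    ranks = sim.argsort(dim=1, descending=True)       # item indices, best first
    targets = torch.arange(sim.size(0)).unsqueeze(1)  # ground-truth index per query
    hits = (ranks[:, :k] == targets).any(dim=1)       # True if target is in the top K
    return hits.float().mean().item()

sim = torch.randn(100, 100)                           # e.g. text-to-video scores
print(recall_at_k(sim, 1), recall_at_k(sim, 5), recall_at_k(sim, 10))
```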
