VALOR: Vision-Audio-Language Omni-Perception Pretraining Model and Dataset

Jing Liu, Sihan Chen, Xingjian He, Longteng Guo, Xinxin Zhu, Weining Wang, Jinhui Tang

Abstract

In this paper, we propose a Vision-Audio-Language Omni-peRception pretraining model (VALOR) for multimodal understanding and generation. Unlike widely studied vision-language pretraining models, VALOR jointly models the relationships among vision, audio, and language in an end-to-end manner. It contains three separate encoders for single-modality representations and a decoder for multimodal conditional text generation. We design two pretext tasks to pretrain VALOR: Multimodal Grouping Alignment (MGA) and Multimodal Grouping Captioning (MGC). MGA projects vision, language, and audio into the same common space, building vision-language, audio-language, and audiovisual-language alignment simultaneously. MGC learns to generate text tokens conditioned on vision, audio, or both. To promote vision-audio-language pretraining research, we construct a large-scale, high-quality tri-modality dataset named VALOR-1M, which contains one million audible videos with human-annotated audiovisual captions. Extensive experiments show that VALOR learns strong multimodal correlations and generalizes to various downstream tasks (e.g., retrieval, captioning, and question answering) with different input modalities (e.g., vision-language, audio-language, and audiovisual-language). VALOR achieves new state-of-the-art performance on a series of public cross-modality benchmarks. Code and data are available at the project page: https://casia-iva-group.github.io/projects/VALOR.
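To make the MGA objective concrete, here is a minimal PyTorch sketch. It assumes each encoder already produces one pooled embedding per clip; the symmetric InfoNCE loss, the mean fusion of the audiovisual group, and names like `info_nce` and `mga_loss` are illustrative assumptions, not the official VALOR implementation.

```python
# Minimal sketch of Multimodal Grouping Alignment (MGA), assuming pooled
# per-clip embeddings from each encoder. Illustrative only; not from the
# official TXH-mercury/VALOR codebase.
import torch
import torch.nn.functional as F

def info_nce(x, y, temperature=0.07):
    """Symmetric InfoNCE between two batches of embeddings (positives on the diagonal)."""
    x = F.normalize(x, dim=-1)
    y = F.normalize(y, dim=-1)
    logits = x @ y.t() / temperature                   # (B, B) similarity matrix
    targets = torch.arange(x.size(0), device=x.device)
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.t(), targets)) / 2

def mga_loss(v_emb, a_emb, t_emb):
    """Align three modality groups with text: V-T, A-T, and (V+A)-T."""
    av_emb = (v_emb + a_emb) / 2                       # fused audiovisual group (assumed mean fusion)
    return (info_nce(v_emb, t_emb) +
            info_nce(a_emb, t_emb) +
            info_nce(av_emb, t_emb)) / 3
```

Grouping all three pairings under one loss is what lets a single model serve vision-language, audio-language, and audiovisual-language retrieval at inference time.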

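MGC can be sketched in the same spirit: a text decoder cross-attends to whichever modality group is sampled for the step. The causal-decoding formulation and all module names below are our assumptions for illustration; the paper's exact masking scheme may differ.

```python
# Minimal sketch of Multimodal Grouping Captioning (MGC): generate text
# conditioned on vision, audio, or both. Illustrative only.
import random
import torch
import torch.nn as nn

class MGCHead(nn.Module):
    def __init__(self, vocab_size, d_model=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=4)
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, text_ids, vision_tokens, audio_tokens):
        # Sample the conditioning group for this step: vision, audio, or both.
        group = random.choice(["v", "a", "va"])
        if group == "v":
            memory = vision_tokens
        elif group == "a":
            memory = audio_tokens
        else:
            memory = torch.cat([vision_tokens, audio_tokens], dim=1)
        tgt = self.embed(text_ids)
        mask = nn.Transformer.generate_square_subsequent_mask(tgt.size(1)).to(tgt.device)
        out = self.decoder(tgt, memory, tgt_mask=mask)  # decode with causal mask
        return self.lm_head(out)                        # next-token logits
```
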
Code Repositories

TXH-mercury/VALOR (official, PyTorch)

Benchmarks

Benchmark | Methodology | Metrics
audio-captioning-on-audiocaps | VALOR | BLEU-4: 0.270, CIDEr: 0.741, METEOR: 0.231, ROUGE-L: 0.494
audio-captioning-on-clotho | VALOR | BLEU-4: 16.2, CIDEr: 0.423, METEOR: 17.4, ROUGE-L: 38.2
cross-modal-retrieval-on-coco-2014 | VALOR | Text-to-image R@1: 61.4, R@5: 84.4, R@10: 90.9
image-captioning-on-coco-captions | VALOR | CIDEr: 152.5, SPICE: 25.7
video-captioning-on-msr-vtt-1 | VALOR | BLEU-4: 54.4, CIDEr: 74.0, METEOR: 32.9, ROUGE-L: 68.0
video-captioning-on-msvd-1 | VALOR | BLEU-4: 80.7, CIDEr: 178.5, METEOR: 51.0, ROUGE-L: 87.9
video-captioning-on-vatex-1 | VALOR | BLEU-4: 45.6, CIDEr: 95.8, METEOR: 29.4, ROUGE-L: 57.4
video-question-answering-on-activitynet-qa | VALOR | Accuracy: 48.6
video-question-answering-on-msrvtt-qa | VALOR | Accuracy: 49.2
video-retrieval-on-activitynet | VALOR | Text-to-video R@1: 70.1, R@5: 90.8, R@10: 95.3
video-retrieval-on-didemo | VALOR | Text-to-video R@1: 61.5, R@5: 85.3, R@10: 90.4
video-retrieval-on-lsmdc | VALOR | Text-to-video R@1: 34.2, R@5: 56.0, R@10: 64.1
video-retrieval-on-msr-vtt | VALOR | Text-to-video R@1: 59.9, R@5: 83.5, R@10: 89.6
video-retrieval-on-vatex | VALOR | Text-to-video R@1: 78.5, R@5: 97.1, R@10: 98.7
visual-question-answering-on-msvd-qa-1 | VALOR | Accuracy: 0.60
visual-question-answering-on-vqa-v2-test-dev | VALOR | Accuracy: 78.46
visual-question-answering-on-vqa-v2-test-std | VALOR | Overall: 78.62
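The Recall@K numbers in the retrieval rows measure the fraction of queries whose ground-truth item appears in the top K ranked results. A short, self-contained sketch of that computation (assuming ground truth on the diagonal of the similarity matrix; not from the VALOR codebase):

```python
# Illustrative Recall@K over a query-by-item similarity matrix.
import torch

def recall_at_k(sim, k):
    """sim: (num_queries, num_items); ground-truth match is on the diagonal."""
    ranks = sim.argsort(dim=1, descending=True)       # item indices, best first
    targets = torch.arange(sim.size(0)).unsqueeze(1)  # ground-truth index per query
    hits = (ranks[:, :k] == targets).any(dim=1)       # True if target is in the top K
    return hits.float().mean().item()

sim = torch.randn(100, 100)                           # e.g. text-to-video scores
print(recall_at_k(sim, 1), recall_at_k(sim, 5), recall_at_k(sim, 10))
```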
