
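Abstract
Contemporary video-text foundation models have explored vision and text thoroughly, while other modalities in video, such as audio and subtitles, have received far less attention. In this paper, we establish connections between the vision, audio, and subtitle tracks of video and text by building VAST-27M, a large-scale, automatically generated omni-modality video caption dataset. Specifically, we first collect 27 million open-domain video clips and separately train a vision captioner and an audio captioner to generate vision and audio captions. We then use an off-the-shelf large language model (LLM) to integrate the generated captions, together with subtitles and instructional prompts, into omni-modality captions. Based on VAST-27M, we train an omni-modality video-text foundation model named VAST, which can perceive and process the vision, audio, and subtitle modalities of video and better supports a range of tasks, including vision-text, audio-text, and multi-modal video-text tasks (retrieval, captioning, and question answering). Extensive experiments demonstrate the effectiveness of the proposed VAST-27M corpus and VAST foundation model. VAST achieves 22 new state-of-the-art results on various cross-modality benchmarks. Code, model, and dataset will be released at https://github.com/TXH-mercury/VAST.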
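The caption-integration step described in the abstract can be illustrated with a minimal sketch. It only shows how vision captions, audio captions, and a subtitle might be assembled into one instruction prompt for an LLM. The `ClipAnnotations` container, the `build_omni_caption_prompt` helper, and the prompt wording are all hypothetical and are not taken from the VAST code base; the actual LLM call is omitted.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ClipAnnotations:
    """Hypothetical container for the per-clip text that feeds the LLM."""
    vision_captions: List[str]
    audio_captions: List[str]
    subtitle: str


def build_omni_caption_prompt(clip: ClipAnnotations) -> str:
    """Assemble one instruction prompt from the three text sources.

    The wording is illustrative only; it is not the prompt used by the
    VAST authors.
    """
    lines = [
        "Merge the following descriptions of one video clip into a single "
        "fluent caption that covers what is seen, heard, and said.",
        "",
        "Vision captions:",
    ]
    lines += [f"- {c}" for c in clip.vision_captions]
    lines.append("Audio captions:")
    lines += [f"- {c}" for c in clip.audio_captions]
    # Clips without speech have no subtitle, so fall back to a placeholder.
    lines.append(f"Subtitle: {clip.subtitle or '(none)'}")
    return "\n".join(lines)


example = ClipAnnotations(
    vision_captions=["a man plays guitar on a stage"],
    audio_captions=["acoustic guitar music with a cheering crowd"],
    subtitle="thank you all for coming tonight",
)
prompt = build_omni_caption_prompt(example)
print(prompt)
```

The resulting string would then be sent to whichever LLM is available, and the model's reply would serve as the omni-modality caption for that clip.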
Code repositories
TXH-mercury/VALOR
PyTorch
Mentioned in GitHub
txh-mercury/vast
Official
PyTorch
Benchmarks
| Benchmark | Method | Metrics |
|---|---|---|
| audio-captioning-on-audiocaps | VAST | BLEU-4: 0.295 CIDEr: 0.781 METEOR: 0.247 ROUGE-L: 0.509 |
| audio-captioning-on-clotho | VAST | BLEU-4: 19 CIDEr: 0.519 METEOR: 19.3 ROUGE-L: 40.8 |
| cross-modal-retrieval-on-coco-2014 | VAST | Text-to-image R@1: 68.0 Text-to-image R@5: 87.7 Text-to-image R@10: 92.8 |
| cross-modal-retrieval-on-flickr30k | VAST | Text-to-image R@1: 91.0 Text-to-image R@5: 98.5 Text-to-image R@10: 99.5 |
| image-captioning-on-coco-captions | VAST | CIDEr: 149.0 SPICE: 27.0 |
| video-captioning-on-msr-vtt-1 | VAST | BLEU-4: 56.7 CIDEr: 78.0 |
| video-captioning-on-tvc | VAST | BLEU-4: 19.9 CIDEr: 74.1 |
| video-captioning-on-vatex-1 | VAST | BLEU-4: 45.0 CIDEr: 99.5 |
| video-captioning-on-youcook2 | VAST | BLEU-4: 18.2 CIDEr: 1.99 |
| video-question-answering-on-activitynet-qa | VAST | Accuracy: 50.4 |
| video-question-answering-on-msrvtt-qa | VAST | Accuracy: 50.1 |
| video-retrieval-on-activitynet | VAST | text-to-video R@1: 70.5 text-to-video R@5: 90.9 text-to-video R@10: 95.5 |
| video-retrieval-on-didemo | VAST | text-to-video R@1: 72.0 text-to-video R@5: 89.0 text-to-video R@10: 91.4 |
| video-retrieval-on-msr-vtt | VAST | text-to-video R@1: 63.9 text-to-video R@5: 84.3 text-to-video R@10: 89.6 |
| video-retrieval-on-vatex | VAST | text-to-video R@1: 83.0 text-to-video R@5: 98.2 text-to-video R@10: 99.2 |
| video-retrieval-on-youcook2 | VAST | text-to-video R@1: 50.4 text-to-video R@5: 74.3 text-to-video R@10: 80.8 |
| visual-question-answering-on-msvd-qa-1 | VAST | Accuracy: 0.60 |
| zero-shot-cross-modal-retrieval-on-flickr30k | VAST | Text-to-image R@1: 90.4 |
| zero-shot-video-retrieval-on-didemo | VAST | text-to-video R@1: 55.5 text-to-video R@5: 74.3 text-to-video R@10: 79.6 |
| zero-shot-video-retrieval-on-msr-vtt | VAST | text-to-video R@1: 49.3 text-to-video R@5: 68.3 text-to-video R@10: 73.9 |
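
The retrieval rows above report R@K (Recall@K), the percentage of queries whose correct item appears among the top K ranked candidates. A minimal sketch of how this metric is commonly computed from a query-by-candidate similarity matrix follows. It assumes a square matrix in which the correct candidate for query i is candidate i; the `recall_at_k` helper and the toy scores are illustrative and do not come from the VAST evaluation code.

```python
from typing import Dict, List, Sequence


def recall_at_k(
    similarity: Sequence[Sequence[float]],
    ks: Sequence[int] = (1, 5, 10),
) -> Dict[int, float]:
    """Return Recall@K in percent for a square score matrix.

    Assumes the correct candidate for query i sits at column i.
    """
    ranks: List[int] = []
    for i, row in enumerate(similarity):
        target = row[i]
        # Zero-based rank: how many candidates score strictly higher
        # than the correct one (ties are resolved in its favor).
        ranks.append(sum(1 for score in row if score > target))
    n = len(ranks)
    return {k: 100.0 * sum(r < k for r in ranks) / n for k in ks}


# Toy text-to-video scores: 4 text queries against 4 videos.
scores = [
    [0.9, 0.1, 0.3, 0.2],
    [0.8, 0.7, 0.1, 0.2],
    [0.2, 0.9, 0.5, 0.6],
    [0.1, 0.2, 0.3, 0.4],
]
print(recall_at_k(scores, ks=(1, 2, 3)))  # {1: 50.0, 2: 75.0, 3: 100.0}
```

In the toy example, two of the four queries rank their correct video first, one ranks it second, and one ranks it third, which yields the printed values.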