
Abstract
Video-language (VL) pretraining has achieved remarkable improvements on multiple downstream tasks. However, current VL pretraining frameworks are hard to extend to multiple modalities (N modalities, N ≥ 3) beyond vision and language. We therefore propose LanguageBind, which takes language as the bind across different modalities, since the language modality has been well explored and contains rich semantics. Specifically, we freeze the language encoder acquired in VL pretraining, then train encoders for the other modalities with contrastive learning. As a result, all modalities are mapped into a shared feature space, achieving multimodal semantic alignment. While LanguageBind ensures that the VL modalities can be extended to N modalities, a high-quality dataset of language-centered aligned data pairs is also needed. We therefore propose VIDAL-10M, a dataset that pairs video, infrared, depth, and audio with their corresponding language descriptions. In VIDAL-10M, all videos come from short-video platforms and carry complete semantics, rather than being segments truncated from long videos, and every video, depth, infrared, and audio sample is aligned with its textual description. LanguageBind performs strongly on 15 benchmarks covering video, audio, depth, and infrared. Furthermore, multiple experiments provide evidence for LanguageBind's effectiveness in achieving indirect alignment and complementarity across modalities. Code: https://github.com/PKU-YuanGroup/LanguageBind
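The core recipe in the abstract (freeze the language encoder from VL pretraining, then contrastively align each new modality encoder to the frozen text features) can be sketched in a few lines of PyTorch. This is a minimal illustration under stated assumptions, not the repository's actual code: the encoder modules, feature dimensions, and temperature initialization are all placeholders.

```python
import torch
import torch.nn.functional as F

# Minimal sketch of LanguageBind-style alignment (not the official code).
# Assumptions: placeholder encoders, embed dim 512, CLIP-style temperature.
class ModalityEncoder(torch.nn.Module):
    """Hypothetical stand-in for a depth / infrared / audio encoder."""
    def __init__(self, in_dim=1024, embed_dim=512):
        super().__init__()
        self.proj = torch.nn.Linear(in_dim, embed_dim)

    def forward(self, x):
        return self.proj(x)

def bind_loss(modality_emb, text_emb, logit_scale):
    """Symmetric InfoNCE loss binding a modality to frozen text features."""
    m = F.normalize(modality_emb, dim=-1)
    t = F.normalize(text_emb, dim=-1)
    logits = logit_scale * m @ t.t()                    # (B, B) similarities
    labels = torch.arange(logits.size(0), device=logits.device)
    return (F.cross_entropy(logits, labels)             # modality -> text
            + F.cross_entropy(logits.t(), labels)) / 2  # text -> modality

# The language encoder from VL pretraining stays frozen; only the new
# modality encoder and the temperature receive gradients.
text_encoder = torch.nn.Linear(768, 512)  # placeholder for the pretrained text tower
for p in text_encoder.parameters():
    p.requires_grad_(False)

modality_encoder = ModalityEncoder()
logit_scale = torch.nn.Parameter(torch.tensor(1 / 0.07))  # CLIP-style init

optimizer = torch.optim.AdamW(
    list(modality_encoder.parameters()) + [logit_scale], lr=1e-4)

depth_feats = torch.randn(8, 1024)  # dummy batch of depth features
text_feats = torch.randn(8, 768)    # dummy batch of text features
loss = bind_loss(modality_encoder(depth_feats),
                 text_encoder(text_feats), logit_scale)
loss.backward()
optimizer.step()
```

Because the text tower is frozen, every modality trained this way lands in the same language-anchored space, which is what enables the indirect cross-modal alignment (e.g., video-to-audio) the abstract mentions.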
Code Repositories

- pku-yuangroup/video-bench (mentioned in GitHub)
- PKU-YuanGroup/MoE-LLaVA (PyTorch, mentioned in GitHub)
- zhihaozhang97/ru-ai (PyTorch, mentioned in GitHub)
- pku-yuangroup/languagebind (official, PyTorch, mentioned in GitHub)
- PKU-YuanGroup/Video-LLaVA (PyTorch, mentioned in GitHub)
- PKU-YuanGroup/LLMBind (PyTorch, mentioned in GitHub)
Benchmarks
| Benchmark | Method | Metrics |
|---|---|---|
| temporal-relation-extraction-on-vinoground | LanguageBind | Group Score: 1.2, Text Score: 10.6, Video Score: 5 |
| zero-shot-action-recognition-on-kinetics | LanguageBind | Top-1 Accuracy: 64.1, Top-5 Accuracy: 85.7 |
| zero-shot-video-retrieval-on-activitynet | LanguageBind (ViT-L/14) | text-to-video R@1: 38.4, R@5: 66.6, R@10: 77.9; video-to-text R@1: 35.7, R@5: 65.8, R@10: 77.8 |
| zero-shot-video-retrieval-on-activitynet | LanguageBind (ViT-H/14) | text-to-video R@1: 41.0, R@5: 68.4, R@10: 80.0; video-to-text R@1: 39.1, R@5: 69.8, R@10: 81.1 |
| zero-shot-video-retrieval-on-didemo | LanguageBind (ViT-H/14) | text-to-video R@1: 39.9, R@5: 66.1, R@10: 74.6, Median Rank: 2; video-to-text R@1: 39.8, R@5: 67.8, R@10: 76.2 |
| zero-shot-video-retrieval-on-didemo | LanguageBind (ViT-L/14) | text-to-video R@1: 39.7, R@5: 65.5, R@10: 73.8, Median Rank: 2; video-to-text R@1: 38.4, R@5: 66.6, R@10: 77.9 |
| zero-shot-video-retrieval-on-msr-vtt | LanguageBind (ViT-L/14) | text-to-video R@1: 42.8, R@5: 67.5, R@10: 76.0, Median Rank: 2; video-to-text R@1: 38.3, R@5: 65.8, R@10: 77.8, Median Rank: 3 |
| zero-shot-video-retrieval-on-msr-vtt | LanguageBind (ViT-H/14) | text-to-video R@1: 44.8, R@5: 70.0, R@10: 78.7, Median Rank: 2; video-to-text R@1: 40.9, R@5: 66.4, R@10: 75.7, Median Rank: 2 |
| zero-shot-video-retrieval-on-msvd | LanguageBind (ViT-H/14) | text-to-video R@1: 53.9, R@5: 80.4, R@10: 87.8, Median Rank: 1; video-to-text R@1: 72.0, R@5: 91.4, R@10: 96.3, Median Rank: 1 |
| zero-shot-video-retrieval-on-msvd | LanguageBind (ViT-L/14) | text-to-video R@1: 54.1, R@5: 81.1, R@10: 88.1, Median Rank: 1; video-to-text R@1: 69.7, R@5: 91.8, R@10: 97.9, Median Rank: 1 |
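For readers unfamiliar with the retrieval metrics above: R@K is the percentage of queries whose ground-truth match appears among the top K retrieved items, and Median Rank is the median 1-indexed position of the ground-truth match. A minimal sketch of how such numbers are computed, assuming a square similarity matrix where query i's ground truth is item i:

```python
import torch

def retrieval_metrics(sim, ks=(1, 5, 10)):
    """Compute R@K and Median Rank from a (num_texts, num_videos)
    similarity matrix, assuming query i's ground truth is index i."""
    # Sort candidates by similarity, then find the rank of the match.
    order = sim.argsort(dim=1, descending=True)      # indices, best first
    gt = torch.arange(sim.size(0)).unsqueeze(1)      # ground-truth index per query
    ranks = (order == gt).float().argmax(dim=1) + 1  # 1-indexed rank of the match
    metrics = {f"R@{k}": (ranks <= k).float().mean().item() * 100 for k in ks}
    metrics["Median Rank"] = ranks.median().item()
    return metrics

# Text-to-video scores use sim; video-to-text scores use its transpose.
sim = torch.randn(100, 100)  # dummy text-video similarity matrix
print(retrieval_metrics(sim))      # text-to-video
print(retrieval_metrics(sim.t()))  # video-to-text
```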