
X²-VLM: All-In-One Pre-trained Model for Vision-Language Tasks


Abstract

Vision-language pre-training aims to learn alignments between vision and language from large amounts of data. Most existing methods only learn image-text alignments, while some others exploit pre-trained object detectors to model vision-language alignment at the object level. This paper proposes a unified pre-training framework that simultaneously learns multi-grained vision-language alignments and multi-grained localization, enabling joint modeling of vision-language alignment at multiple granularities. Based on this framework, we present X²-VLM, an all-in-one model with a flexible modular architecture that further unifies image-text pre-training and video-text pre-training within a single model. X²-VLM can learn unlimited visual concepts associated with diverse text descriptions. Experimental results show that X²-VLM performs best at both base and large scale on image-text as well as video-text tasks, striking a good trade-off between performance and model size. Moreover, we show that the modular design of X²-VLM yields high transferability, allowing it to be applied to any language or domain. For example, by simply replacing the text encoder with XLM-R, X²-VLM outperforms state-of-the-art multilingual multimodal pre-trained models without any multilingual pre-training. The code and pre-trained models are available at https://github.com/zengyan-97/X2-VLM.
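The abstract does not spell out the training objective, but frameworks for learning vision-language alignment typically build on an image-text contrastive loss. The sketch below is a generic, symmetric InfoNCE-style loss in pure Python, not X²-VLM's exact multi-grained objective; the function name and temperature value are illustrative.

```python
import math

def contrastive_loss(sim, temperature=0.07):
    """Symmetric InfoNCE over a similarity matrix sim[i][j]
    (image i vs. text j), where matched pairs lie on the diagonal."""
    n = len(sim)

    def cross_entropy(rows):
        # Average cross-entropy of each query against its matched index i.
        total = 0.0
        for i, row in enumerate(rows):
            logits = [s / temperature for s in row]
            m = max(logits)  # subtract max for numerical stability
            log_z = m + math.log(sum(math.exp(l - m) for l in logits))
            total += log_z - logits[i]
        return total / n

    # Image-to-text direction uses rows; text-to-image uses columns.
    transposed = [[sim[j][i] for j in range(n)] for i in range(n)]
    return 0.5 * (cross_entropy(sim) + cross_entropy(transposed))
```

A well-aligned batch (large diagonal similarities) yields a loss near zero, while a misaligned one is heavily penalized; a uniform similarity matrix gives log(n) per direction.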

Code Repositories

- zengyan-97/x2-vlm (official, PyTorch)
- zengyan-97/x-vlm (PyTorch, mentioned in GitHub)

Benchmarks

Cross-modal retrieval (Recall@k):

| Benchmark | Model | I→T R@1 | I→T R@5 | I→T R@10 | T→I R@1 | T→I R@5 | T→I R@10 |
|---|---|---|---|---|---|---|---|
| cross-modal-retrieval-on-coco-2014 | X2-VLM (base) | 83.5 | 96.3 | 98.5 | 66.2 | 87.1 | 92.2 |
| cross-modal-retrieval-on-coco-2014 | X2-VLM (large) | 84.4 | 96.5 | 98.5 | 67.7 | 87.5 | 92.5 |
| cross-modal-retrieval-on-flickr30k | X2-VLM (base) | 98.5 | 100 | 100 | 90.4 | 98.2 | 99.3 |
| cross-modal-retrieval-on-flickr30k | X2-VLM (large) | 98.8 | 100 | 100 | 91.8 | 98.6 | 99.5 |

Video retrieval (Recall@k):

| Benchmark | Model | T→V R@1 | T→V R@5 | T→V R@10 |
|---|---|---|---|---|
| video-retrieval-on-msr-vtt-1ka | X2-VLM (base) | 47.6 | 74.1 | 84.2 |
| video-retrieval-on-msr-vtt-1ka | X2-VLM (large) | 49.6 | 76.7 | 84.2 |

Visual grounding on RefCOCO (Accuracy %):

| Model | val | testA | testB |
|---|---|---|---|
| X2-VLM (base) | 85.2 | 90.3 | 78.4 |
| X2-VLM (large) | 87.6 | 92.1 | 81.8 |

Visual question answering:

| Benchmark | Metric | X2-VLM (base) | X2-VLM (large) |
|---|---|---|---|
| visual-question-answering-on-msrvtt-qa-1 | Accuracy | 0.45 | 0.455 |
| visual-question-answering-on-msvd-qa-1 | Accuracy | 0.528 | 0.546 |
| visual-question-answering-on-vqa-v2-test-dev | Accuracy | 80.4 | 81.9 |
| visual-question-answering-on-vqa-v2-test-std | overall | 80.2 | 81.8 |

Visual reasoning on NLVR2 (Accuracy):

| Benchmark | X2-VLM (base) | X2-VLM (large) |
|---|---|---|
| visual-reasoning-on-nlvr2-dev | 86.2 | 88.7 |
| visual-reasoning-on-nlvr2-test | 87.0 | 89.4 |
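The R@k numbers above follow the standard retrieval protocol: a query counts as a hit if its ground-truth match ranks among the top k retrieved candidates. A minimal sketch of the metric, assuming (as is conventional) that query i's ground-truth item shares index i:

```python
def recall_at_k(sim, k):
    """Recall@k over a similarity matrix sim[i][j]: the fraction of
    queries i whose ground-truth item i appears in the top-k results."""
    hits = 0
    for i, row in enumerate(sim):
        # Rank candidate indices by similarity, highest first.
        ranked = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        hits += i in ranked[:k]
    return hits / len(sim)
```

By construction R@1 ≤ R@5 ≤ R@10, which matches the monotone pattern in every row of the tables above.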

