4 months ago

VALSE: A Task-Independent Benchmark for Vision and Language Models Centered on Linguistic Phenomena

Abstract

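We propose VALSE (Vision And Language Structured Evaluation), a novel benchmark designed to test general-purpose pretrained vision and language (V&L) models for their visio-linguistic grounding capabilities on specific linguistic phenomena. VALSE offers a suite of six tests covering a range of linguistic constructs. Solving these tests requires models to ground linguistic phenomena in the visual modality, which allows more fine-grained evaluation than was previously possible. We build VALSE using methods that support the construction of valid foils, and we report results from evaluating five widely used V&L models. Our experiments suggest that current models still have considerable difficulty with most of the phenomena. We therefore expect VALSE to serve as an important benchmark for measuring the future progress of pretrained V&L models from a linguistic perspective, complementing the existing task-centred V&L evaluations.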

Code Repositories

heidelberg-nlp/valse
Official
pytorch
Mentioned in GitHub

Benchmarks

A hyphen marks a metric that is not listed for that method. For image-sentence-alignment-on-valse, both columns are averages (Average Accuracy and average pairwise accuracy).

| Benchmark | Method | Accuracy (%) | Pairwise accuracy |
|---|---|---|---|
| image-sentence-alignment-on-valse | ViLBERT 12-in-1 | 63.2 | 75.1 |
| image-sentence-alignment-on-valse | LXMERT | 53.5 | 59.6 |
| image-sentence-alignment-on-valse | CLIP | - | 64.0 |
| image-sentence-alignment-on-valse | ViLBERT | 51.3 | 63.7 |
| image-sentence-alignment-on-valse | VisualBERT | 48.8 | 46.4 |
| image-sentence-alignment-on-valse | GPT1 | - | 60.7 |
| image-sentence-alignment-on-valse | GPT2 | - | 60.1 |
| image-sentence-alignment-on-valse-actant-swap | LXMERT | 48.5 | 45.8 |
| image-sentence-alignment-on-valse-actant-swap | CLIP | - | 68.6 |
| image-sentence-alignment-on-valse-actant-swap | ViLBERT 12-in-1 | 52.2 | 58.9 |
| image-sentence-alignment-on-valse-actant-swap | VisualBERT | 49.7 | 44.4 |
| image-sentence-alignment-on-valse-actant-swap | GPT2 | - | 76.9 |
| image-sentence-alignment-on-valse-actant-swap | ViLBERT | 50.4 | 68.3 |
| image-sentence-alignment-on-valse-actant-swap | GPT1 | - | 72.2 |
| image-sentence-alignment-on-valse-action | GPT2 | - | 66.8 |
| image-sentence-alignment-on-valse-action | VisualBERT | 48.8 | 49.2 |
| image-sentence-alignment-on-valse-action | GPT1 | - | 65.4 |
| image-sentence-alignment-on-valse-action | LXMERT | 51.1 | 54.8 |
| image-sentence-alignment-on-valse-action | ViLBERT | 52.6 | 70.7 |
| image-sentence-alignment-on-valse-action | CLIP | - | 75.6 |
| image-sentence-alignment-on-valse-action | ViLBERT 12-in-1 | 57.3 | 65.9 |
| image-sentence-alignment-on-valse-coreference | ViLBERT 12-in-1 | 54.4 | 75.7 |
| image-sentence-alignment-on-valse-coreference | CLIP | - | 52.1 |
| image-sentence-alignment-on-valse-coreference | LXMERT | 49.8 | 46.8 |
| image-sentence-alignment-on-valse-coreference | ViLBERT | 50.0 | 47.2 |
| image-sentence-alignment-on-valse-coreference | VisualBERT | 50.0 | 49.5 |
| image-sentence-alignment-on-valse-coreference | GPT1 | - | 45.6 |
| image-sentence-alignment-on-valse-coreference | GPT2 | - | 54.5 |
| image-sentence-alignment-on-valse-coreference-1 | VisualBERT | 50.0 | 47.6 |
| image-sentence-alignment-on-valse-coreference-1 | ViLBERT 12-in-1 | 54.3 | 69.2 |
| image-sentence-alignment-on-valse-coreference-1 | GPT1 | - | 45.2 |
| image-sentence-alignment-on-valse-coreference-1 | CLIP | - | 49.7 |
| image-sentence-alignment-on-valse-coreference-1 | GPT2 | - | 50.0 |
| image-sentence-alignment-on-valse-coreference-1 | LXMERT | 49.0 | 44.2 |
| image-sentence-alignment-on-valse-coreference-1 | ViLBERT | 50.0 | 48.1 |
| image-sentence-alignment-on-valse-counting | LXMERT | 52.0 | 62.2 |
| image-sentence-alignment-on-valse-counting | ViLBERT 12-in-1 | 64.9 | 76.7 |
| image-sentence-alignment-on-valse-counting | GPT2 | - | 51.6 |
| image-sentence-alignment-on-valse-counting | VisualBERT | 48.3 | 48.2 |
| image-sentence-alignment-on-valse-counting | CLIP | - | 62.1 |
| image-sentence-alignment-on-valse-counting | GPT1 | - | 51.2 |
| image-sentence-alignment-on-valse-counting | ViLBERT | 50.7 | 58.6 |
| image-sentence-alignment-on-valse-counting-1 | VisualBERT | 47.8 | 48.2 |
| image-sentence-alignment-on-valse-counting-1 | ViLBERT | 50.6 | 62.9 |
| image-sentence-alignment-on-valse-counting-1 | CLIP | - | 62.5 |
| image-sentence-alignment-on-valse-counting-1 | ViLBERT 12-in-1 | 69.2 | 80.2 |
| image-sentence-alignment-on-valse-counting-1 | GPT1 | - | 48.7 |
| image-sentence-alignment-on-valse-counting-1 | LXMERT | 55.4 | 69.2 |
| image-sentence-alignment-on-valse-counting-1 | GPT2 | - | 49.8 |
| image-sentence-alignment-on-valse-counting-2 | ViLBERT | 51.8 | 73.7 |
| image-sentence-alignment-on-valse-counting-2 | GPT1 | - | 69.5 |
| image-sentence-alignment-on-valse-counting-2 | CLIP | - | 57.5 |
| image-sentence-alignment-on-valse-counting-2 | GPT2 | - | 45.3 |
| image-sentence-alignment-on-valse-counting-2 | LXMERT | 49.9 | 42.6 |
| image-sentence-alignment-on-valse-counting-2 | VisualBERT | 50.0 | 50.0 |
| image-sentence-alignment-on-valse-counting-2 | ViLBERT 12-in-1 | 66.7 | 77.3 |
| image-sentence-alignment-on-valse-existence | VisualBERT | 49.3 | 39.7 |
| image-sentence-alignment-on-valse-existence | LXMERT | 55.8 | 78.6 |
| image-sentence-alignment-on-valse-existence | CLIP | - | 66.9 |
| image-sentence-alignment-on-valse-existence | ViLBERT 12-in-1 | 89.0 | 95.6 |
| image-sentence-alignment-on-valse-existence | ViLBERT | 2.4 | 66.5 |
| image-sentence-alignment-on-valse-existence | GPT1 | - | 61.8 |
| image-sentence-alignment-on-valse-existence | GPT2 | - | 58.0 |
| image-sentence-alignment-on-valse-foil-it | GPT2 | - | 80.7 |
| image-sentence-alignment-on-valse-foil-it | ViLBERT 12-in-1 | 71.5 | 86.9 |
| image-sentence-alignment-on-valse-foil-it | GPT1 | - | 77.5 |
| image-sentence-alignment-on-valse-foil-it | LXMERT | 70.8 | 87.1 |
| image-sentence-alignment-on-valse-foil-it | VisualBERT | 46.6 | 48.5 |
| image-sentence-alignment-on-valse-foil-it | ViLBERT | 55.9 | 86.9 |
| image-sentence-alignment-on-valse-foil-it | CLIP | - | 88.8 |
| image-sentence-alignment-on-valse-plurality | ViLBERT 12-in-1 | 62.0 | 72.4 |
| image-sentence-alignment-on-valse-plurality | LXMERT | 55.1 | 64.4 |
| image-sentence-alignment-on-valse-plurality | CLIP | - | 56.2 |
| image-sentence-alignment-on-valse-plurality | GPT1 | - | 53.1 |
| image-sentence-alignment-on-valse-plurality | ViLBERT | 50.3 | 61.2 |
| image-sentence-alignment-on-valse-plurality | VisualBERT | 46.5 | 45.7 |
| image-sentence-alignment-on-valse-plurality | GPT2 | - | 51.9 |
| image-sentence-alignment-on-valse-spatial | VisualBERT | 49.3 | 39.7 |
| image-sentence-alignment-on-valse-spatial | GPT2 | - | 75.0 |
| image-sentence-alignment-on-valse-spatial | CLIP | - | 64.3 |
| image-sentence-alignment-on-valse-spatial | LXMERT | 50.8 | 60.2 |
| image-sentence-alignment-on-valse-spatial | ViLBERT | 49.9 | 57.2 |
| image-sentence-alignment-on-valse-spatial | ViLBERT 12-in-1 | 53.4 | 67.7 |
| image-sentence-alignment-on-valse-spatial | GPT1 | - | 77.2 |
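
The two metrics in this listing can be illustrated with a small sketch. It rests on assumptions that this page does not state: that each test item pairs an image with one correct caption and one foil, that a model assigns each sentence an image-sentence alignment score, that pairwise accuracy is the share of items where the caption scores higher than its foil, and that accuracy treats captions and foils as separate binary decisions against a threshold. The function names, the 0.5 threshold, and the example scores are hypothetical and are not taken from the heidelberg-nlp/valse repository.

```python
from typing import Sequence


def pairwise_accuracy(caption_scores: Sequence[float],
                      foil_scores: Sequence[float]) -> float:
    """Share of (caption, foil) pairs where the caption outscores its foil."""
    if len(caption_scores) != len(foil_scores):
        raise ValueError("each caption needs exactly one foil")
    if not caption_scores:
        raise ValueError("at least one pair is required")
    wins = sum(c > f for c, f in zip(caption_scores, foil_scores))
    return wins / len(caption_scores)


def accuracy(caption_scores: Sequence[float],
             foil_scores: Sequence[float],
             threshold: float = 0.5) -> float:
    """Share of correct binary decisions over captions and foils together.

    Assumption: a caption counts as correct when its score reaches the
    threshold, and a foil counts as correct when its score stays below it.
    """
    total = len(caption_scores) + len(foil_scores)
    if total == 0:
        raise ValueError("at least one score is required")
    correct = sum(c >= threshold for c in caption_scores)
    correct += sum(f < threshold for f in foil_scores)
    return correct / total


# Hypothetical alignment scores for four image/caption/foil items.
captions = [0.9, 0.6, 0.4, 0.8]
foils = [0.7, 0.7, 0.3, 0.2]

print(pairwise_accuracy(captions, foils))  # 0.75
print(accuracy(captions, foils))           # 0.625
```

In this toy example the model ranks the caption above the foil in three of four items, yet it also accepts two foils and rejects one caption outright. Under these assumptions, that is one way a method can show a higher pairwise accuracy than accuracy.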
