
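Abstract
Contrastively trained vision-language models have made remarkable progress in vision and language representation learning, leading to state-of-the-art models for a range of downstream multimodal tasks. However, recent research has revealed severe limitations in these models' ability to reason compositionally over objects, attributes, and relations. Scene graphs have emerged as an effective way to understand images compositionally. A scene graph is a graph-structured semantic representation of an image that contains the objects in the scene, their attributes, and their relations to other objects. In this work, we treat the scene graph parsed from text as a proxy for the image scene graph and propose a graph decomposition and augmentation framework, together with a coarse-to-fine contrastive learning objective that aligns sentences of varying complexity to the same image. We also propose novel negative mining techniques in the scene graph space to improve attribute binding and relation understanding. Extensive experiments show that the approach significantly improves attribute binding, relation understanding, systematic generalization, and productivity on multiple recently proposed benchmarks (for example, gains of up to 18% in systematic generalization and 16.5% in relation understanding over a strong baseline), while matching or exceeding CLIP on a variety of general multimodal tasks.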
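To make the ideas of graph decomposition and scene-graph-space negatives more concrete, the following is a minimal sketch and not the authors' implementation. It assumes a scene graph has already been parsed from a caption and is written by hand here as a dictionary of objects with attributes plus a list of (subject, predicate, object) triples. The function names `decompose`, `to_text`, `swap_attributes`, and `swap_relation_args` are hypothetical, and the rules they apply are illustrative assumptions about what coarse-to-fine sub-graphs and hard negatives could look like.

```python
def decompose(objects, relations):
    """Return sub-graphs ordered from coarse (single objects) to fine (full graph)."""
    subgraphs = []
    # Coarsest level: each object alone, with its attributes.
    for name, attrs in objects.items():
        subgraphs.append(({name: attrs}, []))
    # Intermediate level: each relation with its two endpoint objects.
    for subj, pred, obj in relations:
        nodes = {subj: objects[subj], obj: objects[obj]}
        subgraphs.append((nodes, [(subj, pred, obj)]))
    # Finest level: the full graph.
    subgraphs.append((dict(objects), list(relations)))
    return subgraphs


def to_text(objects, relations):
    """Render a (sub-)graph as a simple phrase (toy template, assumed)."""
    def phrase(name):
        return " ".join([*objects[name], name])
    if relations:
        # Objects that take part in no relation are omitted in this toy template.
        return " and ".join(f"{phrase(s)} {p} {phrase(o)}" for s, p, o in relations)
    return " and ".join(phrase(n) for n in objects)


def swap_attributes(objects, relations):
    """Hard negative for attribute binding: exchange attributes across one relation."""
    for subj, _, obj in relations:
        if objects[subj] != objects[obj]:
            swapped = dict(objects)
            swapped[subj], swapped[obj] = objects[obj], objects[subj]
            return swapped, relations
    return None  # no relation with distinguishable attributes


def swap_relation_args(objects, relations):
    """Hard negative for relation understanding: reverse subject and object."""
    return objects, [(o, p, s) for s, p, o in relations]


# Hand-written graph for "brown dog on red sofa, red sofa near large window".
objects = {"dog": ("brown",), "sofa": ("red",), "window": ("large",)}
relations = [("dog", "on", "sofa"), ("sofa", "near", "window")]

texts = [to_text(*g) for g in decompose(objects, relations)]
pair = ({"dog": ("brown",), "sofa": ("red",)}, [("dog", "on", "sofa")])
attr_negative = to_text(*swap_attributes(*pair))
rel_negative = to_text(*swap_relation_args(*pair))
```

In a training setup along these lines, the texts of all sub-graphs would serve as positives of differing complexity for the same image, while the swapped variants would serve as hard negatives; how the actual objective weights or combines them is not specified in this abstract.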
Benchmarks
| Benchmark | Method | Recall@1 (HN-Atom, UC) | Recall@1 (HN-Comp, UC) |
|---|---|---|---|
| image-retrieval-on-crepe-vision-language | Swin-T (CLIP, CC-12M) | 37.3 | 44.1 |
| image-retrieval-on-crepe-vision-language | RN-50 (CLIP, CC-12M) | 36.7 | 42.9 |
| image-retrieval-on-crepe-vision-language | MosaiCLIP (CC-FT) | 40.9 | 72.4 |
| image-retrieval-on-crepe-vision-language | NegCLIP (YFCC-FT) | 39.0 | 38.8 |
| image-retrieval-on-crepe-vision-language | CLIP-FT (YFCC-FT) | 38.3 | 36.4 |
| image-retrieval-on-crepe-vision-language | CLIP-FT (CC-FT) | 35.6 | 45.8 |
| image-retrieval-on-crepe-vision-language | CLIP (YFCC-FT) | 39.5 | 39.8 |
| image-retrieval-on-crepe-vision-language | CLIP (CC-FT) | 35.0 | 45.1 |
| image-retrieval-on-crepe-vision-language | RN-50 (NegCLIP, CC-12M) | 41.4 | 82.0 |
| image-retrieval-on-crepe-vision-language | RN-50 (MosaiCLIP, CC-12M) | 44.4 | 92.6 |
| image-retrieval-on-crepe-vision-language | NegCLIP (CC-FT) | 37.5 | 53.1 |
| image-retrieval-on-crepe-vision-language | Swin-T (MosaiCLIP, CC-12M) | 44.5 | 92.1 |
| image-retrieval-on-crepe-vision-language | Swin-T (NegCLIP, CC-12M) | 39.6 | 80.3 |
| image-retrieval-on-crepe-vision-language | MosaiCLIP (YFCC-FT) | 41.5 | 48.8 |
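
For readers unfamiliar with the metric in the table, the sketch below shows the generic definition of Recall@1: the percentage of queries whose highest-scoring candidate is the correct one. It assumes each query comes with a list of similarity scores over its candidates (the correct item plus hard negatives) and the index of the correct item. The function name `recall_at_1` and the toy scores are hypothetical; the exact evaluation protocol behind the numbers above is not described in this summary.

```python
def recall_at_1(score_rows, positive_indices):
    """Percentage of queries whose top-scored candidate is the correct one."""
    hits = 0
    for scores, pos in zip(score_rows, positive_indices):
        # Index of the candidate with the highest similarity score.
        best = max(range(len(scores)), key=scores.__getitem__)
        hits += best == pos
    return 100.0 * hits / len(score_rows)


# Toy similarity scores: one row per query, one column per candidate.
scores = [
    [0.9, 0.2, 0.1],  # correct candidate 0 ranked first: hit
    [0.3, 0.5, 0.4],  # correct candidate 0 outscored by a negative: miss
    [0.1, 0.2, 0.8],  # correct candidate 2 ranked first: hit
    [0.6, 0.7, 0.1],  # correct candidate 1 ranked first: hit
]
positives = [0, 0, 2, 1]
result = recall_at_1(scores, positives)
```

With three of the four toy queries answered correctly, `result` is 75.0. Harder negatives lower this score whenever the model fails to tell a correct item from a minimally altered one.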