
Abstract
This paper presents a detailed study of improving visual representations for vision-language (VL) tasks and develops an improved object detection model to provide object-centric representations of images. Compared to the most widely used bottom-up and top-down model \cite{anderson2018bottom}, the new model is larger, better designed for VL tasks, and pre-trained on a much larger training corpus that combines multiple public annotated object detection datasets. As a result, it can generate representations of a richer collection of visual objects and concepts. While previous VL research has focused mainly on improving the vision-language fusion model and left the object detection model largely untouched, this paper shows that visual features matter significantly in VL models. In our experiments, we feed the visual features generated by the new object detection model into a Transformer-based VL fusion model, OSCAR \cite{li2020oscar}, and use an improved pre-training approach, OSCAR+, to pre-train the VL model before fine-tuning it on a wide range of downstream VL tasks. The results show that the new visual features significantly improve performance across all VL tasks, establishing new state-of-the-art (SOTA) results on seven public benchmarks. The new object detection model will be released publicly.
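To make the described pipeline concrete, below is a minimal, self-contained PyTorch sketch of an OSCAR-style fusion step: detected region features and their object-tag tokens are embedded alongside the text tokens and passed through a single Transformer encoder. The class name, hidden sizes, and the 2054-dimensional region features (2048-d detector features plus a 6-d box encoding, as commonly distributed with VinVL) are illustrative assumptions, not the released VinVL/OSCAR implementation.

```python
import torch
import torch.nn as nn

class ToyVLFusion(nn.Module):
    """Minimal OSCAR-style fusion: [text tokens; object tags; region features] -> Transformer encoder."""
    def __init__(self, vocab_size=30522, hidden=256, region_dim=2054, layers=2, heads=4):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, hidden)      # shared embedding for text and object-tag tokens
        self.region_proj = nn.Linear(region_dim, hidden)      # project detector features (+ box geometry)
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=hidden, nhead=heads, batch_first=True),
            num_layers=layers,
        )

    def forward(self, text_ids, tag_ids, region_feats):
        # text_ids: (B, Lt) token ids; tag_ids: (B, Lo) object-tag token ids;
        # region_feats: (B, Lo, region_dim) pooled detector features per region.
        x = torch.cat(
            [self.tok_emb(text_ids), self.tok_emb(tag_ids), self.region_proj(region_feats)],
            dim=1,
        )
        return self.encoder(x)  # (B, Lt + 2*Lo, hidden) contextualized sequence

# Toy usage with random inputs.
model = ToyVLFusion()
out = model(
    torch.randint(0, 30522, (2, 12)),   # caption/question tokens
    torch.randint(0, 30522, (2, 10)),   # object-tag tokens from the detector
    torch.randn(2, 10, 2054),           # region features from the detector
)
print(out.shape)  # torch.Size([2, 32, 256])
```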
Code Repositories

| Repository | Framework | Notes |
|---|---|---|
| microsoft/Oscar | PyTorch | Mentioned in GitHub |
| cattidea/VinVL-Paddle | Paddle | Mentioned in GitHub |
| mkhalil1998/EC601_Group_Project | PyTorch | Mentioned in GitHub |
| pzzhang/VinVL | — | Official; mentioned in GitHub |
| JoshuaPlacidi/MS-COCO-Object-Tags | — | Mentioned in GitHub |
| yaolinli/capenrich | PyTorch | Mentioned in GitHub |
| JoshuaPlacidi/MS-COCO-Tags | — | Mentioned in GitHub |
Benchmarks
| Benchmark | Method | Metrics |
|---|---|---|
| image-captioning-on-coco-captions | VinVL | BLEU-4: 41.0 CIDEr: 140.9 METEOR: 31.1 SPICE: 25.2 |
| image-captioning-on-nocaps-entire | VinVL (Microsoft Cognitive Services + MSR) | B1: 81.59 B2: 65.15 B3: 45.04 B4: 26.15 CIDEr: 92.46 METEOR: 27.57 ROUGE-L: 56.96 SPICE: 13.07 |
| image-captioning-on-nocaps-in-domain | VinVL (Microsoft Cognitive Services + MSR) | B1: 83.24 B2: 68.04 B3: 49.68 B4: 30.62 CIDEr: 97.99 METEOR: 29.51 ROUGE-L: 58.54 SPICE: 13.63 |
| image-captioning-on-nocaps-near-domain | VinVL (Microsoft Cognitive Services + MSR) | B1: 82.77 B2: 66.94 B3: 47.02 B4: 27.97 CIDEr: 95.16 METEOR: 28.24 ROUGE-L: 57.95 SPICE: 13.36 |
| image-captioning-on-nocaps-out-of-domain | VinVL (Microsoft Cognitive Services + MSR) | B1: 75.78 B2: 56.1 B3: 34.02 B4: 15.86 CIDEr: 78.01 METEOR: 23.55 ROUGE-L: 51.99 SPICE: 11.48 |
| image-captioning-on-nocaps-val-in-domain | VinVL | CIDEr: 103.1 Pre-train (#images): 5.7M SPICE: 14.2 |
| image-captioning-on-nocaps-val-near-domain | VinVL | CIDEr: 96.1 Pre-train (#images): 5.7M SPICE: 13.8 |
| image-captioning-on-nocaps-val-out-domain | VinVL | CIDEr: 88.3 Pre-train (#images): 5.7M SPICE: 12.1 |
| image-captioning-on-nocaps-val-overall | VinVL | CIDEr: 95.5 Pre-train (#images): 5.7M SPICE: 13.5 |
| image-text-matching-on-commercialadsdataset | VinVL | ADD(S) AUC: 88.56 |
| visual-question-answering-on-gqa-test2019 | Single Model | Accuracy: 64.65 Binary: 82.63 Consistency: 94.35 Distribution: 4.72 Open: 48.77 Plausibility: 84.98 Validity: 96.62 |
| visual-question-answering-on-vqa-v2-test-std | MSR + MS Cog. Svcs. | number: 61.5 other: 66.68 overall: 76.63 yes/no: 92.04 |
| visual-question-answering-on-vqa-v2-test-std | MSR + MS Cog. Svcs., X10 models | number: 62.55 other: 67.87 overall: 77.45 yes/no: 92.38 |
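For reference, captioning metrics such as the BLEU-4 and CIDEr entries above are typically computed with the COCO caption evaluation toolkit. A minimal sketch, assuming the `pycocoevalcap` package is installed and captions are already tokenized and lower-cased (the image ids and captions below are toy data):

```python
# pip install pycocoevalcap
from pycocoevalcap.bleu.bleu import Bleu
from pycocoevalcap.cider.cider import Cider

# Ground-truth references and generated captions, keyed by image id.
gts = {0: ["a man is riding a horse", "a person rides a brown horse"],
       1: ["two dogs play in the snow"]}
res = {0: ["a man riding a horse"],
       1: ["dogs playing in snow"]}

bleu_scores, _ = Bleu(4).compute_score(gts, res)   # returns BLEU-1..4
cider_score, _ = Cider().compute_score(gts, res)
print("BLEU-4:", bleu_scores[3], "CIDEr:", cider_score)
```

Note that CIDEr uses corpus-level IDF statistics, so scores on a toy two-image corpus like this are not comparable to the benchmark numbers in the table.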