
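Abstract
This paper designs and trains GIT, a Generative Image-to-text Transformer, to unify vision-language tasks such as image/video captioning and visual question answering. Generative models keep the same network architecture between pre-training and fine-tuning, but existing approaches are typically structurally complex (uni-modal or multi-modal encoders and decoders) and depend on external modules such as object detectors, taggers, and optical character recognition (OCR). GIT simplifies the architecture to one image encoder and one text decoder, trained under a single language modeling task. We also scale up the pre-training data and the model size to improve performance. Without any additional components, GIT sets a new state of the art on 12 challenging benchmarks by a large margin. For example, our model is the first to surpass human performance on TextCaps (138.2 vs. 125.5 in CIDEr). We further present a new generation-based scheme for image classification and scene text recognition, which performs well on standard benchmarks. The code is available at https://github.com/microsoft/GenerativeImage2Text.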
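The abstract names a single language modeling task but does not spell out the objective. As a rough illustration only, assume the usual formulation: the text decoder assigns a probability to each caption token given the image and the preceding tokens, and the loss is the average negative log-likelihood of the reference tokens. The function name and the toy probabilities below are hypothetical and are not taken from the GIT codebase.

```python
import math

def language_modeling_loss(token_probs):
    """Average negative log-likelihood of the reference caption tokens.

    token_probs[i] is the probability the decoder assigned to the i-th
    reference token, conditioned on the image and on tokens 0..i-1.
    (Assumed formulation; a real model would produce these values.)
    """
    if not token_probs:
        raise ValueError("need at least one token probability")
    if any(p <= 0.0 or p > 1.0 for p in token_probs):
        raise ValueError("probabilities must lie in (0, 1]")
    return -sum(math.log(p) for p in token_probs) / len(token_probs)

# Toy example: a three-token caption with made-up probabilities.
loss = language_modeling_loss([0.9, 0.5, 0.8])
print(f"loss = {loss:.4f}")
```

Under this assumption the same objective serves both pre-training and fine-tuning, which is the consistency the abstract points to.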
Code repository
microsoft/GenerativeImage2Text (official, PyTorch)
Benchmarks
| Benchmark | Method | Metrics |
|---|---|---|
| image-captioning-on-coco-captions | GIT | BLEU-4: 44.1 CIDEr: 151.1 METEOR: 32.2 SPICE: 26.3 |
| image-captioning-on-nocaps-entire | GIT, Single Model | B1: 88.1 B2: 74.81 B3: 57.68 B4: 37.35 CIDEr: 123.39 METEOR: 32.5 ROUGE-L: 63.12 SPICE: 15.94 |
| image-captioning-on-nocaps-in-domain | GIT2, Single Model | B1: 88.86 B2: 75.86 B3: 59.94 B4: 41.1 CIDEr: 124.18 METEOR: 33.83 ROUGE-L: 63.82 SPICE: 16.36 |
| image-captioning-on-nocaps-in-domain | GIT, Single Model | B1: 88.55 B2: 76.1 B3: 60.53 B4: 41.65 CIDEr: 122.4 METEOR: 33.41 ROUGE-L: 64.02 SPICE: 16.18 |
| image-captioning-on-nocaps-near-domain | GIT2, Single Model | B1: 88.9 B2: 75.86 B3: 58.9 B4: 38.95 CIDEr: 125.51 METEOR: 32.95 ROUGE-L: 63.66 SPICE: 16.11 |
| image-captioning-on-nocaps-near-domain | GIT, Single Model | B1: 88.56 B2: 75.48 B3: 58.46 B4: 38.44 CIDEr: 123.92 METEOR: 32.86 ROUGE-L: 63.5 SPICE: 15.96 |
| image-captioning-on-nocaps-out-of-domain | GIT2, Single Model | B1: 86.28 B2: 71.15 B3: 52.36 B4: 30.15 CIDEr: 122.27 METEOR: 30.15 ROUGE-L: 60.91 SPICE: 15.62 |
| image-captioning-on-nocaps-out-of-domain | GIT, Single Model | B1: 85.99 B2: 71.28 B3: 52.66 B4: 30.04 CIDEr: 122.04 METEOR: 30.45 ROUGE-L: 60.96 SPICE: 15.7 |
| image-captioning-on-nocaps-xd-entire | GIT | B1: 88.1 B2: 74.81 B3: 57.68 B4: 37.35 CIDEr: 123.39 METEOR: 32.5 ROUGE-L: 63.12 SPICE: 15.94 |
| image-captioning-on-nocaps-xd-entire | GIT2 | B1: 88.43 B2: 75.02 B3: 57.87 B4: 37.65 CIDEr: 124.77 METEOR: 32.56 ROUGE-L: 63.19 SPICE: 16.06 |
| image-captioning-on-nocaps-xd-in-domain | GIT2 | B1: 88.86 B2: 75.86 B3: 59.94 B4: 41.1 CIDEr: 124.18 METEOR: 33.83 ROUGE-L: 63.82 SPICE: 16.36 |
| image-captioning-on-nocaps-xd-in-domain | GIT | B1: 88.55 B2: 76.1 B3: 60.53 B4: 41.65 CIDEr: 122.4 METEOR: 33.41 ROUGE-L: 64.02 SPICE: 16.18 |
| image-captioning-on-nocaps-xd-near-domain | GIT2 | B1: 88.9 B2: 75.86 B3: 58.9 B4: 38.95 CIDEr: 125.51 METEOR: 32.95 ROUGE-L: 63.66 SPICE: 16.11 |
| image-captioning-on-nocaps-xd-near-domain | GIT | B1: 88.56 B2: 75.48 B3: 58.46 B4: 38.44 CIDEr: 123.92 METEOR: 32.86 ROUGE-L: 63.5 SPICE: 15.96 |
| image-captioning-on-nocaps-xd-out-of-domain | GIT2 | B1: 86.28 B2: 71.15 B3: 52.36 B4: 30.15 CIDEr: 122.27 METEOR: 30.15 ROUGE-L: 60.91 SPICE: 15.62 |
| image-captioning-on-nocaps-xd-out-of-domain | GIT | B1: 85.99 B2: 71.28 B3: 52.66 B4: 30.04 CIDEr: 122.04 METEOR: 30.45 ROUGE-L: 60.96 SPICE: 15.7 |
| video-captioning-on-msr-vtt-1 | GIT2 | BLEU-4: 54.8 CIDEr: 75.9 GS: 201.6 METEOR: 33.1 ROUGE-L: 68.2 |
| visual-question-answering-on-msvd-qa-1 | GIT | Accuracy: 0.568 |
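
Each row of the table packs several `name: value` pairs into one metrics cell, which makes the variants hard to compare by eye. The sketch below shows one way to split such a cell into a dictionary and compare GIT with GIT2. The helper name `parse_metrics` is hypothetical, and the two input strings are copied from the nocaps-xd-entire rows above.

```python
import re

# Matches "NAME: NUMBER" pairs such as "BLEU-4: 44.1" or "ROUGE-L: 63.12".
_METRIC = re.compile(r"([A-Za-z][\w-]*):\s*(\d+(?:\.\d+)?)")

def parse_metrics(cell):
    """Turn a metrics cell into a {metric name: float value} dictionary."""
    return {name: float(value) for name, value in _METRIC.findall(cell)}

# Metrics cells copied from the nocaps-xd-entire rows of the table.
git = parse_metrics(
    "B1: 88.1 B2: 74.81 B3: 57.68 B4: 37.35 CIDEr: 123.39 "
    "METEOR: 32.5 ROUGE-L: 63.12 SPICE: 15.94"
)
git2 = parse_metrics(
    "B1: 88.43 B2: 75.02 B3: 57.87 B4: 37.65 CIDEr: 124.77 "
    "METEOR: 32.56 ROUGE-L: 63.19 SPICE: 16.06"
)

# Per-metric difference (GIT2 minus GIT), rounded to two decimals.
for name in git:
    print(f"{name}: {round(git2[name] - git[name], 2):+}")
```

Metric names are matched exactly as written, so a cell spelled `CIDER` would produce a different key from one spelled `CIDEr`.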