GIT: A Generative Image-to-text Transformer for Vision and Language

Abstract

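In this paper, we design and train GIT, a Generative Image-to-text Transformer, to unify vision-language tasks such as image/video captioning and visual question answering. Although generative models keep the network architecture consistent between pre-training and fine-tuning, existing work typically relies on complex structures (uni-modal or multi-modal encoders and decoders) and on external modules such as object detectors, taggers, and optical character recognition (OCR). In GIT, we simplify the architecture to one image encoder and one text decoder, trained under a single language modeling task. We also scale up the pre-training data and the model size to boost performance. Without additional complex components, GIT establishes a new state of the art on 12 challenging benchmarks by a large margin. For instance, our model surpasses human performance on TextCaps for the first time (138.2 vs. 125.5 in CIDEr). Furthermore, we present a new scheme for generation-based image classification and scene text recognition, which achieves decent performance on standard benchmarks. The code is released at https://github.com/microsoft/GenerativeImage2Text.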
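The "single language modeling task" in the abstract can be read as next-token cross-entropy averaged over the caption tokens, with each prediction conditioned on the image and on the preceding tokens. The following is a minimal sketch in plain Python under that reading. It assumes the text decoder has already produced one row of vocabulary logits per caption position; the function names, the 4-token vocabulary, and the toy numbers are illustrative and are not taken from the GenerativeImage2Text codebase.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over one row of vocabulary logits."""
    peak = max(logits)
    log_norm = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return [x - log_norm for x in logits]


def language_modeling_loss(step_logits, target_ids):
    """Mean next-token cross-entropy over the caption positions.

    step_logits[i] stands for the (hypothetical) decoder output at position i,
    computed from the image features and the caption tokens before position i.
    target_ids[i] is the id of the token that should appear at position i.
    """
    if len(step_logits) != len(target_ids):
        raise ValueError("need one row of logits per target token")
    if not target_ids:
        raise ValueError("caption must contain at least one token")
    total = 0.0
    for logits, target in zip(step_logits, target_ids):
        total -= log_softmax(logits)[target]
    return total / len(target_ids)


# Toy example (assumed values): vocabulary of 4 tokens, caption of 3 tokens.
uniform = [[0.0, 0.0, 0.0, 0.0]] * 3
confident = [[5.0, 0.0, 0.0, 0.0], [0.0, 5.0, 0.0, 0.0], [0.0, 0.0, 5.0, 0.0]]
targets = [0, 1, 2]

uniform_loss = language_modeling_loss(uniform, targets)      # equals log(4)
confident_loss = language_modeling_loss(confident, targets)  # smaller than uniform_loss
```

A decoder that spreads probability evenly over the vocabulary pays log(V) per token, while one that concentrates mass on the correct next token drives the loss toward zero. Under this reading, the same objective would serve captioning, question answering, and the generation-based classification scheme, because each of them is framed as producing a text sequence.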

Code repository

microsoft/GenerativeImage2Text (official, PyTorch)

Benchmarks

Benchmark | Method | Metrics
image-captioning-on-coco-captions | GIT | BLEU-4: 44.1, CIDEr: 151.1, METEOR: 32.2, SPICE: 26.3
image-captioning-on-nocaps-entire | GIT, Single Model | B1: 88.1, B2: 74.81, B3: 57.68, B4: 37.35, CIDEr: 123.39, METEOR: 32.5, ROUGE-L: 63.12, SPICE: 15.94
image-captioning-on-nocaps-in-domain | GIT2, Single Model | B1: 88.86, B2: 75.86, B3: 59.94, B4: 41.1, CIDEr: 124.18, METEOR: 33.83, ROUGE-L: 63.82, SPICE: 16.36
image-captioning-on-nocaps-in-domain | GIT, Single Model | B1: 88.55, B2: 76.1, B3: 60.53, B4: 41.65, CIDEr: 122.4, METEOR: 33.41, ROUGE-L: 64.02, SPICE: 16.18
image-captioning-on-nocaps-near-domain | GIT2, Single Model | B1: 88.9, B2: 75.86, B3: 58.9, B4: 38.95, CIDEr: 125.51, METEOR: 32.95, ROUGE-L: 63.66, SPICE: 16.11
image-captioning-on-nocaps-near-domain | GIT, Single Model | B1: 88.56, B2: 75.48, B3: 58.46, B4: 38.44, CIDEr: 123.92, METEOR: 32.86, ROUGE-L: 63.5, SPICE: 15.96
image-captioning-on-nocaps-out-of-domain | GIT2, Single Model | B1: 86.28, B2: 71.15, B3: 52.36, B4: 30.15, CIDEr: 122.27, METEOR: 30.15, ROUGE-L: 60.91, SPICE: 15.62
image-captioning-on-nocaps-out-of-domain | GIT, Single Model | B1: 85.99, B2: 71.28, B3: 52.66, B4: 30.04, CIDEr: 122.04, METEOR: 30.45, ROUGE-L: 60.96, SPICE: 15.7
image-captioning-on-nocaps-xd-entire | GIT | B1: 88.1, B2: 74.81, B3: 57.68, B4: 37.35, CIDEr: 123.39, METEOR: 32.5, ROUGE-L: 63.12, SPICE: 15.94
image-captioning-on-nocaps-xd-entire | GIT2 | B1: 88.43, B2: 75.02, B3: 57.87, B4: 37.65, CIDEr: 124.77, METEOR: 32.56, ROUGE-L: 63.19, SPICE: 16.06
image-captioning-on-nocaps-xd-in-domain | GIT2 | B1: 88.86, B2: 75.86, B3: 59.94, B4: 41.1, CIDEr: 124.18, METEOR: 33.83, ROUGE-L: 63.82, SPICE: 16.36
image-captioning-on-nocaps-xd-in-domain | GIT | B1: 88.55, B2: 76.1, B3: 60.53, B4: 41.65, CIDEr: 122.4, METEOR: 33.41, ROUGE-L: 64.02, SPICE: 16.18
image-captioning-on-nocaps-xd-near-domain | GIT2 | B1: 88.9, B2: 75.86, B3: 58.9, B4: 38.95, CIDEr: 125.51, METEOR: 32.95, ROUGE-L: 63.66, SPICE: 16.11
image-captioning-on-nocaps-xd-near-domain | GIT | B1: 88.56, B2: 75.48, B3: 58.46, B4: 38.44, CIDEr: 123.92, METEOR: 32.86, ROUGE-L: 63.5, SPICE: 15.96
image-captioning-on-nocaps-xd-out-of-domain | GIT2 | B1: 86.28, B2: 71.15, B3: 52.36, B4: 30.15, CIDEr: 122.27, METEOR: 30.15, ROUGE-L: 60.91, SPICE: 15.62
image-captioning-on-nocaps-xd-out-of-domain | GIT | B1: 85.99, B2: 71.28, B3: 52.66, B4: 30.04, CIDEr: 122.04, METEOR: 30.45, ROUGE-L: 60.96, SPICE: 15.7
video-captioning-on-msr-vtt-1 | GIT2 | BLEU-4: 54.8, CIDEr: 75.9, GS: 201.6, METEOR: 33.1, ROUGE-L: 68.2
visual-question-answering-on-msvd-qa-1 | GIT | Accuracy: 0.568
