
PaLI: A Jointly-Scaled Multilingual Language-Image Model

Abstract

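Effective scaling and a flexible task interface enable large language models to excel at many tasks. We present PaLI (Pathways Language and Image model), a model that extends this approach to the joint modeling of language and vision. PaLI generates text based on visual and textual inputs, and with this interface performs many vision, language, and multimodal tasks, in many languages. To train PaLI, we make use of large pre-trained encoder-decoder language models and Vision Transformers (ViTs). This allows us to capitalize on their existing capabilities and amortize the substantial cost of training them. We find that joint scaling of the vision and language components is important. Since existing Transformers for language are much larger than their vision counterparts, we train a large, 4-billion-parameter ViT (ViT-e) to quantify the benefits of even larger-capacity vision models. To train PaLI, we create a new image-text training set containing 10 billion images and texts in over 100 languages, and build a large multilingual mix of pre-training tasks on top of it. PaLI achieves state-of-the-art results on multiple vision and language tasks (such as image captioning, visual question answering, and scene-text understanding), while retaining a simple, modular, and scalable design.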
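The interface described in the abstract (image plus text in, text out) can be pictured as one encoder input sequence made of visual tokens and text tokens. The following is a minimal sketch under stated assumptions, not the big_vision implementation: it assumes the ViT's visual tokens are placed before the text prompt tokens, and the names `Token`, `num_patches`, and `build_encoder_input`, as well as the 224-pixel image size and patch size of 14, are hypothetical choices made only for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Token:
    """One position in the encoder input; `source` is 'image' or 'text'."""
    source: str
    index: int


def num_patches(image_size: int, patch_size: int) -> int:
    """Number of non-overlapping square patches a ViT-style model would see."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")
    per_side = image_size // patch_size
    return per_side * per_side


def build_encoder_input(image_size: int, patch_size: int,
                        prompt: List[str]) -> List[Token]:
    """Assumed layout: visual tokens first, then the text prompt tokens."""
    visual = [Token("image", i) for i in range(num_patches(image_size, patch_size))]
    text = [Token("text", i) for i in range(len(prompt))]
    return visual + text


# Hypothetical sizes, chosen for illustration only.
sequence = build_encoder_input(image_size=224, patch_size=14,
                               prompt=["what", "is", "this", "?"])
print(len(sequence))  # 16 * 16 = 256 visual tokens plus 4 text tokens = 260
```

In this picture, a text decoder would then generate the answer or caption conditioned on that combined sequence, which is why a single model can serve captioning, question answering, and text-only tasks by changing only the prompt.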

Code Repositories

google-research/big_vision
Official
JAX
Mentioned in GitHub

Benchmarks

Benchmark | Method | Metrics
image-captioning-on-nocaps-in-domain | PaLI | CIDEr: 149.1
image-captioning-on-nocaps-in-domain | PaLI | B1: 88.02, B2: 75.21, B3: 59.38, B4: 41.16, CIDEr: 121.09, METEOR: 34.22, ROUGE-L: 64.39, SPICE: 15.69
image-captioning-on-nocaps-near-domain | PaLI | SPICE: 15.75
image-captioning-on-nocaps-near-domain | PaLI | B1: 88.57, B2: 75.56, B3: 58.99, B4: 39.98, CIDEr: 124.35, METEOR: 33.47, ROUGE-L: 63.99, SPICE: 15.75
image-captioning-on-nocaps-out-of-domain | PaLI | B1: 86.28, B2: 71.19, B3: 52.63, B4: 32.0, CIDEr: 126.67, METEOR: 30.99, ROUGE-L: 61.35, SPICE: 15.49
image-classification-on-imagenet-v2 | ViT-e | Top 1 Accuracy: 84.3
image-classification-on-objectnet | ViT-e | Top-1 Accuracy: 72.0
visual-question-answering-on-ok-vqa | PaLI 17B | Accuracy: 64.5
visual-question-answering-on-textvqa-test-1 | PaLI | overall: 73.1
visual-question-answering-on-vizwiz-2020-vqa | PaLI | overall: 73.3
visual-question-answering-on-vqa-v2-test-dev | PaLI | Accuracy: 84.3
zero-shot-transfer-image-classification-on-1 | LiT ViT-e | Accuracy (Private): 85.4
zero-shot-transfer-image-classification-on-1 | PaLI | Accuracy (Private): 72.11
zero-shot-transfer-image-classification-on-3 | PaLI | Accuracy (Private): 64.46
zero-shot-transfer-image-classification-on-3 | LiT ViT-e | Accuracy (Private): 80.6
zero-shot-transfer-image-classification-on-4 | PaLI | Accuracy: 81.97
zero-shot-transfer-image-classification-on-4 | LiT ViT-e | Accuracy: 96.1
zero-shot-transfer-image-classification-on-5 | LiT ViT-e | Accuracy (Private): 88.0
zero-shot-transfer-image-classification-on-5 | PaLI | Accuracy (Private): 44.7
zero-shot-transfer-image-classification-on-6 | PaLI | Accuracy (Private): 42.62, Top 5 Accuracy: 58.35
zero-shot-transfer-image-classification-on-6 | LiT ViT-e | Accuracy (Private): 84.9
zero-shot-transfer-image-classification-on-9 | PaLI | Accuracy (Private): 63.83, Top 5 Accuracy: 79.3
