Xi Chen; Josip Djolonga; Piotr Padlewski; Basil Mustafa; Soravit Changpinyo; Jialin Wu; Carlos Riquelme Ruiz; Sebastian Goodman; Xiao Wang; Yi Tay; Siamak Shakeri; Mostafa Dehghani; Daniel Salz; Mario Lucic; Michael Tschannen; Arsha Nagrani; Hexiang Hu; Mandar Joshi; Bo Pang; Ceslee Montgomery; Paulina Pietrzyk; Marvin Ritter; AJ Piergiovanni; Matthias Minderer; Filip Pavetic; Austin Waters; Gang Li; Ibrahim Alabdulmohsin; Lucas Beyer; Julien Amelot; Kenton Lee; Andreas Peter Steiner; Yang Li; Daniel Keysers; Anurag Arnab; Yuanzhong Xu; Keran Rong; Alexander Kolesnikov; Mojtaba Seyedhosseini; Anelia Angelova; Xiaohua Zhai; Neil Houlsby; Radu Soricut

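Abstract
We present the training recipe for PaLI-X, a multilingual vision and language model, and the results of scaling it up in both the size of its components and the breadth of its training task mixture. The model reaches new levels of performance on a wide range of complex tasks, including image-based captioning and question answering, image-based document understanding, few-shot (in-context) learning, object detection, video question answering, and video captioning. PaLI-X achieves state-of-the-art results on most of the vision-and-language benchmarks considered (more than 25). Finally, we observe emerging capabilities, such as complex counting and multilingual object detection, which are tasks not explicitly included in the training mixture.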
Code Repositories
doc-doc/NExT-OE
pytorch
Mentioned on GitHub
kyegomez/PALI
pytorch
Benchmarks
| Benchmark | Method | Metric |
|---|---|---|
| chart-question-answering-on-chartqa | PaLI-X (Single-task FT) | 1:1 Accuracy: 70.9 |
| chart-question-answering-on-chartqa | PaLI-X (Multi-task FT) | 1:1 Accuracy: 70.6 |
| chart-question-answering-on-chartqa | PaLI-X (Single-task FT w/ OCR) | 1:1 Accuracy: 72.3 |
| fine-grained-image-recognition-on-oven | PaLI-X | Accuracy: 23.1 |
| temporal-casual-qa-on-next-qa | PaLI-X | WUPS: 38.3 |
| visual-question-answering-on-docvqa-test | PaLI-X (Single-task FT w/ OCR) | ANLS: 0.868 |
| visual-question-answering-on-docvqa-test | PaLI-X (Single-task FT) | ANLS: 0.80 |
| visual-question-answering-on-docvqa-test | PaLI-X (Multi-task FT) | ANLS: 0.809 |
| visual-question-answering-on-ok-vqa | PaLI-X (Single-task FT) | Accuracy: 66.1 |
| visual-question-answering-vqa-on | PaLI-X (Single-task FT) | ANLS: 49.2 |
| visual-question-answering-vqa-on | PaLI-X (Multi-task FT) | ANLS: 50.7 |
| visual-question-answering-vqa-on | PaLI-X (Single-task FT w/ OCR) | ANLS: 54.8 |
| visual-question-answering-vqa-on-infoseek | PaLI-X | Accuracy: 24 |