Jean-Baptiste Alayrac; Jeff Donahue; Pauline Luc; Antoine Miech; Iain Barr; Yana Hasson; Karel Lenc; Arthur Mensch; Katie Millican; Malcolm Reynolds; Roman Ring; Eliza Rutherford; Serkan Cabi; Tengda Han; Zhitao Gong; Sina Samangooei; Marianne Monteiro; Jacob Menick; Sebastian Borgeaud; Andrew Brock; Aida Nematzadeh; Sahand Sharifzadeh; Mikolaj Binkowski; Ricardo Barreira; Oriol Vinyals; Andrew Zisserman; Karen Simonyan

Abstract
Building models that can be rapidly adapted to novel tasks using only a handful of annotated examples is an open challenge for multimodal machine learning research. We introduce Flamingo, a family of Visual Language Models (VLMs) with this ability. We propose key architectural innovations to: (i) bridge powerful pretrained vision-only and language-only models, (ii) handle sequences of arbitrarily interleaved visual and textual data, and (iii) seamlessly ingest images or videos as input. Thanks to their flexibility, Flamingo models can be trained on large-scale multimodal web corpora containing arbitrarily interleaved text and images, which is key to endowing them with in-context few-shot learning capabilities. We perform a thorough evaluation of these models, exploring and measuring their ability to rapidly adapt to a variety of image and video tasks. These include open-ended tasks such as visual question answering, where the model is prompted with a question it has to answer; captioning tasks, which evaluate the ability to describe a scene or an event; and close-ended tasks such as multiple-choice visual question answering. For tasks lying anywhere on this spectrum, a single Flamingo model can achieve a new state of the art with few-shot learning, simply by prompting the model with task-specific examples. On numerous benchmarks, Flamingo outperforms models fine-tuned on thousands of times more task-specific data.
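As a rough illustration of the in-context few-shot setup described in the abstract, the sketch below assembles an interleaved image/text prompt for a Flamingo-style VLM. This is not the authors' code: the `<image>` placeholder token, the `Question:`/`Answer:` template, the `Shot` helper, and the file names are assumptions made for illustration only; in practice the images are supplied as pixel inputs alongside the tokenized text, in the same order as the markers.

```python
# Minimal sketch of building an interleaved few-shot prompt for a
# Flamingo-style VLM. All names and the "<image>" placeholder are
# illustrative assumptions, not the original implementation.

from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Shot:
    image_path: str          # path to the support image (placeholder only)
    question: str
    answer: Optional[str]    # None for the final query shot


def build_interleaved_prompt(shots: List[Shot]) -> str:
    """Interleave <image> markers with question/answer text, one block per shot."""
    blocks = []
    for shot in shots:
        block = f"<image>Question: {shot.question} Answer:"
        if shot.answer is not None:
            block += f" {shot.answer}"
        blocks.append(block)
    return "\n".join(blocks)


if __name__ == "__main__":
    few_shot_examples = [
        Shot("support_1.jpg", "What animal is shown?", "a flamingo"),
        Shot("support_2.jpg", "What colour is the car?", "red"),
        Shot("query.jpg", "How many people are visible?", None),  # the model completes this
    ]
    prompt = build_interleaved_prompt(few_shot_examples)
    print(prompt)
    # The text prompt is paired with the images (in the same order as the
    # <image> markers) and fed to the VLM, which generates the missing answer.
```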
Code Repositories
- doc-doc/NExT-OE (pytorch) · Mentioned in GitHub
- lucidrains/flamingo-pytorch (pytorch)
- happen2me/cross-gnn (pytorch) · Mentioned in GitHub
- unispac/visual-adversarial-examples-jailbreak-large-language-models (pytorch) · Mentioned in GitHub
- mlfoundations/open_flamingo (pytorch) · Mentioned in GitHub
Benchmarks
| Benchmark | Method | Metrics |
|---|---|---|
| action-recognition-on-rareact | - | mWAP: 60.8 |
| generative-visual-question-answering-on-pmc | Open-Flamingo | BLEU-1: 4.1 |
| meme-classification-on-hateful-memes | Flamingo (few-shot:32) | ROC-AUC: 0.700 |
| meme-classification-on-hateful-memes | Flamingo (fine-tuned) | ROC-AUC: 0.866 |
| temporal-casual-qa-on-next-qa | Flamingo (0-shot) | WUPS: 26.7 |
| temporal-casual-qa-on-next-qa | Flamingo (32-shot) | WUPS: 33.5 |
| video-question-answering-on-situated | Flamingo-9B (4-shot) | Average Accuracy: 42.8 |
| video-question-answering-on-situated | Flamingo-80B (0-shot) | Average Accuracy: 39.7 |
| video-question-answering-on-situated | Flamingo-9B (0-shot) | Average Accuracy: 41.8 |
| video-question-answering-on-situated | Flamingo-80B (4-shot) | Average Accuracy: 42.4 |
| visual-question-answering-on-msrvtt-qa-1 | Flamingo (32-shot) | Accuracy: 0.310 |
| visual-question-answering-on-msrvtt-qa-1 | Flamingo (0-shot) | Accuracy: 0.174 |
| visual-question-answering-on-msrvtt-qa-1 | Flamingo | Accuracy: 0.474 |
| visual-question-answering-on-ok-vqa | Flamingo3B | Accuracy: 41.2 |
| visual-question-answering-on-ok-vqa | Flamingo9B | Accuracy: 44.7 |
| visual-question-answering-on-ok-vqa | Flamingo80B | Accuracy: 50.6 |
| visual-question-answering-on-vqa-v2-test-dev | Flamingo 80B | Accuracy: 56.3 |
| visual-question-answering-on-vqa-v2-test-dev | Flamingo 3B | Accuracy: 49.2 |
| visual-question-answering-on-vqa-v2-test-dev | Flamingo 9B | Accuracy: 51.8 |
| visual-question-answering-vqa-on-pmc-vqa | Open-Flamingo | Accuracy: 26.4 |
| zero-shot-cross-modal-retrieval-on-coco-2014 | Flamingo | Image-to-text R@1: 65.9, R@5: 87.3, R@10: 92.9; Text-to-image R@1: 48.0, R@5: 73.3, R@10: 82.1 |
| zero-shot-cross-modal-retrieval-on-flickr30k | Flamingo | Image-to-text R@1: 89.3, R@5: 98.8, R@10: 99.7; Text-to-image R@1: 79.5, R@5: 95.3, R@10: 97.9 |
| zero-shot-video-question-answer-on-star | Flamingo-9B | Accuracy: 41.8 |