Nitzan Bitton-Guetta, Yonatan Bitton, Jack Hessel, Ludwig Schmidt, Yuval Elovici, Gabriel Stanovsky, Roy Schwartz

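Abstract
Weird, unusual, and uncanny images pique the curiosity of observers because they challenge commonsense. For example, an image released during the 2022 World Cup depicts the famous soccer stars Lionel Messi and Cristiano Ronaldo playing chess, which playfully violates the expectation that their competition should take place on the football field. Humans can easily recognize and interpret such unconventional images, but can AI models do the same? To find out, we introduce WHOOPS!, a new dataset and benchmark for visual commonsense reasoning. The dataset consists of purposefully commonsense-defying images created by designers using publicly available image generation tools such as Midjourney, and is intended to test the reasoning abilities of models. We pose several tasks over the dataset, including image captioning, cross-modal matching, and visual question answering. In addition, we introduce a challenging explanation generation task, in which models must identify and explain why a given image is unusual. Our results show that state-of-the-art models such as GPT-3 and BLIP2 still lag significantly behind human performance on WHOOPS!. We hope this dataset will drive the development of AI models with stronger visual commonsense reasoning abilities. Data, models, and code are openly available at the project website: whoops-benchmark.github.io.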
Benchmarks
| Benchmark | Method | Metrics |
|---|---|---|
| explanation-generation-on-whoops | Predicted Caption -> GPT3 | Human (%): 33 |
| explanation-generation-on-whoops | BLIP2 FlanT5-XL (Fine-tuned) | Human (%): 15 |
| explanation-generation-on-whoops | BLIP2 FlanT5-XXL (Fine-tuned) | Human (%): 27 |
| explanation-generation-on-whoops | Ground-truth Caption -> GPT3 (Oracle) | Human (%): 68 |
| explanation-generation-on-whoops | BLIP2 FlanT5-XXL (Zero-shot) | Human (%): 0 |
| image-captioning-on-whoops | OFA Large | BLEU-4: 0, CIDEr: 0 |
| image-captioning-on-whoops | BLIP2 FlanT5-XXL (Fine-tuned) | BLEU-4: 42, CIDEr: 177 |
| image-captioning-on-whoops | CoCa ViT-L-14 MSCOCO | BLEU-4: 25, CIDEr: 102 |
| image-captioning-on-whoops | BLIP2 FlanT5-XXL (Zero-shot) | BLEU-4: 31, CIDEr: 120 |
| image-captioning-on-whoops | BLIP Large | BLEU-4: 13, CIDEr: 65 |
| image-captioning-on-whoops | BLIP2 FlanT5-XL (Fine-tuned) | BLEU-4: 41, CIDEr: 174 |
| image-to-text-retrieval-on-whoops | BLIP2 FlanT5-XXL (Text-only FT) | Specificity: 94 |
| image-to-text-retrieval-on-whoops | BLIP2 FlanT5-XL (Fine-tuned) | Specificity: 81 |
| image-to-text-retrieval-on-whoops | CoCa ViT-L-14 MSCOCO | Specificity: 72 |
| image-to-text-retrieval-on-whoops | BLIP2 FlanT5-XXL (Zero-shot) | Specificity: 71 |
| image-to-text-retrieval-on-whoops | BLIP2 FlanT5-XXL (Fine-tuned) | Specificity: 84 |
| image-to-text-retrieval-on-whoops | BLIP Large | Specificity: 77 |
| image-to-text-retrieval-on-whoops | CLIP ViT-L/14 | Specificity: 70 |
| visual-question-answering-vqa-on-whoops | BLIP2 FlanT5-XXL (Text-only FT) | BEM: 24, Exact Match: 4 |
| visual-question-answering-vqa-on-whoops | BLIP2 FlanT5-XL (Fine-tuned) | BEM: 55, Exact Match: 20 |
| visual-question-answering-vqa-on-whoops | OFA Large | BEM: 38, Exact Match: 8 |
| visual-question-answering-vqa-on-whoops | BLIP Large | BEM: 39, Exact Match: 6 |
| visual-question-answering-vqa-on-whoops | BLIP2 FlanT5-XXL (Zero-shot) | BEM: 55, Exact Match: 15 |
| visual-question-answering-vqa-on-whoops | BLIP2 FlanT5-XXL (Fine-tuned) | BEM: 57, Exact Match: 21 |
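
As an illustration of how an Exact Match percentage like the ones in the visual question answering rows could be computed, here is a minimal sketch. It is not the authors' evaluation script: the normalization steps (lowercasing, stripping punctuation and English articles) and the function names `normalize` and `exact_match_score` are assumptions chosen for the example. BEM, reported alongside Exact Match, is a separate metric and is not reproduced here.

```python
import re
import string


def normalize(answer: str) -> str:
    """Lowercase, drop punctuation and English articles, collapse whitespace.

    These normalization choices are an assumption for illustration only.
    """
    answer = answer.lower()
    answer = answer.translate(str.maketrans("", "", string.punctuation))
    answer = re.sub(r"\b(a|an|the)\b", " ", answer)
    return " ".join(answer.split())


def exact_match_score(predictions, references):
    """Return the percentage of predictions that equal any of their references.

    predictions: list of predicted answer strings.
    references:  list of lists; references[i] holds the accepted answers
                 for predictions[i].
    """
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not predictions:
        return 0.0
    hits = 0
    for pred, refs in zip(predictions, references):
        accepted = {normalize(r) for r in refs}
        if normalize(pred) in accepted:
            hits += 1
    return 100.0 * hits / len(predictions)


if __name__ == "__main__":
    # Hypothetical answers, not taken from the WHOOPS! dataset.
    preds = ["A chess board.", "football"]
    refs = [["chess board"], ["soccer ball"]]
    print(exact_match_score(preds, refs))
```

Because exact string matching gives no credit to paraphrases ("football" versus "soccer ball" above), Exact Match scores tend to sit well below scores from more lenient metrics, which is consistent with the gap between the two columns in the table.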