CREPE: Can Vision-Language Foundation Models Reason Compositionally?

Zixian Ma Jerry Hong Mustafa Omer Gul Mona Gandhi Irena Gao Ranjay Krishna

Abstract

A fundamental characteristic common to both human vision and natural language is their compositional nature. Yet, despite the performance gains contributed by large-scale vision and language pretraining, we find that across 7 architectures trained with 4 algorithms on massive datasets, these models struggle at compositionality. To arrive at this conclusion, we introduce a new compositionality evaluation benchmark, CREPE, which measures two important aspects of compositionality identified by the cognitive science literature: systematicity and productivity. To measure systematicity, CREPE consists of a test dataset containing over 370K image-text pairs and three different seen-unseen splits. The three splits are designed to test models trained on three popular training datasets: CC-12M, YFCC-15M, and LAION-400M. We also generate 325K, 316K, and 309K hard negative captions for a subset of the pairs. To test productivity, CREPE contains 17K image-text pairs spanning nine different complexities, plus 183K hard negative captions with atomic, swapping, and negation foils. The datasets are generated by repurposing the Visual Genome scene graphs and region descriptions and applying handcrafted templates and GPT-3. For systematicity, we find that model performance decreases consistently when novel compositions dominate the retrieval set, with Recall@1 dropping by up to 12%. For productivity, models' retrieval success decays as complexity increases, frequently nearing random chance at high complexity. These results hold regardless of model and training dataset size.
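
CREPE evaluates image-to-text retrieval: for each image, the ground-truth caption is ranked against its hard negative captions, and Recall@1 records how often the true caption scores highest. Below is a minimal sketch of that metric for a single image, assuming precomputed embeddings; the function name, embedding dimension, and random inputs are illustrative, not the official CREPE code.

```python
import torch

def recall_at_1(image_emb: torch.Tensor, caption_embs: torch.Tensor) -> float:
    """
    image_emb:    (D,) embedding of one image.
    caption_embs: (N, D) embeddings where row 0 is the ground-truth caption
                  and rows 1..N-1 are hard negative captions.
    Returns 1.0 if the ground-truth caption is ranked first, else 0.0.
    """
    image_emb = image_emb / image_emb.norm()
    caption_embs = caption_embs / caption_embs.norm(dim=-1, keepdim=True)
    sims = caption_embs @ image_emb          # cosine similarity per caption
    return float(sims.argmax().item() == 0)  # ground truth sits at index 0

# Toy usage: one 512-d image embedding vs. 1 true caption + 4 hard negatives.
torch.manual_seed(0)
img = torch.randn(512)
caps = torch.randn(5, 512)
print(recall_at_1(img, caps))
```

Averaging this indicator over all evaluation images gives the Recall@1 numbers reported in the benchmark table below.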

Code Repositories

raivnlab/crepe (official implementation, PyTorch)

Benchmarks

Benchmark: image-retrieval-on-crepe-vision-language

Method                   | Recall@1 (HN-Atom + HN-Comp, SC) | Recall@1 (HN-Atom + HN-Comp, UC) | Recall@1 (HN-Atom, UC) | Recall@1 (HN-Comp, UC)
ViT-B-16 (LAION400M)     | 37.01                            | 30.81                            | 44.93                  | 59.00
RN50 (CC12M)             | 23.26                            | 19.96                            | 34.88                  | 45.27
ViT-L-14 (LAION400M)     | 39.44                            | 33.81                            | 47.86                  | 60.78
ViT-B-32 (LAION400M)     | 34.28                            | 28.00                            | 42.75                  | 54.80
RN101 (YFCC15M)          | 22.74                            | 20.50                            | 39.50                  | 39.56
ViT-B-16+240 (LAION400M) | 37.32                            | 32.26                            | 46.53                  | 60.19
Random                   | 9.09                             | 9.09                             | 20.00                  | 14.29
RN50 (YFCC15M)           | 23.38                            | 20.08                            | 39.85                  | 39.83
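
The CLIP variants in the table are standard OpenCLIP architectures, so their retrieval scoring can be reproduced in a few lines. Below is a minimal sketch of ranking a correct caption against a word-order foil with the open_clip library, assuming a ViT-B-32 checkpoint pretrained on LAION-400M; the pretrained tag, image path, and captions are illustrative, not taken from CREPE.

```python
import torch
import open_clip
from PIL import Image

# Load a LAION-400M-pretrained ViT-B-32 (pretrained tag is an assumption).
model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-B-32", pretrained="laion400m_e32"
)
tokenizer = open_clip.get_tokenizer("ViT-B-32")
model.eval()

# One image, one correct caption, and one compositional (swap-style) foil.
image = preprocess(Image.open("example.jpg")).unsqueeze(0)
texts = tokenizer(["a dog lying on a couch", "a couch lying on a dog"])

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(texts)
    image_features = image_features / image_features.norm(dim=-1, keepdim=True)
    text_features = text_features / text_features.norm(dim=-1, keepdim=True)
    sims = (image_features @ text_features.T).squeeze(0)

print(sims)  # a compositional model should score the correct caption higher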
