Zixian Ma, Jerry Hong, Mustafa Omer Gul, Mona Gandhi, Irena Gao, Ranjay Krishna

Abstract
A fundamental characteristic common to both human vision and natural language is their compositional nature. Yet, despite the performance gains from large-scale vision and language pretraining, we find that models struggle at compositionality: this holds across 7 architectures trained with 4 algorithms on massive datasets. To arrive at this conclusion, we introduce a new compositionality evaluation benchmark, CREPE, which measures two important aspects of compositionality identified by the cognitive science literature: systematicity and productivity. To measure systematicity, CREPE consists of a test dataset containing over $370K$ image-text pairs and three different seen-unseen splits. The three splits are designed to test models trained on three popular training datasets: CC-12M, YFCC-15M, and LAION-400M. We also generate $325K$, $316K$, and $309K$ hard negative captions for a subset of the pairs. To test productivity, CREPE contains $17K$ image-text pairs spanning nine complexity levels, plus $183K$ hard negative captions with atomic, swapping, and negation foils. The datasets are generated by repurposing the Visual Genome scene graphs and region descriptions and applying handcrafted templates and GPT-3. For systematicity, we find that model performance decreases consistently when novel compositions dominate the retrieval set, with Recall@1 dropping by up to $12\%$. For productivity, models' retrieval success decays as complexity increases, frequently nearing random chance at high complexity. These results hold regardless of model and training dataset size.
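The retrieval metric reported throughout is Recall@1 against a small candidate set containing the ground-truth caption plus hard negatives. As a minimal sketch (the embedding model and candidate-set construction here are stand-ins, not CREPE's actual pipeline), scoring one image against its candidate captions by cosine similarity looks like:

```python
import numpy as np

def recall_at_1(image_emb, caption_embs, correct_idx=0):
    """Recall@1 for one image: 1 if the ground-truth caption
    outscores every hard negative, else 0. Embeddings are assumed
    L2-normalized, so a dot product equals cosine similarity."""
    sims = caption_embs @ image_emb
    return int(np.argmax(sims) == correct_idx)

# Toy example: 1 ground-truth caption + 10 hard negatives,
# i.e. an 11-way candidate set.
rng = np.random.default_rng(0)
img = rng.normal(size=8)
img /= np.linalg.norm(img)
caps = rng.normal(size=(11, 8))
caps /= np.linalg.norm(caps, axis=1, keepdims=True)
caps[0] = img  # make the ground-truth caption a perfect match
print(recall_at_1(img, caps))  # 1
```

Averaging this indicator over all test images gives the Recall@1 numbers reported in the benchmark table below.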
Benchmarks
Benchmark: `image-retrieval-on-crepe-vision-language`

| Model (Training Data) | Recall@1 (HN-Atom + HN-Comp, SC) | Recall@1 (HN-Atom + HN-Comp, UC) | Recall@1 (HN-Atom, UC) | Recall@1 (HN-Comp, UC) |
|---|---|---|---|---|
| ViT-B-16 (LAION400M) | 37.01 | 30.81 | 44.93 | 59.00 |
| RN50 (CC12M) | 23.26 | 19.96 | 34.88 | 45.27 |
| ViT-L-14 (LAION400M) | 39.44 | 33.81 | 47.86 | 60.78 |
| ViT-B-32 (LAION400M) | 34.28 | 28.00 | 42.75 | 54.80 |
| RN101 (YFCC15M) | 22.74 | 20.50 | 39.50 | 39.56 |
| ViT-B-16+240 (LAION400M) | 37.32 | 32.26 | 46.53 | 60.19 |
| Random | 9.09 | 9.09 | 20.00 | 14.29 |
| RN50 (YFCC15M) | 23.38 | 20.08 | 39.85 | 39.83 |
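The Random row is consistent with Recall@1 chance being 1/(number of candidate captions): 9.09% ≈ 1/11, 20.00% = 1/5, and 14.29% ≈ 1/7, which implies candidate-set sizes of 11, 5, and 7 for the three settings (the sizes are inferred from the table, not stated in it):

```python
# Random-chance Recall@1 is 1 / (candidate-set size).
# Set sizes below are inferred from the Random row of the table.
settings = [("HN-Atom + HN-Comp", 11), ("HN-Atom", 5), ("HN-Comp", 7)]
baselines = {name: round(100 / n, 2) for name, n in settings}
print(baselines)  # {'HN-Atom + HN-Comp': 9.09, 'HN-Atom': 20.0, 'HN-Comp': 14.29}
```

Any model score should be read relative to these chance levels; e.g. RN101's 20.50 on the HN-Atom + HN-Comp unseen-compound split is only about twice chance.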