| GPT-4V (CoT, pick between two options) | 58.75 | 68.75 | 75.25 | The Role of Chain-of-Thought in Complex Vision-Language Reasoning Task | - |
| GPT-4V (pick between two options) | 39.25 | 46.25 | 69.25 | The Role of Chain-of-Thought in Complex Vision-Language Reasoning Task | - |
| FIBER (finetuned, Flickr30k) | 23.00 | 26.50 | 51.25 | Equivariant Similarity for Vision-Language Foundation Models | - |
| PaLI (finetuned, SNLI-VE + Synthetic Data) | 28.75 | 38.00 | 46.50 | What You See is What You Read? Improving Text-Image Alignment Evaluation | - |