Visual Reasoning On Winoground

Evaluation Metrics

Group Score
Image Score
Text Score
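
This page lists the three metrics without defining them. As a rough sketch, assuming the standard Winoground setup (each example pairs two captions c0, c1 with two images i0, i1, and a model assigns a similarity score to every caption-image combination), the scores could be computed as below. The `Example` class, its field names, and `winoground_scores` are hypothetical names chosen for illustration, not part of any leaderboard tooling.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Example:
    """Model similarity scores s(caption, image) for one example (hypothetical container)."""
    c0_i0: float  # caption 0 with image 0 (a correct match)
    c0_i1: float  # caption 0 with image 1 (a mismatch)
    c1_i0: float  # caption 1 with image 0 (a mismatch)
    c1_i1: float  # caption 1 with image 1 (a correct match)


def text_correct(e: Example) -> bool:
    # For each image, the matching caption must outscore the other caption.
    return e.c0_i0 > e.c1_i0 and e.c1_i1 > e.c0_i1


def image_correct(e: Example) -> bool:
    # For each caption, the matching image must outscore the other image.
    return e.c0_i0 > e.c0_i1 and e.c1_i1 > e.c1_i0


def group_correct(e: Example) -> bool:
    # An example counts for the group score only if both checks pass.
    return text_correct(e) and image_correct(e)


def winoground_scores(examples: List[Example]) -> Dict[str, float]:
    """Return the percentage of examples passing each check."""
    n = len(examples)
    return {
        "group": 100.0 * sum(group_correct(e) for e in examples) / n,
        "image": 100.0 * sum(image_correct(e) for e in examples) / n,
        "text": 100.0 * sum(text_correct(e) for e in examples) / n,
    }


if __name__ == "__main__":
    demo = [
        Example(c0_i0=0.9, c0_i1=0.1, c1_i0=0.2, c1_i1=0.8),
        Example(c0_i0=0.9, c0_i1=0.95, c1_i0=0.2, c1_i1=0.99),
    ]
    print(winoground_scores(demo))
```

Under this definition the group score can never exceed the image score or the text score, since it requires both checks to pass on the same example.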

Evaluation Results

Performance of each model on this benchmark

| Model | Group Score | Image Score | Text Score | Paper Title | Repository |
|---|---|---|---|---|---|
| GPT-4V (CoT, pick b/w two options) | 58.75 | 68.75 | 75.25 | The Role of Chain-of-Thought in Complex Vision-Language Reasoning Task | - |
| GPT-4V (pick b/w two options) | 39.25 | 46.25 | 69.25 | The Role of Chain-of-Thought in Complex Vision-Language Reasoning Task | - |
| MMICL + CoCoT | 50.75 | 52.5 | 64.25 | CoCoT: Contrastive Chain-of-Thought Prompting for Large Multimodal Models with Multiple Image Inputs | |
| GPT-4V + CoCoT | 44.5 | 49.5 | 58.5 | CoCoT: Contrastive Chain-of-Thought Prompting for Large Multimodal Models with Multiple Image Inputs | |
| OpenFlamingo + CoCoT | 41.5 | 55.25 | 58.25 | CoCoT: Contrastive Chain-of-Thought Prompting for Large Multimodal Models with Multiple Image Inputs | |
| GPT-4V | 37.75 | 42.5 | 54.5 | CoCoT: Contrastive Chain-of-Thought Prompting for Large Multimodal Models with Multiple Image Inputs | |
| FIBER (EqSim) | 27.5 | 32.00 | 51.5 | Equivariant Similarity for Vision-Language Foundation Models | |
| FIBER (finetuned, Flickr30k) | 23.00 | 26.50 | 51.25 | Equivariant Similarity for Vision-Language Foundation Models | |
| MMICL + CCoT | 47.5 | 48 | 51 | CoCoT: Contrastive Chain-of-Thought Prompting for Large Multimodal Models with Multiple Image Inputs | |
| OpenFlamingo + DDCoT | 39 | 47.25 | 47.5 | CoCoT: Contrastive Chain-of-Thought Prompting for Large Multimodal Models with Multiple Image Inputs | |
| VQ2 | 30.5 | 42.2 | 47 | What You See is What You Read? Improving Text-Image Alignment Evaluation | |
| MMICL + DDCoT | 36.75 | 45 | 46.75 | CoCoT: Contrastive Chain-of-Thought Prompting for Large Multimodal Models with Multiple Image Inputs | |
| X-VLM 16M | 21.2 | 24.5 | 46.7 | Measuring Progress in Fine-grained Vision-and-Language Understanding | |
| PaLI (ft SNLI-VE + Synthetic Data) | 28.75 | 38 | 46.5 | What You See is What You Read? Improving Text-Image Alignment Evaluation | |
| FIBER | 22.25 | 25.75 | 46.25 | Equivariant Similarity for Vision-Language Foundation Models | |
| MMICL (FLAN-T5-XXL) | 43.00 | 44.99 | 45.50 | MMICL: Empowering Vision-language Model with Multi-Modal In-Context Learning | |
| METER (EqSim) | 18.75 | 22.75 | 45.0 | Equivariant Similarity for Vision-Language Foundation Models | |
| PaLI (ft SNLI-VE) | 28.70 | 41.50 | 45.00 | What You See is What You Read? Improving Text-Image Alignment Evaluation | |
| Gemini + DDCoT | 23.75 | 25 | 45 | CoCoT: Contrastive Chain-of-Thought Prompting for Large Multimodal Models with Multiple Image Inputs | |
| X-VLM 4M | 21.5 | 26.7 | 44.0 | Measuring Progress in Fine-grained Vision-and-Language Understanding | |