CoCoT: Contrastive Chain-of-Thought Prompting for Large Multimodal Models with Multiple Image Inputs
Zhang Daoan ; Yang Junming ; Lyu Hanjia ; Jin Zijian ; Yao Yuan ; Chen Mingkai ; Luo Jiebo

Abstract
When exploring the development of Artificial General Intelligence (AGI), a critical task for these models involves interpreting and processing information from multiple image inputs. However, Large Multimodal Models (LMMs) encounter two issues in such scenarios: (1) a lack of fine-grained perception, and (2) a tendency to blend information across multiple images. We first extensively investigate the capability of LMMs to perceive fine-grained visual details when dealing with multiple input images. The research focuses on two aspects: first, image-to-image matching (to evaluate whether LMMs can effectively reason and pair relevant images), and second, multi-image-to-text matching (to assess whether LMMs can accurately capture and summarize detailed image information). We conduct evaluations on a range of both open-source and closed-source large models, including GPT-4V, Gemini, OpenFlamingo, and MMICL. To enhance model performance, we further develop a Contrastive Chain-of-Thought (CoCoT) prompting approach based on multi-input multimodal models. This method requires LMMs to compare the similarities and differences among multiple image inputs, and then guides the models to answer detailed questions about multi-image inputs based on the identified similarities and differences. Our experimental results showcase CoCoT's proficiency in enhancing the multi-image comprehension capabilities of large multimodal models.
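
The two-step idea described in the abstract (first compare the images, then answer using that comparison) can be illustrated with a minimal prompt-building sketch. The exact prompt wording used by the authors is not given here, so the strings below are assumptions, and `CoCoTPrompt`, `build_cocot_prompt`, and `as_single_prompt` are hypothetical names introduced only for illustration.

```python
from dataclasses import dataclass


@dataclass
class CoCoTPrompt:
    """Two-part text prompt: a comparison step and an answering step."""
    compare: str
    answer: str


def build_cocot_prompt(question: str, num_images: int = 2) -> CoCoTPrompt:
    """Build a contrastive-style prompt (assumed wording, not the paper's)."""
    if num_images < 2:
        raise ValueError("contrastive prompting needs at least two images")
    # Step 1: ask the model to contrast the image inputs.
    compare = (
        f"You are given {num_images} images. "
        "First, describe the similarities and differences among them."
    )
    # Step 2: ask the question, grounded in the comparison from step 1.
    answer = (
        "Based on the similarities and differences you identified, "
        f"answer the following question: {question}"
    )
    return CoCoTPrompt(compare=compare, answer=answer)


def as_single_prompt(prompt: CoCoTPrompt) -> str:
    """Join both steps into one text prompt to send alongside the images."""
    return f"{prompt.compare}\n{prompt.answer}"
```

The resulting text would be sent together with the images to whichever multimodal model is being evaluated; how images are attached depends on that model's interface and is left out of the sketch.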
Benchmarks
| Benchmark | Method | Group Score | Image Score | Text Score |
|---|---|---|---|---|
| visual-reasoning-on-winoground | Gemini + CoCoT | 27.75 | 32.5 | 40 |
| visual-reasoning-on-winoground | OpenFlamingo + CoCoT | 41.5 | 55.25 | 58.25 |
| visual-reasoning-on-winoground | OpenFlamingo | 33.25 | 41.25 | 39 |
| visual-reasoning-on-winoground | GPT-4V | 37.75 | 42.5 | 54.5 |
| visual-reasoning-on-winoground | GPT-4V + CoCoT | 44.5 | 49.5 | 58.5 |
| visual-reasoning-on-winoground | MMICL + CCoT | 47.5 | 48 | 51 |
| visual-reasoning-on-winoground | Gemini | 25 | 26 | 30.75 |
| visual-reasoning-on-winoground | MMICL + CoCoT | 50.75 | 52.5 | 64.25 |
| visual-reasoning-on-winoground | OpenFlamingo + DDCoT | 39 | 47.25 | 47.5 |
| visual-reasoning-on-winoground | MMICL + DDCoT | 36.75 | 45 | 46.75 |
| visual-reasoning-on-winoground | OpenFlamingo + CCoT | 20 | 27.5 | 42.5 |
| visual-reasoning-on-winoground | Gemini + DDCoT | 23.75 | 25 | 45 |
| visual-reasoning-on-winoground | Gemini + CCoT | 20.75 | 33 | 22.5 |
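
To make the effect of CoCoT easier to read off the table, the sketch below computes the score difference between each baseline model and its CoCoT variant. The numbers are copied from the table above; `SCORES` and `cocot_gains` are hypothetical names used only for this illustration. Only Gemini, OpenFlamingo, and GPT-4V are included, since the table lists no plain MMICL baseline.

```python
# (group, image, text) Winoground scores copied from the table above.
SCORES = {
    "Gemini": (25.0, 26.0, 30.75),
    "Gemini + CoCoT": (27.75, 32.5, 40.0),
    "OpenFlamingo": (33.25, 41.25, 39.0),
    "OpenFlamingo + CoCoT": (41.5, 55.25, 58.25),
    "GPT-4V": (37.75, 42.5, 54.5),
    "GPT-4V + CoCoT": (44.5, 49.5, 58.5),
}


def cocot_gains(scores):
    """Return {model: (group, image, text) gain of CoCoT over the baseline}."""
    gains = {}
    for method, base in scores.items():
        if " + " in method:
            continue  # skip prompted variants; only iterate over baselines
        cocot = scores.get(f"{method} + CoCoT")
        if cocot is None:
            continue  # no CoCoT row to compare against
        gains[method] = tuple(c - b for c, b in zip(cocot, base))
    return gains


if __name__ == "__main__":
    for model, (group, image, text) in cocot_gains(SCORES).items():
        print(f"{model}: group +{group}, image +{image}, text +{text}")
```

On these rows, every baseline improves on all three scores when CoCoT is added, with the largest gains for OpenFlamingo.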