5 months ago

CoCoT: Contrastive Chain-of-Thought Prompting for Large Multimodal Models with Multiple Image Inputs

Zhang Daoan ; Yang Junming ; Lyu Hanjia ; Jin Zijian ; Yao Yuan ; Chen Mingkai ; Luo Jiebo

Abstract

When exploring the development of Artificial General Intelligence (AGI), a critical task for these models involves interpreting and processing information from multiple image inputs. However, Large Multimodal Models (LMMs) encounter two issues in such scenarios: (1) a lack of fine-grained perception, and (2) a tendency to blend information across multiple images. We first extensively investigate the capability of LMMs to perceive fine-grained visual details when dealing with multiple input images. The research focuses on two aspects: first, image-to-image matching (to evaluate whether LMMs can effectively reason and pair relevant images), and second, multi-image-to-text matching (to assess whether LMMs can accurately capture and summarize detailed image information). We conduct evaluations on a range of both open-source and closed-source large models, including GPT-4V, Gemini, OpenFlamingo, and MMICL. To enhance model performance, we further develop a Contrastive Chain-of-Thought (CoCoT) prompting approach based on multi-input multimodal models. This method requires LMMs to compare the similarities and differences among multiple image inputs, and then guide the models to answer detailed questions about multi-image inputs based on the identified similarities and differences. Our experimental results showcase CoCoT's proficiency in enhancing the multi-image comprehension capabilities of large multimodal models.
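The abstract describes CoCoT as a staged prompt: the model first compares similarities and differences among the images, then answers the question based on that comparison. A minimal sketch of how such a prompt might be assembled follows. The function name `build_cocot_prompt` and the exact wording of each step are assumptions made for illustration; they are not the authors' published template.

```python
def build_cocot_prompt(question: str, num_images: int = 2) -> str:
    """Assemble a contrastive chain-of-thought style prompt.

    The step wording is hypothetical; only the order (similarities,
    differences, then the answer) follows the abstract.
    """
    if num_images < 2:
        raise ValueError("contrastive prompting needs at least two images")

    # Refer to the images by position so the model can keep them apart.
    image_refs = ", ".join(f"image {i + 1}" for i in range(num_images))

    steps = [
        f"You are given {num_images} images: {image_refs}.",
        "Step 1: Describe the similarities among the images.",
        "Step 2: Describe the differences among the images.",
        "Step 3: Using the similarities and differences you identified, "
        f"answer the question: {question}",
    ]
    return "\n".join(steps)


if __name__ == "__main__":
    print(build_cocot_prompt("Which image matches the caption?"))
```

The resulting string would be sent along with the images to whichever multimodal model is under test; how images are attached depends on that model's own interface and is left out here.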

Code Repositories

vista-h/gpt-4v_social_media
Mentioned in GitHub

Benchmarks

Benchmark                       Methodology            Group Score  Image Score  Text Score
visual-reasoning-on-winoground  Gemini + CoCoT         27.75        32.5         40
visual-reasoning-on-winoground  OpenFlamingo + CoCoT   41.5         55.25        58.25
visual-reasoning-on-winoground  OpenFlamingo           33.25        41.25        39
visual-reasoning-on-winoground  GPT-4V                 37.75        42.5         54.5
visual-reasoning-on-winoground  GPT-4V + CoCoT         44.5         49.5         58.5
visual-reasoning-on-winoground  MMICL + CCoT           47.5         48           51
visual-reasoning-on-winoground  Gemini                 25           26           30.75
visual-reasoning-on-winoground  MMICL + CoCoT          50.75        52.5         64.25
visual-reasoning-on-winoground  OpenFlamingo + DDCoT   39           47.25        47.5
visual-reasoning-on-winoground  MMICL + DDCoT          36.75        45           46.75
visual-reasoning-on-winoground  OpenFlamingo + CCoT    20           27.5         42.5
visual-reasoning-on-winoground  Gemini + DDCoT         23.75        25           45
visual-reasoning-on-winoground  Gemini + CCoT          20.75        33           22.5
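
For readers unfamiliar with the three Winoground metrics reported here, the sketch below shows how they are commonly defined. Each example has two captions and two images; the text score checks that each image prefers its own caption, the image score checks that each caption prefers its own image, and the group score requires both. The function name `winoground_scores` and the nested-list input layout are assumptions for illustration. The sketch also assumes a numeric caption-image score is available, which may differ from how the prompted models in this listing were actually judged.

```python
from typing import Dict, Sequence


def winoground_scores(sims: Sequence[Sequence[Sequence[float]]]) -> Dict[str, float]:
    """Compute text, image, and group scores as percentages.

    sims[n][c][i] is the model's score for caption c paired with image i
    in example n (hypothetical layout; c and i are each 0 or 1, and
    caption k is the correct match for image k).
    """
    text = image = group = 0
    for s in sims:
        # Text score: for each image, the matching caption scores higher.
        text_ok = s[0][0] > s[1][0] and s[1][1] > s[0][1]
        # Image score: for each caption, the matching image scores higher.
        image_ok = s[0][0] > s[0][1] and s[1][1] > s[1][0]
        text += text_ok
        image += image_ok
        # Group score: both conditions hold for the same example.
        group += text_ok and image_ok
    n = len(sims)
    return {
        "text": 100.0 * text / n,
        "image": 100.0 * image / n,
        "group": 100.0 * group / n,
    }
```

Because the group score requires both conditions on the same example, it can never exceed the lower of the text and image scores.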
