
Abstract

When exploring the development of Artificial General Intelligence (AGI), a critical task for these models involves interpreting and processing information from multiple image inputs. However, Large Multimodal Models (LMMs) encounter two issues in such scenarios: (1) a lack of fine-grained perception, and (2) a tendency to blend information across multiple images. We first extensively investigate the capability of LMMs to perceive fine-grained visual details when dealing with multiple input images. The research focuses on two aspects: first, image-to-image matching (to evaluate whether LMMs can effectively reason and pair relevant images), and second, multi-image-to-text matching (to assess whether LMMs can accurately capture and summarize detailed image information). We conduct evaluations on a range of both open-source and closed-source large models, including GPT-4V, Gemini, OpenFlamingo, and MMICL. To enhance model performance, we further develop a Contrastive Chain-of-Thought (CoCoT) prompting approach based on multi-input multimodal models. This method requires LMMs to compare the similarities and differences among multiple image inputs, and then guides the models to answer detailed questions about multi-image inputs based on the identified similarities and differences. Our experimental results showcase CoCoT's proficiency in enhancing the multi-image comprehension capabilities of large multimodal models.
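The two-stage idea described above (first compare the images, then answer from that comparison) can be illustrated with a minimal prompt-building sketch. This is not the authors' prompt: the function name `build_cocot_prompt`, the `<image i>` placeholders, and the exact step wording are all assumptions made for illustration, and real LMM APIs attach images in their own ways.

```python
def build_cocot_prompt(question: str, num_images: int) -> str:
    """Assemble a contrastive chain-of-thought style text prompt.

    Hypothetical helper: the wording below is illustrative only and is
    not taken from the paper.
    """
    if num_images < 2:
        raise ValueError("contrastive prompting needs at least two images")

    # Assumed placeholder tokens standing in for the attached images.
    image_refs = ", ".join(f"<image {i}>" for i in range(1, num_images + 1))

    steps = [
        f"You are given {num_images} images: {image_refs}.",
        # Stage 1: ask the model to contrast the inputs explicitly.
        "Step 1: Describe the similarities and differences among the images.",
        # Stage 2: ground the answer in the comparison from Stage 1.
        "Step 2: Using those similarities and differences, answer the question.",
        f"Question: {question}",
    ]
    return "\n".join(steps)


if __name__ == "__main__":
    print(build_cocot_prompt("Which image shows the dog on the left?", 2))
```

Placing the comparison step before the question is meant to push the model to keep the images distinct before it answers, which is the blending issue the abstract highlights.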
