Abstract

Large multimodal models (LMMs) extend large language models (LLMs) with multi-sensory skills, such as visual understanding, to achieve stronger generic intelligence. In this paper, we analyze the latest model, GPT-4V(ision), to deepen the understanding of LMMs. The analysis focuses on the intriguing tasks that GPT-4V can perform, containing test samples to probe the quality and genericity of GPT-4V's capabilities, its supported inputs and working modes, and the effective ways to prompt the model. In our approach to exploring GPT-4V, we curate and organize a collection of carefully designed qualitative samples spanning a variety of domains and tasks. Observations from these samples demonstrate that GPT-4V's unprecedented ability in processing arbitrarily interleaved multimodal inputs and the genericity of its capabilities together make GPT-4V a powerful multimodal generalist system. Furthermore, GPT-4V's unique capability of understanding visual markers drawn on input images can give rise to new human-computer interaction methods such as visual referring prompting. We conclude the report with in-depth discussions on the emerging application scenarios and the future research directions for GPT-4V-based systems. We hope that this preliminary exploration will inspire future research on the next-generation multimodal task formulation, new ways to exploit and enhance LMMs to solve real-world problems, and gaining better understanding of multimodal foundation models. Finally, we acknowledge that the model under our study is solely the product of OpenAI's innovative work, and they should be fully credited for its development. Please see the GPT-4V contributions paper for the authorship and credit attribution: https://cdn.openai.com/contributions/gpt-4v.pdf
