Can Multimodal Foundation Models Understand Schematic Diagrams? An Empirical Study on Information-Seeking QA over Scientific Papers
Yilun Zhao, Chengye Wang, Chuhan Li, Arman Cohan

Abstract
This paper introduces MISS-QA, the first benchmark specifically designed to evaluate the ability of models to interpret schematic diagrams within scientific literature. MISS-QA comprises 1,500 expert-annotated examples over 465 scientific papers. In this benchmark, models are tasked with interpreting schematic diagrams that illustrate research overviews and answering corresponding information-seeking questions based on the broader context of the paper. We assess the performance of 18 frontier multimodal foundation models, including o4-mini, Gemini-2.5-Flash, and Qwen2.5-VL. We reveal a significant performance gap between these models and human experts on MISS-QA. Our analysis of model performance on unanswerable questions and our detailed error analysis further highlight the strengths and limitations of current models, offering key insights to enhance models in comprehending multimodal scientific literature.