Can Multimodal Foundation Models Understand Schematic Diagrams? An Empirical Study on Information-Seeking QA over Scientific Papers

Yilun Zhao Chengye Wang Chuhan Li Arman Cohan

Abstract

This paper introduces MISS-QA, the first benchmark specifically designed to evaluate the ability of models to interpret schematic diagrams within scientific literature. MISS-QA comprises 1,500 expert-annotated examples over 465 scientific papers. In this benchmark, models are tasked with interpreting schematic diagrams that illustrate research overviews and answering corresponding information-seeking questions based on the broader context of the paper. We assess the performance of 18 frontier multimodal foundation models, including o4-mini, Gemini-2.5-Flash, and Qwen2.5-VL. We reveal a significant performance gap between these models and human experts on MISS-QA. Our analysis of model performance on unanswerable questions and our detailed error analysis further highlight the strengths and limitations of current models, offering key insights to enhance models in comprehending multimodal scientific literature.
