MAmmoTH-VL: Eliciting Multimodal Reasoning with Instruction Tuning at Scale
Jarvis Guo Tuney Zheng Yuelin Bai Bo Li Yubo Wang King Zhu Yizhi Li Graham Neubig Wenhu Chen Xiang Yue

Abstract
Open-source multimodal large language models (MLLMs) have shown significant potential in a broad range of multimodal tasks. However, their reasoning capabilities remain constrained by existing instruction-tuning datasets, which were predominantly repurposed from academic datasets such as VQA, AI2D, and ChartQA. These datasets target simplistic tasks and provide only phrase-level answers without any intermediate rationales. To address these challenges, we introduce a scalable and cost-effective method to construct a large-scale multimodal instruction-tuning dataset with rich intermediate rationales designed to elicit chain-of-thought (CoT) reasoning. Using only open models, we create a dataset containing 12M instruction-response pairs that cover diverse, reasoning-intensive tasks with detailed and faithful rationales. Experiments demonstrate that training MLLMs on this dataset significantly improves reasoning capabilities, achieving state-of-the-art performance on benchmarks such as MathVerse (+8.1%), MMMU-Pro (+7%), and MuirBench (+13.3%). Additionally, the model demonstrates notable improvements of up to 4% on non-reasoning-based benchmarks. Ablation studies further highlight the importance of key components, such as rewriting and self-filtering, in the dataset construction process.
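The abstract names rewriting and self-filtering as the key components of dataset construction. As a rough illustration only (not the authors' released pipeline), the Python sketch below shows one way such a two-stage rewrite-then-self-filter loop could be structured; the `generate` callable and both prompt templates are hypothetical placeholders standing in for calls to an open MLLM.

```python
# Hypothetical sketch of a rewrite-then-self-filter loop, assuming a
# `generate(image, prompt) -> str` callable that wraps an open MLLM.
# Prompt wording and function names are illustrative, not from the paper.

REWRITE_PROMPT = (
    "Given the image, question, and short answer, write a detailed "
    "step-by-step rationale that ends with the same final answer.\n"
    "Question: {question}\nShort answer: {answer}"
)

FILTER_PROMPT = (
    "Does the following rationale correctly and faithfully support the "
    "answer '{answer}' to the question '{question}'? Reply YES or NO.\n"
    "Rationale: {rationale}"
)

def build_cot_dataset(samples, generate):
    """samples: iterable of dicts with 'image', 'question', 'answer' keys."""
    kept = []
    for s in samples:
        # Stage 1: rewrite the phrase-level answer into a CoT rationale.
        rationale = generate(
            s["image"],
            REWRITE_PROMPT.format(question=s["question"], answer=s["answer"]),
        )
        # Stage 2: self-filter -- ask the model to verify its own rewrite,
        # discarding pairs whose rationale is judged unfaithful.
        verdict = generate(
            s["image"],
            FILTER_PROMPT.format(
                answer=s["answer"], question=s["question"], rationale=rationale
            ),
        )
        if verdict.strip().upper().startswith("YES"):
            kept.append({**s, "response": rationale})
    return kept
```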
Benchmarks
| Benchmark | Model | Metric |
|---|---|---|
| Visual Question Answering on MM-Vet | MAmmoTH-VL-8B (SI) | GPT-4 score: 60.6 |
| Visual Question Answering on MM-Vet | MAmmoTH-VL-8B | GPT-4 score: 62.3 |