MAmmoTH-VL: Eliciting Multimodal Reasoning with Instruction Tuning at Scale
Jarvis Guo Tuney Zheng Yuelin Bai Bo Li Yubo Wang King Zhu Yizhi Li Graham Neubig Wenhu Chen Xiang Yue

Abstract
Open-source multimodal large language models (MLLMs) have shown significant potential in a broad range of multimodal tasks. However, their reasoning capabilities remain constrained by existing instruction-tuning datasets, which were predominantly repurposed from academic datasets such as VQA, AI2D, and ChartQA. These datasets target simplistic tasks and provide only phrase-level answers without any intermediate rationales. To address these challenges, we introduce a scalable and cost-effective method to construct a large-scale multimodal instruction-tuning dataset with rich intermediate rationales designed to elicit chain-of-thought (CoT) reasoning. Using only open models, we create a dataset containing 12M instruction-response pairs that cover diverse, reasoning-intensive tasks with detailed and faithful rationales. Experiments demonstrate that training MLLMs on this dataset significantly improves reasoning capabilities, achieving state-of-the-art performance on benchmarks such as MathVerse (+8.1%), MMMU-Pro (+7%), and MuirBench (+13.3%). Additionally, the model demonstrates notable improvements of up to 4% on non-reasoning-based benchmarks. Ablation studies further highlight the importance of key components, such as rewriting and self-filtering, in the dataset construction process.
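The abstract names rewriting and self-filtering as the key components of dataset construction. As a rough illustration only (not the authors' released pipeline), the Python sketch below shows one way such a two-stage rewrite-then-self-filter loop could be structured; the `generate` callable and both prompt templates are hypothetical placeholders standing in for calls to an open MLLM.

```python
# Hypothetical sketch of a rewrite-then-self-filter loop, assuming a
# `generate(image, prompt) -> str` callable that wraps an open MLLM.
# Prompt wording and function names are illustrative, not from the paper.

REWRITE_PROMPT = (
    "Given the image, question, and short answer, write a detailed "
    "step-by-step rationale that ends with the same final answer.\n"
    "Question: {question}\nShort answer: {answer}"
)

FILTER_PROMPT = (
    "Does the following rationale correctly and faithfully support the "
    "answer '{answer}' to the question '{question}'? Reply YES or NO.\n"
    "Rationale: {rationale}"
)

def build_cot_dataset(samples, generate):
    """samples: iterable of dicts with 'image', 'question', 'answer' keys."""
    kept = []
    for s in samples:
        # Stage 1: rewrite the phrase-level answer into a CoT rationale.
        rationale = generate(
            s["image"],
            REWRITE_PROMPT.format(question=s["question"], answer=s["answer"]),
        )
        # Stage 2: self-filter -- ask the model to verify its own rewrite,
        # discarding pairs whose rationale is judged unfaithful.
        verdict = generate(
            s["image"],
            FILTER_PROMPT.format(
                answer=s["answer"], question=s["question"], rationale=rationale
            ),
        )
        if verdict.strip().upper().startswith("YES"):
            kept.append({**s, "response": rationale})
    return kept
```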
Benchmarks
| Benchmark | Model | Metric |
|---|---|---|
| Visual Question Answering on MM-Vet | MAmmoTH-VL-8B (SI) | GPT-4 score: 60.6 |
| Visual Question Answering on MM-Vet | MAmmoTH-VL-8B | GPT-4 score: 62.3 |