4 months ago

MAmmoTH-VL: Eliciting Multimodal Reasoning with Instruction Tuning at Scale

Jarvis Guo Tuney Zheng Yuelin Bai Bo Li Yubo Wang King Zhu Yizhi Li Graham Neubig Wenhu Chen Xiang Yue

Abstract

Open-source multimodal large language models (MLLMs) have shown significant potential in a broad range of multimodal tasks. However, their reasoning capabilities remain constrained by existing instruction-tuning datasets, which were predominately repurposed from academic datasets such as VQA, AI2D, and ChartQA. These datasets target simplistic tasks, and only provide phrase-level answers without any intermediate rationales. To address these challenges, we introduce a scalable and cost-effective method to construct a large-scale multimodal instruction-tuning dataset with rich intermediate rationales designed to elicit CoT reasoning. Using only open models, we create a dataset containing 12M instruction-response pairs to cover diverse, reasoning-intensive tasks with detailed and faithful rationales. Experiments demonstrate that training MLLMs on this dataset significantly improves reasoning capabilities, achieving state-of-the-art performance on benchmarks such as MathVerse (+8.1%), MMMU-Pro (+7%), and MuirBench (+13.3%). Additionally, the model demonstrates notable improvements of up to 4% on non-reasoning-based benchmarks. Ablation studies further highlight the importance of key components, such as rewriting and self-filtering, in the dataset construction process.

Code Repositories

mammoth-vl/mammoth-vl
pytorch
Mentioned in GitHub

Benchmarks

Benchmark                              Methodology          Metrics
visual-question-answering-on-mm-vet    MAmmoTH-VL-8B (SI)   GPT-4 score: 60.6
visual-question-answering-on-mm-vet    MAmmoTH-VL-8B        GPT-4 score: 62.3
