
Abstract

Building on the success of text-based reasoning models like DeepSeek-R1, extending these capabilities to multimodal reasoning holds great promise. While recent works have attempted to adapt DeepSeek-R1-style reinforcement learning (RL) training paradigms to multimodal large language models (MLLMs), focusing on domain-specific tasks like math and visual perception, a critical question remains: how can we achieve general-purpose visual-language reasoning through RL? To address this challenge, we make three key efforts: (1) a novel Scalable Multimodal QA Synthesis pipeline that autonomously generates context-aware, reasoning-centric question-answer (QA) pairs directly from given images; (2) the open-source WeThink dataset, containing over 120K multimodal QA pairs with annotated reasoning paths, curated from 18 diverse dataset sources and covering various question domains; and (3) a comprehensive exploration of RL on our dataset, incorporating a hybrid reward mechanism that combines rule-based verification with model-based assessment to optimize RL training efficiency across various task domains. Across 14 diverse MLLM benchmarks, we demonstrate that our WeThink dataset significantly enhances performance, from mathematical reasoning to diverse general multimodal tasks. Moreover, we show that our automated data pipeline can continuously increase data diversity to further improve model performance.
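
To make the idea of a hybrid reward concrete, the following is a minimal sketch of one way rule-based verification and model-based assessment could be combined. It is not the implementation used for WeThink: the function names, the `verifiable` flag, the exact-match rule, and the `toy_judge` stand-in (a token-overlap score used here in place of a real judge model) are all assumptions made for illustration.

```python
from typing import Callable, Optional

# Hypothetical signature for a judge: (prediction, reference) -> score.
Judge = Callable[[str, str], float]


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial formatting differences are ignored."""
    return " ".join(text.strip().lower().split())


def rule_based_reward(prediction: str, reference: str) -> float:
    """Assumed rule: exact match after normalization gives 1.0, otherwise 0.0."""
    return 1.0 if normalize(prediction) == normalize(reference) else 0.0


def hybrid_reward(
    prediction: str,
    reference: str,
    verifiable: bool,
    judge: Optional[Judge] = None,
) -> float:
    """Route verifiable answers to the rule; send open-ended answers to the judge."""
    if verifiable:
        return rule_based_reward(prediction, reference)
    if judge is None:
        raise ValueError("an open-ended answer needs a judge")
    # Clamp the judge score to [0, 1] so both reward sources share one scale.
    return min(1.0, max(0.0, judge(prediction, reference)))


def toy_judge(prediction: str, reference: str) -> float:
    """Stand-in for a model-based judge: fraction of reference tokens recovered."""
    pred_tokens = set(normalize(prediction).split())
    ref_tokens = set(normalize(reference).split())
    if not ref_tokens:
        return 0.0
    return len(pred_tokens & ref_tokens) / len(ref_tokens)
```

In this sketch, a numeric or multiple-choice answer would be scored with `hybrid_reward(pred, ref, verifiable=True)`, while a free-form description would be scored with `hybrid_reward(pred, ref, verifiable=False, judge=toy_judge)`. In practice the judge would be replaced by a call to an assessment model.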
