WeThink: Toward General-purpose Vision-Language Reasoning via Reinforcement Learning

Abstract

Building on the success of text-based reasoning models like DeepSeek-R1, extending these capabilities to multimodal reasoning holds great promise. While recent works have attempted to adapt DeepSeek-R1-style reinforcement learning (RL) training paradigms to multimodal large language models (MLLM), focusing on domain-specific tasks like math and visual perception, a critical question remains: How can we achieve general-purpose visual-language reasoning through RL? To address this challenge, we make three key efforts: (1) A novel Scalable Multimodal QA Synthesis pipeline that autonomously generates context-aware, reasoning-centric question-answer (QA) pairs directly from the given images. (2) The open-source WeThink dataset containing over 120K multimodal QA pairs with annotated reasoning paths, curated from 18 diverse dataset sources and covering various question domains. (3) A comprehensive exploration of RL on our dataset, incorporating a hybrid reward mechanism that combines rule-based verification with model-based assessment to optimize RL training efficiency across various task domains. Across 14 diverse MLLM benchmarks, we demonstrate that our WeThink dataset significantly enhances performance, from mathematical reasoning to diverse general multimodal tasks. Moreover, we show that our automated data pipeline can continuously increase data diversity to further improve model performance.
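
The hybrid reward mechanism named in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify the rules, the judge model, or how the two are combined. The function names, the `verifiable` flag, the exact-match rule, and the token-overlap `toy_judge` (a stand-in for a model-based assessor) are all assumptions made for illustration. The sketch routes answers that can be checked mechanically to a rule, and everything else to a judge that returns a score in [0, 1].

```python
import re
from typing import Callable


def rule_based_reward(prediction: str, reference: str) -> float:
    """Hypothetical rule: exact match after whitespace and case normalization."""
    def normalize(text: str) -> str:
        return re.sub(r"\s+", " ", text.strip().lower())

    return 1.0 if normalize(prediction) == normalize(reference) else 0.0


def toy_judge(prediction: str, reference: str) -> float:
    """Stand-in for a model-based assessor: Jaccard overlap of tokens.

    A real setup would presumably query a judge model instead.
    """
    pred_tokens = set(prediction.lower().split())
    ref_tokens = set(reference.lower().split())
    union = pred_tokens | ref_tokens
    if not union:
        return 0.0
    return len(pred_tokens & ref_tokens) / len(union)


def hybrid_reward(
    prediction: str,
    reference: str,
    verifiable: bool,
    judge: Callable[[str, str], float],
) -> float:
    """Use the rule when the answer is mechanically checkable, else the judge."""
    if verifiable:
        return rule_based_reward(prediction, reference)
    # Clamp the judge score so both branches share the same [0, 1] range.
    return min(1.0, max(0.0, judge(prediction, reference)))


if __name__ == "__main__":
    # Checkable answer (e.g. a numeric result): handled by the rule.
    print(hybrid_reward(" 42 ", "42", True, toy_judge))
    # Open-ended answer: handled by the judge.
    print(hybrid_reward("a red car", "a red truck", False, toy_judge))
```

Keeping both branches on the same [0, 1] scale is one simple way to let a single RL objective consume rewards from different task domains; other combinations, such as weighted sums, are equally plausible.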

Code Repositories

yangjie-cv/wethink (Official, PyTorch, mentioned in GitHub)

Benchmarks

Benchmark | Methodology | Metrics
optical-character-recognition-on-ocrbench-v2-chinese | WeThink-Qwen2.5VL-7B | Accuracy: 55.8
