3 months ago

VisionThink: Smart and Efficient Vision Language Model via Reinforcement Learning

Senqiao Yang Junyi Li Xin Lai Bei Yu Hengshuang Zhao Jiaya Jia

Abstract

Recent advancements in vision-language models (VLMs) have improved performance by increasing the number of visual tokens, which are often significantly longer than text tokens. However, we observe that most real-world scenarios do not require such an extensive number of visual tokens. While the performance drops significantly in a small subset of OCR-related tasks, models still perform accurately in most other general VQA tasks with only 1/4 resolution. Therefore, we propose to dynamically process distinct samples with different resolutions, and present a new paradigm for visual token compression, namely, VisionThink. It starts with a downsampled image and smartly decides whether it is sufficient for problem solving. Otherwise, the model could output a special token to request the higher-resolution image. Compared to existing Efficient VLM methods that compress tokens using fixed pruning ratios or thresholds, VisionThink autonomously decides whether to compress tokens case by case. As a result, it demonstrates strong fine-grained visual understanding capability on OCR-related tasks, and meanwhile saves substantial visual tokens on simpler tasks. We adopt reinforcement learning and propose the LLM-as-Judge strategy to successfully apply RL to general VQA tasks. Moreover, we carefully design a reward function and penalty mechanism to achieve a stable and reasonable image resize call ratio. Extensive experiments demonstrate the superiority, efficiency, and effectiveness of our method. Our code is available at https://github.com/dvlab-research/VisionThink.
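
The control flow described in the abstract (answer from a downsampled image first, and fall back to the full image only when the model asks for it) can be illustrated with a small sketch. This is not the authors' implementation: the token string `<request_high_res>`, the helper names, the toy image format, and the stand-in `toy_model` are all assumptions made for illustration, and the stride-2 downsampling is just one simple way to keep a quarter of the pixels.

```python
from typing import Callable, List, Tuple

Image = List[List[int]]  # toy grayscale image: rows of pixel values

# Hypothetical marker; the abstract only says "a special token".
RESIZE_TOKEN = "<request_high_res>"


def count_pixels(image: Image) -> int:
    """Total number of pixels, used here as a rough proxy for visual token cost."""
    return sum(len(row) for row in image)


def downsample(image: Image, stride: int = 2) -> Image:
    """Keep every `stride`-th pixel in each dimension (stride 2 keeps 1/4 of the pixels)."""
    return [row[::stride] for row in image[::stride]]


def answer_adaptively(
    model: Callable[[Image, str], str],
    image: Image,
    question: str,
) -> Tuple[str, int]:
    """Query with the downsampled image first; re-query at full size only on request.

    Returns (answer, pixels_processed).
    """
    small = downsample(image)
    cost = count_pixels(small)
    reply = model(small, question)
    if RESIZE_TOKEN not in reply:
        return reply, cost  # the low-resolution view was judged sufficient
    cost += count_pixels(image)  # the second pass also pays for the full image
    return model(image, question), cost


def toy_model(image: Image, question: str) -> str:
    """Stand-in for a VLM: requests more detail for 'read' questions on small inputs."""
    n = count_pixels(image)
    if "read" in question and n < 16:
        return RESIZE_TOKEN
    return f"answer from {n} pixels"


full_image = [[0] * 4 for _ in range(4)]  # 4x4 toy image, 16 pixels
easy = answer_adaptively(toy_model, full_image, "what color is the sky?")
hard = answer_adaptively(toy_model, full_image, "read the sign")
```

In this toy setup the general question is answered from the 4-pixel view alone, while the OCR-style question triggers a second pass and processes 20 pixels in total. That mirrors the trade-off the abstract describes: savings on simple samples, at the price of extra cost on the samples that need high resolution.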
