Beyond the Trade-off: Self-Supervised Reinforcement Learning for Reasoning Models' Instruction Following

Qingyu Ren, Qianyu He, Bowei Zhang, Jie Zeng, Jiaqing Liang, Yanghua Xiao, Weikang Zhou, Zeye Sun, Fei Yu


Abstract

Reasoning models excel in complex problem solving but exhibit a concerning trade-off between reasoning capabilities and instruction-following abilities. Existing approaches for improving instruction following rely on stronger external models, creating methodological bottlenecks and practical limitations, including increased costs and accessibility constraints. We propose a self-supervised RL framework that leverages reasoning models' own internal signals to improve instruction-following capabilities without external supervision. Extensive experiments demonstrate that our framework significantly improves instruction-following capabilities while maintaining reasoning performance, offering a scalable and cost-effective approach to enhancing instruction following in reasoning models. The data and code are publicly available at https://github.com/Rainier-rq/verl-if.
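The abstract does not specify how the reward signal is constructed. As an illustrative sketch only (the function and constraint names below are hypothetical, not taken from the paper or the verl-if repository), instruction-following rewards for RL are often built from programmatically verifiable constraint checks on the model's response, yielding a scalar reward without any external judge model:

```python
# Illustrative sketch: scoring instruction following with self-contained,
# verifiable constraint checks. All names here are hypothetical; the paper's
# actual internal reward signal may differ.

def check_max_words(response: str, limit: int) -> bool:
    """Constraint: response must contain at most `limit` words."""
    return len(response.split()) <= limit

def check_keyword(response: str, keyword: str) -> bool:
    """Constraint: response must mention `keyword` (case-insensitive)."""
    return keyword.lower() in response.lower()

def instruction_reward(response: str, constraints: list) -> float:
    """Fraction of constraints satisfied; usable as a scalar RL reward."""
    if not constraints:
        return 0.0
    passed = sum(1 for check in constraints if check(response))
    return passed / len(constraints)

# Example: a prompt that demands brevity and a required keyword.
constraints = [
    lambda r: check_max_words(r, 20),
    lambda r: check_keyword(r, "reinforcement"),
]
print(instruction_reward("Self-supervised reinforcement learning sketch.", constraints))
# → 1.0 (both constraints satisfied)
```

Because each check is deterministic and computed from the response itself, this style of reward avoids the cost and accessibility constraints of querying a stronger external model, which is the practical limitation the abstract highlights.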
