Beyond the Trade-off: Self-Supervised Reinforcement Learning for Reasoning Models' Instruction Following

Qingyu Ren, Qianyu He, Bowei Zhang, Jie Zeng, Jiaqing Liang, Yanghua Xiao, Weikang Zhou, Zeye Sun, Fei Yu


Abstract

Reasoning models excel in complex problem solving but exhibit a concerning trade-off between reasoning capabilities and instruction-following abilities. Existing approaches for improving instruction following rely on stronger external models, creating methodological bottlenecks and practical limitations, including increased costs and accessibility constraints. We propose a self-supervised RL framework that leverages reasoning models' own internal signals to improve instruction-following capabilities without external supervision. Extensive experiments demonstrate that our framework significantly improves instruction-following capabilities while maintaining reasoning performance, offering a scalable and cost-effective approach to enhancing instruction following in reasoning models. The data and code are publicly available at https://github.com/Rainier-rq/verl-if.
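The abstract does not specify how the reward signal is constructed. As an illustrative sketch only (the function and constraint names below are hypothetical, not taken from the paper or the verl-if repository), instruction-following rewards for RL are often built from programmatically verifiable constraint checks on the model's response, yielding a scalar reward without any external judge model:

```python
# Illustrative sketch: scoring instruction following with self-contained,
# verifiable constraint checks. All names here are hypothetical; the paper's
# actual internal reward signal may differ.

def check_max_words(response: str, limit: int) -> bool:
    """Constraint: response must contain at most `limit` words."""
    return len(response.split()) <= limit

def check_keyword(response: str, keyword: str) -> bool:
    """Constraint: response must mention `keyword` (case-insensitive)."""
    return keyword.lower() in response.lower()

def instruction_reward(response: str, constraints: list) -> float:
    """Fraction of constraints satisfied; usable as a scalar RL reward."""
    if not constraints:
        return 0.0
    passed = sum(1 for check in constraints if check(response))
    return passed / len(constraints)

# Example: a prompt that demands brevity and a required keyword.
constraints = [
    lambda r: check_max_words(r, 20),
    lambda r: check_keyword(r, "reinforcement"),
]
print(instruction_reward("Self-supervised reinforcement learning sketch.", constraints))
# → 1.0 (both constraints satisfied)
```

Because each check is deterministic and computed from the response itself, this style of reward avoids the cost and accessibility constraints of querying a stronger external model, which is the practical limitation the abstract highlights.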
