Evolving Language Models without Labels: Majority Drives Selection, Novelty Promotes Variation
Yujun Zhou, Zhenwen Liang, Haolin Liu, Wenhao Yu, Kishan Panaganti, Linfeng Song, Dian Yu, Xiangliang Zhang, Haitao Mi, Dong Yu

Abstract
Large language models (LLMs) are increasingly trained with reinforcement learning from verifiable rewards (RLVR), yet real-world deployment demands models that can self-improve without labels or external judges. Existing label-free methods (confidence minimization, self-consistency, or majority-vote objectives) stabilize learning but steadily shrink exploration, causing an entropy collapse: generations become shorter, less diverse, and brittle. Unlike prior approaches such as Test-Time Reinforcement Learning (TTRL), which primarily adapt models to the immediate unlabeled dataset at hand, our goal is broader: to enable general improvements without sacrificing the model's inherent exploration capacity and generalization ability, i.e., evolving. We formalize this issue and propose EVolution-Oriented and Label-free Reinforcement Learning (EVOL-RL), a simple rule that couples stability with variation under a label-free setting. EVOL-RL keeps the majority-voted answer as a stable anchor (selection) while adding a novelty-aware reward that favors responses whose reasoning differs from what has already been produced (variation), measured in semantic space. Implemented with GRPO, EVOL-RL also uses asymmetric clipping to preserve strong signals and an entropy regularizer to sustain search. This majority-for-selection + novelty-for-variation design prevents collapse, maintains longer and more informative chains of thought, and improves both pass@1 and pass@n. EVOL-RL consistently outperforms the majority-only TTRL baseline; e.g., training on label-free AIME24 lifts Qwen3-4B-Base AIME25 pass@1 from TTRL's 4.6% to 16.4%, and pass@16 from 18.5% to 37.9%. EVOL-RL not only prevents diversity collapse but also unlocks stronger generalization across domains (e.g., GPQA). Furthermore, we demonstrate that EVOL-RL also boosts performance in the RLVR setting, highlighting its broad applicability.
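To make the majority-for-selection + novelty-for-variation idea concrete, the sketch below shows one way a label-free reward of this kind could be shaped: a majority-vote anchor over final answers plus a novelty bonus from pairwise cosine similarity of reasoning embeddings. This is a minimal illustration under stated assumptions, not the authors' implementation; the function name `evol_rl_rewards`, the `novelty_weight` parameter, and the choice of embedding model are hypothetical.

```python
# Minimal sketch (illustrative, not the paper's code) of a label-free reward
# combining a majority-vote anchor (selection) with a semantic novelty bonus
# (variation). Names and weights here are assumptions for illustration only.
from collections import Counter
import numpy as np

def evol_rl_rewards(answers, reasoning_embeddings, novelty_weight=0.5):
    """Score a group of sampled responses without labels.

    answers: list of final answers, one per sampled response.
    reasoning_embeddings: (n, d) array embedding each response's reasoning
        trace (produced by any sentence-embedding model of your choice).
    """
    n = len(answers)
    majority_answer, _ = Counter(answers).most_common(1)[0]

    # Selection: reward agreement with the majority-voted answer.
    selection = np.array([1.0 if a == majority_answer else 0.0 for a in answers])

    # Variation: reward reasoning that is dissimilar to the rest of the group,
    # measured by average pairwise cosine similarity in embedding space.
    X = reasoning_embeddings / np.linalg.norm(reasoning_embeddings, axis=1, keepdims=True)
    sim = X @ X.T
    np.fill_diagonal(sim, 0.0)
    novelty = 1.0 - sim.sum(axis=1) / (n - 1)  # higher = more novel reasoning

    return selection + novelty_weight * novelty
```

The combined scores would then serve as group-relative rewards inside a GRPO-style update; the asymmetric clipping and entropy regularization mentioned in the abstract act at that policy-update stage and are not shown here.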