Abstract

Reinforcement Learning with Verifiable Rewards (RLVR) has proven effective for training large language models (LLMs) on complex reasoning tasks, such as mathematical problem solving. A prerequisite for the scalability of RLVR is a high-quality problem set with precise and verifiable answers. However, the scarcity of well-crafted human-labeled math problems and limited-verification answers in existing distillation-oriented synthetic datasets limit their effectiveness in RL. Additionally, most problem synthesis strategies indiscriminately expand the problem set without considering the model's capabilities, leading to low efficiency in generating useful questions. To mitigate this issue, we introduce a Self-aware Weakness-driven problem Synthesis framework (SwS) that systematically identifies model deficiencies and leverages them for problem augmentation. Specifically, we define weaknesses as questions that the model consistently fails to learn through its iterative sampling during RL training. We then extract the core concepts from these failure cases and synthesize new problems to strengthen the model's weak areas in subsequent augmented training, enabling it to focus on and gradually overcome its weaknesses. Without relying on external knowledge distillation, our framework enables robust generalization by empowering the model to self-identify and address its weaknesses in RL, yielding average performance gains of 10.0% and 7.7% on 7B and 32B models across eight mainstream reasoning benchmarks.
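
The weakness-identification step described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the 0.5 accuracy threshold, the "below threshold in every epoch" criterion, and the proportional budget rule are all assumptions made for illustration, and the abstract does not specify any of them.

```python
from collections import Counter


def find_weak_problems(accuracy_history, threshold=0.5):
    """Return ids of problems whose sampled accuracy stayed below
    `threshold` in every recorded epoch (hypothetical criterion)."""
    return sorted(
        pid
        for pid, accs in accuracy_history.items()
        if accs and all(a < threshold for a in accs)
    )


def allocate_budget(weak_ids, concepts_by_problem, total_budget):
    """Split a synthesis budget across concepts in proportion to how often
    each concept appears among the weak problems (hypothetical rule)."""
    counts = Counter(
        concept
        for pid in weak_ids
        for concept in concepts_by_problem.get(pid, [])
    )
    total = sum(counts.values())
    if total == 0:
        return {}
    return {c: round(total_budget * n / total) for c, n in counts.items()}


# Toy data: per-epoch fraction of sampled responses that were verified correct.
accuracy_history = {
    "p1": [0.0, 0.125, 0.25],   # never learned
    "p2": [0.25, 0.75, 1.0],    # learned during training
    "p3": [0.0, 0.0, 0.125],    # never learned
}
# Toy concept labels; in practice these would be extracted from the problems.
concepts_by_problem = {
    "p1": ["geometry", "number theory"],
    "p3": ["geometry"],
}

weak_ids = find_weak_problems(accuracy_history)
budget = allocate_budget(weak_ids, concepts_by_problem, total_budget=30)
print(weak_ids)
print(budget)
```

In this toy setup, concepts that recur among the persistently failed problems receive a larger share of newly synthesized problems, which mirrors the idea of targeting augmentation at the model's weak areas.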
