7 months ago

Mingqi Wu Zhihao Zhang Qiaole Dong Zhiheng Xi Jun Zhao Senjie Jin Xiaoran Fan Yuhao Zhou Yanwei Fu Qin Liu

Abstract

The reasoning capabilities of large language models (LLMs) have been a longstanding focus of research. Recent works have further enhanced these capabilities using reinforcement learning (RL), with many new methods claiming significant improvements with minimal or no external supervision. Surprisingly, some studies even suggest that random or incorrect reward signals can enhance reasoning performance. However, these breakthroughs are mostly reported on the Qwen2.5 model family and evaluated on well-known benchmarks such as MATH-500, AMC, and AIME, while failing to achieve similar gains on other models like Llama, which warrants further investigation. Our analysis shows that although Qwen2.5 achieves strong mathematical reasoning performance, its pretraining on large-scale web corpora makes it vulnerable to data contamination in popular benchmarks. As a result, results derived from these benchmarks may be unreliable. To address this, we introduce a generator that produces fully synthetic arithmetic problems of arbitrary length and difficulty, yielding a clean dataset we call RandomCalculation. Using these leakage-free datasets, we show that only accurate reward signals consistently improve performance, while noisy or incorrect signals do not. We advocate for evaluating RL methods on uncontaminated benchmarks and across diverse model families to ensure trustworthy conclusions.
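
To make the idea of a leakage-free arithmetic benchmark concrete, here is a minimal sketch of how such a generator and the contrasting reward signals might look. It is not the authors' RandomCalculation generator: the function names (`make_problem`, `accurate_reward`, `random_reward`), the operator set, the operand range, and the 50% coin-flip reward are all assumptions chosen for illustration.

```python
import random

# Hypothetical operator set; the paper's generator may use different operations.
OPS = {
    "+": lambda a, b: a + b,
    "-": lambda a, b: a - b,
    "*": lambda a, b: a * b,
}


def make_problem(num_steps, rng, max_operand=100):
    """Build a fresh arithmetic expression with `num_steps` operations.

    Returns (expression_text, exact_value). Because operands and operators
    are sampled at generation time, the problem cannot have appeared in a
    pretraining corpus except by coincidence.
    """
    value = rng.randint(1, max_operand)
    text = str(value)
    for _ in range(num_steps):
        op = rng.choice(sorted(OPS))
        operand = rng.randint(1, max_operand)
        value = OPS[op](value, operand)
        # Parenthesize so the text evaluates left to right, matching `value`.
        text = f"({text} {op} {operand})"
    return text, value


def accurate_reward(prediction, truth):
    """Reward 1.0 only when the predicted answer equals the exact answer."""
    return 1.0 if prediction == truth else 0.0


def random_reward(prediction, truth, rng):
    """Assumed noisy baseline: a coin flip that ignores correctness."""
    return 1.0 if rng.random() < 0.5 else 0.0


if __name__ == "__main__":
    rng = random.Random(0)
    expression, answer = make_problem(num_steps=5, rng=rng)
    print(expression, "=", answer)
    print("accurate reward for the right answer:", accurate_reward(answer, answer))
```

Raising `num_steps` lengthens the expression, which is one simple way to control difficulty. Comparing RL runs trained with `accurate_reward` against runs trained with `random_reward` on such freshly sampled problems is the kind of experiment the abstract describes.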

Reasoning or Memorization? Unreliable Results of Reinforcement Learning Due to Data Contamination
