4 months ago

Reasoning or Memorization? Unreliable Results of Reinforcement Learning Due to Data Contamination

Abstract

The reasoning capabilities of large language models (LLMs) have been a longstanding focus of research. Recent works have further enhanced these capabilities using reinforcement learning (RL), with many new methods claiming significant improvements with minimal or no external supervision. Surprisingly, some studies even suggest that random or incorrect reward signals can enhance reasoning performance. However, these breakthroughs are mostly reported on the Qwen2.5 model family and evaluated on well-known benchmarks such as MATH-500, AMC, and AIME, while failing to achieve similar gains on other models like Llama, which warrants further investigation. Our analysis shows that although Qwen2.5 achieves strong mathematical reasoning performance, its pretraining on large-scale web corpora makes it vulnerable to data contamination in popular benchmarks. As a result, results derived from these benchmarks may be unreliable. To address this, we introduce a generator that produces fully synthetic arithmetic problems of arbitrary length and difficulty, yielding a clean dataset we call RandomCalculation. Using these leakage-free datasets, we show that only accurate reward signals consistently improve performance, while noisy or incorrect signals do not. We advocate for evaluating RL methods on uncontaminated benchmarks and across diverse model families to ensure trustworthy conclusions.
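
The idea of a synthetic arithmetic generator paired with an exact-match reward can be illustrated with a minimal sketch. This is not the paper's RandomCalculation generator or the code in the listed repository: the function names (`generate_problem`, `evaluate`, `accurate_reward`, `random_reward`), the operator set, and the way length and difficulty are parameterized are all assumptions made for illustration only.

```python
import random

# Hypothetical operator set; the actual generator may support more operations.
OPS = ["+", "-", "*"]


def evaluate(tokens):
    """Evaluate a flat token list such as ['2', '+', '3', '*', '4'].

    Multiplication binds tighter than addition and subtraction, so products
    are folded into the current term before the signed terms are summed.
    """
    terms = [int(tokens[0])]
    signs = [1]
    for i in range(1, len(tokens), 2):
        op, value = tokens[i], int(tokens[i + 1])
        if op == "*":
            terms[-1] *= value
        else:
            signs.append(1 if op == "+" else -1)
            terms.append(value)
    return sum(s * t for s, t in zip(signs, terms))


def generate_problem(num_operands, max_value, rng):
    """Return (expression, answer) for a freshly sampled arithmetic chain.

    num_operands controls length and max_value controls operand size; both
    are illustrative stand-ins for "arbitrary length and difficulty".
    """
    tokens = [str(rng.randint(1, max_value))]
    for _ in range(num_operands - 1):
        tokens += [rng.choice(OPS), str(rng.randint(1, max_value))]
    return " ".join(tokens), evaluate(tokens)


def accurate_reward(prediction, answer):
    """Exact-match reward: 1.0 only when the prediction equals the answer."""
    return 1.0 if prediction == answer else 0.0


def random_reward(rng):
    """Noisy baseline reward that ignores the prediction entirely."""
    return float(rng.random() < 0.5)


if __name__ == "__main__":
    rng = random.Random(0)
    expression, answer = generate_problem(num_operands=5, max_value=99, rng=rng)
    print(expression, "=", answer)
    print("accurate reward for the true answer:", accurate_reward(answer, answer))
```

Because each problem is sampled at generation time and its answer is computed exactly, a dataset built this way is very unlikely to overlap with text seen in pretraining, which is the property the abstract relies on when comparing accurate and noisy reward signals.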

Code Repositories

wumingqi/LLM-Math-Evaluation
Official
Mentioned in GitHub
