7 months ago

Maggie Huan Yuetai Li Tuney Zheng Xiaoyu Xu Seungone Kim Minxin Du Radha Poovendran Graham Neubig Xiang Yue

Abstract

Math reasoning has become the poster child of progress in large language models (LLMs), with new models rapidly surpassing human-level performance on benchmarks like MATH and AIME. But as math leaderboards improve week by week, it is worth asking: do these gains reflect broader problem-solving ability or just narrow overfitting? To answer this question, we evaluate over 20 open-weight reasoning-tuned models across a broad suite of tasks, including math, scientific QA, agent planning, coding, and standard instruction-following. We surprisingly find that most models that succeed in math fail to transfer their gains to other domains. To rigorously study this phenomenon, we conduct controlled experiments on Qwen3-14B models using math-only data but different tuning methods. We find that reinforcement learning (RL)-tuned models generalize well across domains, while supervised fine-tuning (SFT)-tuned models often forget general capabilities. Latent-space representation and token-space distribution shift analyses reveal that SFT induces substantial representation and output drift, while RL preserves general-domain structure. Our results suggest a need to rethink standard post-training recipes, particularly the reliance on SFT-distilled data for advancing reasoning models.
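One way to picture the token-space distribution shift mentioned in the abstract is to compare the next-token distributions of a base model and a tuned model on the same inputs. The sketch below is an illustration only, not the authors' procedure: the choice of KL divergence, its direction (tuned relative to base), the function names, and the toy logits are all assumptions, and the numbers are made up rather than taken from the paper.

```python
import math

def softmax(logits):
    """Convert a list of logits into a probability distribution."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions over the same vocabulary."""
    return sum(pi * math.log((pi + eps) / (qi + eps))
               for pi, qi in zip(p, q) if pi > 0)

def mean_token_kl(base_logits, tuned_logits):
    """Average KL(tuned || base) over token positions.

    Each argument is a list of per-position logit vectors (hypothetical
    inputs; a real analysis would read these from model forward passes).
    """
    kls = [kl_divergence(softmax(t), softmax(b))
           for b, t in zip(base_logits, tuned_logits)]
    return sum(kls) / len(kls)

# Toy logits over a 3-token vocabulary at 2 positions (illustrative only).
base = [[2.0, 1.0, 0.1], [0.5, 2.5, 0.3]]
small_shift = [[2.1, 1.0, 0.1], [0.5, 2.4, 0.3]]   # mild perturbation
large_shift = [[0.1, 1.0, 2.0], [2.5, 0.5, 0.3]]   # preferences reordered

print("small shift:", mean_token_kl(base, small_shift))
print("large shift:", mean_token_kl(base, large_shift))
```

A larger average divergence on general-domain prompts would indicate that tuning moved the output distribution further from the base model, which is the kind of drift the abstract attributes to SFT.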

Supervised Fine-Tuning

Model Training

Multi-Task Learning

Method/Architecture

Does Math Reasoning Improve General LLM Capabilities? Understanding Transferability of LLM Reasoning
