Abstract

Reinforcement learning (RL)-based post-training has been crucial for enabling multi-step reasoning in large reasoning models (LRMs), yet current reward schemes are typically outcome-centric. We propose PM4GRPO, a reasoning-aware variant of Group Relative Policy Optimization (GRPO) that augments the standard answer and format rewards with signals over the reasoning procedure. To this end, process mining techniques are used to compute a scalar conformance reward that measures how closely a policy model's reasoning aligns with that of a pretrained teacher model. Empirical results on five benchmarks show that PM4GRPO significantly outperforms existing GRPO-based post-training methods. These results indicate that leveraging process mining for reasoning-aware GRPO effectively enhances the reasoning capabilities of policy models.
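To make the reward structure described in the abstract concrete, here is a minimal Python sketch. It is illustrative only and rests on several assumptions that the abstract does not confirm: the three signals are summed with equal weight, reasoning traces are represented as lists of step labels, and the conformance score is a toy sequence-similarity stand-in (`difflib.SequenceMatcher`) and not an actual process mining conformance check. All function names and step labels are hypothetical. Only the group-relative standardization of rewards reflects the general GRPO idea.

```python
from difflib import SequenceMatcher
from statistics import mean, pstdev


def conformance_reward(policy_steps, teacher_steps):
    """Toy stand-in (assumption): similarity in [0, 1] between two step sequences.

    A real system would apply a process mining conformance-checking technique here.
    """
    return SequenceMatcher(None, policy_steps, teacher_steps).ratio()


def total_reward(answer_ok, format_ok, policy_steps, teacher_steps):
    # Assumption: answer, format, and conformance signals are summed with equal weight.
    return (
        float(answer_ok)
        + float(format_ok)
        + conformance_reward(policy_steps, teacher_steps)
    )


def group_relative_advantages(rewards, eps=1e-8):
    """Standardize each reward within its sampled group (GRPO-style advantage)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical teacher trace and a group of three sampled policy responses:
# (answer correct?, format correct?, reasoning steps)
teacher = ["parse", "set_up_equation", "solve", "verify"]
group = [
    (True, True, ["parse", "set_up_equation", "solve", "verify"]),
    (True, True, ["parse", "solve"]),
    (False, True, ["parse", "guess"]),
]

rewards = [total_reward(a, f, steps, teacher) for a, f, steps in group]
advantages = group_relative_advantages(rewards)
```

In this sketch the first two responses both answer correctly, yet the one whose reasoning trace matches the teacher more closely receives the larger advantage, which is the kind of distinction an outcome-only reward cannot make.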

Taekhyun Park Yongjae Lee Hyerim Bae

Reasoning-Aware GRPO using Process Mining
