13 days ago

Inpainting-Guided Policy Optimization for Diffusion Large Language Models

Abstract

Masked diffusion large language models (dLLMs) are emerging as promising alternatives to autoregressive LLMs, offering competitive performance while supporting unique generation capabilities such as inpainting. We explore how inpainting can inform RL algorithm design for dLLMs. Aligning LLMs with reinforcement learning faces an exploration challenge: sparse reward signals and sample waste when models fail to discover correct solutions. While this inefficiency affects LLMs broadly, dLLMs offer a distinctive opportunity: their inpainting ability can guide exploration. We introduce IGPO (Inpainting Guided Policy Optimization), an RL framework that strategically inserts partial ground-truth reasoning traces during online sampling. Unlike providing full solutions, inpainting steers exploration toward promising trajectory spaces while preserving self-generated reasoning, bridging supervised fine-tuning and reinforcement learning. We apply IGPO to group-based optimization methods such as GRPO, where exploration failures cause zero advantages and gradients. IGPO restores meaningful gradients while improving sample efficiency. We also propose supervised fine-tuning on synthetically rewritten concise traces that better align with dLLM generation patterns. With additional techniques including entropy-based filtering, our training recipe yields substantial gains across three mathematical benchmarks (GSM8K, Math500, and AMC), achieving new state-of-the-art results for full-attention masked dLLMs.
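
The zero-advantage problem and the inpainting idea mentioned in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the advantage formula is the commonly used group-normalized form (reward minus group mean, divided by group standard deviation), and `MASK`, `group_advantages`, `inpaint_hint`, and `hint_ratio` are hypothetical names introduced here. Revealing tokens at random positions is also an assumption made for brevity; the abstract only says that partial ground-truth reasoning traces are inserted.

```python
import random
from statistics import mean, pstdev

MASK = "<mask>"  # hypothetical mask token, for illustration only


def group_advantages(rewards, eps=1e-6):
    """Group-normalized advantages: (r - group mean) / (group std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def inpaint_hint(trace_tokens, hint_ratio, rng):
    """Reveal roughly hint_ratio of a ground-truth trace; mask the rest.

    Assumption: revealed positions are chosen uniformly at random.
    A masked dLLM would then fill in the MASK positions itself.
    """
    n = len(trace_tokens)
    k = round(hint_ratio * n)
    keep = set(rng.sample(range(n), k))
    return [tok if i in keep else MASK for i, tok in enumerate(trace_tokens)]


# Every sample in the group is wrong: all advantages are zero, so the
# policy-gradient signal for this prompt vanishes.
failed = group_advantages([0.0, 0.0, 0.0, 0.0])

# One sample succeeds (for example, after being guided by a partial hint):
# the advantages are now non-zero and the group contributes a gradient.
mixed = group_advantages([0.0, 0.0, 0.0, 1.0])

# Build a partially revealed reasoning trace to seed generation.
trace = "x = 3 ; y = x + 4 ; answer = 7".split()
hinted = inpaint_hint(trace, hint_ratio=0.25, rng=random.Random(0))
```

In this toy setting, `failed` contains only zeros while `mixed` assigns a positive advantage to the successful sample, which is the effect the abstract describes as restoring meaningful gradients.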
