Inpainting-Guided Policy Optimization for Diffusion Large Language Models

Abstract
Masked diffusion large language models (dLLMs) are emerging as promising alternatives to autoregressive LLMs, offering competitive performance while supporting unique generation capabilities such as inpainting. We explore how inpainting can inform RL algorithm design for dLLMs. Aligning LLMs with reinforcement learning faces an exploration challenge: sparse reward signals and sample waste when models fail to discover correct solutions. While this inefficiency affects LLMs broadly, dLLMs offer a distinctive opportunity--their inpainting ability can guide exploration. We introduce IGPO (Inpainting Guided Policy Optimization), an RL framework that strategically inserts partial ground-truth reasoning traces during online sampling. Unlike providing full solutions, inpainting steers exploration toward promising trajectory spaces while preserving self-generated reasoning, bridging supervised fine-tuning and reinforcement learning. We apply IGPO to group-based optimization methods such as GRPO, where exploration failures cause zero advantages and gradients. IGPO restores meaningful gradients while improving sample efficiency. We also propose supervised fine-tuning on synthetically rewritten concise traces that better align with dLLM generation patterns. With additional techniques including entropy-based filtering, our training recipe yields substantial gains across three mathematical benchmarks--GSM8K, Math500, and AMC--achieving new state-of-the-art results for full-attention masked dLLMs.
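To make the zero-advantage failure mode and the inpainting remedy concrete, the sketch below shows a group-based advantage computation in the style of GRPO and an IGPO-like fallback that conditions part of the group on a partial ground-truth reasoning trace. This is a minimal illustration inferred from the abstract, not the paper's implementation; `generate`, `inpaint`, `reward_fn`, and `hint_fraction` are hypothetical stand-ins.

```python
# Minimal sketch (not the authors' code): GRPO-style group advantages, with an
# IGPO-like inpainting fallback when every sample in the group gets the same reward.

from statistics import mean, pstdev
from typing import Callable, List


def group_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Standardize rewards within a group; identical rewards yield zero advantages."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def igpo_step(
    prompt: str,
    ground_truth_trace: List[str],             # reference reasoning steps for this prompt
    generate: Callable[[str], str],            # dLLM sampler (hypothetical)
    inpaint: Callable[[str, List[str]], str],  # dLLM inpainting sampler (hypothetical)
    reward_fn: Callable[[str], float],         # e.g. answer-correctness reward
    group_size: int = 8,
    hint_fraction: float = 0.25,               # insert only a partial trace, not the full solution
) -> List[float]:
    # 1) Ordinary on-policy group sampling, as in GRPO.
    samples = [generate(prompt) for _ in range(group_size)]
    rewards = [reward_fn(s) for s in samples]

    # 2) If all rewards are identical (e.g. all zero), standardized advantages
    #    vanish and the policy gradient carries no learning signal.
    if max(rewards) == min(rewards):
        # Re-sample part of the group with inpainting, conditioning on a
        # partial ground-truth reasoning trace to steer exploration.
        n_hint = max(1, int(len(ground_truth_trace) * hint_fraction))
        hint = ground_truth_trace[:n_hint]
        for i in range(group_size // 2):
            samples[i] = inpaint(prompt, hint)
            rewards[i] = reward_fn(samples[i])

    # 3) Advantages are now (hopefully) non-degenerate, restoring gradients.
    return group_advantages(rewards)
```

Under these assumptions, only a fraction of the reference trace is inpainted, so the model still has to produce most of the reasoning itself, which is what distinguishes this guidance from supervised fine-tuning on full solutions.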