Inpainting-Guided Policy Optimization for Diffusion Large Language Models

Abstract
Masked diffusion large language models (dLLMs) are emerging as promising alternatives to autoregressive LLMs, offering competitive performance while supporting unique generation capabilities such as inpainting. We explore how inpainting can inform RL algorithm design for dLLMs. Aligning LLMs with reinforcement learning faces an exploration challenge: sparse reward signals and sample waste when models fail to discover correct solutions. While this inefficiency affects LLMs broadly, dLLMs offer a distinctive opportunity--their inpainting ability can guide exploration. We introduce IGPO (Inpainting Guided Policy Optimization), an RL framework that strategically inserts partial ground-truth reasoning traces during online sampling. Unlike providing full solutions, inpainting steers exploration toward promising trajectory spaces while preserving self-generated reasoning, bridging supervised fine-tuning and reinforcement learning. We apply IGPO to group-based optimization methods such as GRPO, where exploration failures cause zero advantages and gradients. IGPO restores meaningful gradients while improving sample efficiency. We also propose supervised fine-tuning on synthetically rewritten concise traces that better align with dLLM generation patterns. With additional techniques including entropy-based filtering, our training recipe yields substantial gains across three mathematical benchmarks--GSM8K, Math500, and AMC--achieving new state-of-the-art results for full-attention masked dLLMs.
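To make the zero-advantage failure mode and the inpainting remedy concrete, the sketch below shows a group-based advantage computation in the style of GRPO and an IGPO-like fallback that conditions part of the group on a partial ground-truth reasoning trace. This is a minimal illustration inferred from the abstract, not the paper's implementation; `generate`, `inpaint`, `reward_fn`, and `hint_fraction` are hypothetical stand-ins.

```python
# Minimal sketch (not the authors' code): GRPO-style group advantages, with an
# IGPO-like inpainting fallback when every sample in the group gets the same reward.

from statistics import mean, pstdev
from typing import Callable, List


def group_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Standardize rewards within a group; identical rewards yield zero advantages."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def igpo_step(
    prompt: str,
    ground_truth_trace: List[str],             # reference reasoning steps for this prompt
    generate: Callable[[str], str],            # dLLM sampler (hypothetical)
    inpaint: Callable[[str, List[str]], str],  # dLLM inpainting sampler (hypothetical)
    reward_fn: Callable[[str], float],         # e.g. answer-correctness reward
    group_size: int = 8,
    hint_fraction: float = 0.25,               # insert only a partial trace, not the full solution
) -> List[float]:
    # 1) Ordinary on-policy group sampling, as in GRPO.
    samples = [generate(prompt) for _ in range(group_size)]
    rewards = [reward_fn(s) for s in samples]

    # 2) If all rewards are identical (e.g. all zero), standardized advantages
    #    vanish and the policy gradient carries no learning signal.
    if max(rewards) == min(rewards):
        # Re-sample part of the group with inpainting, conditioning on a
        # partial ground-truth reasoning trace to steer exploration.
        n_hint = max(1, int(len(ground_truth_trace) * hint_fraction))
        hint = ground_truth_trace[:n_hint]
        for i in range(group_size // 2):
            samples[i] = inpaint(prompt, hint)
            rewards[i] = reward_fn(samples[i])

    # 3) Advantages are now (hopefully) non-degenerate, restoring gradients.
    return group_advantages(rewards)
```

Under these assumptions, only a fraction of the reference trace is inpainted, so the model still has to produce most of the reasoning itself, which is what distinguishes this guidance from supervised fine-tuning on full solutions.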