8 months ago

Abstract

Ultra-long generation by large language models (LLMs) is a widely demanded scenario, yet it remains a significant challenge due to their maximum generation length limit and overall quality degradation as sequence length increases. Previous approaches, exemplified by LongWriter, typically rely on "teaching", which involves supervised fine-tuning (SFT) on synthetic long-form outputs. However, this strategy heavily depends on synthetic SFT data, which is difficult and costly to construct, often lacks coherence and consistency, and tends to be overly artificial and structurally monotonous. In this work, we propose an incentivization-based approach that, starting entirely from scratch and without relying on any annotated or synthetic data, leverages reinforcement learning (RL) to foster the emergence of ultra-long, high-quality text generation capabilities in LLMs. We perform RL training starting from a base model, similar to R1-Zero, guiding it to engage in reasoning that facilitates planning and refinement during the writing process. To support this, we employ specialized reward models that steer the LLM towards improved length control, writing quality, and structural formatting. Experimental evaluations show that our LongWriter-Zero model, trained from Qwen2.5-32B, consistently outperforms traditional SFT methods on long-form writing tasks, achieving state-of-the-art results across all metrics on WritingBench and Arena-Write, and even surpassing 100B+ models such as DeepSeek R1 and Qwen3-235B. We open-source our data and model checkpoints under https://huggingface.co/THU-KEG/LongWriter-Zero-32B
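
The abstract names three reward signals (length control, writing quality, structural formatting) without giving their form. The sketch below is purely illustrative of how such signals could be combined into one scalar reward for RL; it is not the method used for LongWriter-Zero. The function names, the tolerance band, the `<think>...</think>` layout, the equal weights, and the `quality_fn` callable (a stand-in for a learned reward model) are all assumptions introduced here.

```python
from typing import Callable, Tuple


def length_reward(n_words: int, target: int, tolerance: float = 0.2) -> float:
    """Hypothetical length score: 1.0 inside a band around the target,
    decaying linearly towards 0.0 outside it."""
    low, high = target * (1 - tolerance), target * (1 + tolerance)
    if low <= n_words <= high:
        return 1.0
    if n_words < low:
        return max(0.0, n_words / low)
    return max(0.0, 1.0 - (n_words - high) / high)


def format_reward(text: str) -> float:
    """Hypothetical format score: 1.0 if the output opens with a single
    <think>...</think> block followed by a non-empty answer, else 0.0."""
    stripped = text.strip()
    if not stripped.startswith("<think>") or stripped.count("</think>") != 1:
        return 0.0
    answer = stripped.split("</think>", 1)[1].strip()
    return 1.0 if answer else 0.0


def composite_reward(
    text: str,
    target_words: int,
    quality_fn: Callable[[str], float],
    weights: Tuple[float, float, float] = (1 / 3, 1 / 3, 1 / 3),
) -> float:
    """Weighted sum of length, format, and quality scores (weights assumed)."""
    # Score only the text after the reasoning block, if one is present.
    answer = text.split("</think>", 1)[-1]
    r_len = length_reward(len(answer.split()), target_words)
    r_fmt = format_reward(text)
    r_qual = quality_fn(answer)  # placeholder for a learned quality model
    w_len, w_fmt, w_qual = weights
    return w_len * r_len + w_fmt * r_fmt + w_qual * r_qual
```

In an RL loop, a scalar of this kind would be computed per sampled completion and passed to the policy-gradient update. A weighted sum is only one possible design; the abstract does not say how the individual rewards are aggregated.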
