HyperAI
SeerAttention-R: Sparse Attention Adaptation for Long Reasoning

Abstract

We introduce SeerAttention-R, a sparse attention framework specifically tailored for the long decoding of reasoning models. Extended from SeerAttention, SeerAttention-R retains the design of learning attention sparsity through a self-distilled gating mechanism, while removing query pooling to accommodate auto-regressive decoding. With a lightweight plug-in gating, SeerAttention-R is flexible and can be easily integrated into existing pretrained models without modifying the original parameters. We demonstrate that SeerAttention-R, trained on just 0.4B tokens, maintains near-lossless reasoning accuracy with a 4K token budget on the AIME benchmark under large sparse attention block sizes (64/128). Using TileLang, we develop a highly optimized sparse decoding kernel that achieves near-theoretical speedups of up to 9x over FlashAttention-3 on an H100 GPU at 90% sparsity. Code is available at: https://github.com/microsoft/SeerAttention.
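To make the idea concrete, here is a minimal NumPy sketch of one auto-regressive decoding step with block-sparse attention: a gate scores each K/V block and only the top-scoring blocks within the token budget enter the attention computation. The mean-pooled-key gate below is an illustrative stand-in, not SeerAttention-R's learned self-distilled gating, and all names are hypothetical.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def block_sparse_decode_attention(q, K, V, block_size=4, budget_blocks=2):
    """One decoding step with block-sparse attention (illustrative sketch).

    A gate scores each K/V block; here the score is q . mean(K_block),
    a stand-in for a learned plug-in gate. Only the top `budget_blocks`
    blocks participate in the attention computation.
    """
    n, d = K.shape
    num_blocks = n // block_size
    # Gate: score each block by the dot product of q with its pooled keys.
    pooled = K[: num_blocks * block_size].reshape(num_blocks, block_size, d).mean(axis=1)
    gate_scores = pooled @ q
    selected = np.sort(np.argsort(gate_scores)[-budget_blocks:])
    # Gather token indices of the selected blocks, then run dense
    # attention over that subset only.
    idx = np.concatenate(
        [np.arange(b * block_size, (b + 1) * block_size) for b in selected]
    )
    scores = (K[idx] @ q) / np.sqrt(d)
    return softmax(scores) @ V[idx]

# Usage: with the budget covering all blocks, the sparse path reduces
# to ordinary dense attention over the full K/V cache.
rng = np.random.default_rng(0)
q = rng.standard_normal(8)
K = rng.standard_normal((16, 8))
V = rng.standard_normal((16, 8))
sparse_out = block_sparse_decode_attention(q, K, V, block_size=4, budget_blocks=2)
full_out = block_sparse_decode_attention(q, K, V, block_size=4, budget_blocks=4)
dense_out = softmax((K @ q) / np.sqrt(8)) @ V
```

Because the gate works at block granularity (64/128 tokens in the paper), the kernel can skip whole K/V tiles, which is what yields near-theoretical speedups at high sparsity.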

Code Repositories

microsoft/seerattention (official, PyTorch)

