HyperAI
SeerAttention-R: Sparse Attention Adaptation for Long Reasoning

Abstract

We introduce SeerAttention-R, a sparse attention framework specifically tailored for the long decoding of reasoning models. Extended from SeerAttention, SeerAttention-R retains the design of learning attention sparsity through a self-distilled gating mechanism, while removing query pooling to accommodate auto-regressive decoding. With a lightweight plug-in gating, SeerAttention-R is flexible and can be easily integrated into existing pretrained models without modifying the original parameters. We demonstrate that SeerAttention-R, trained on just 0.4B tokens, maintains near-lossless reasoning accuracy with a 4K token budget on the AIME benchmark under large sparse attention block sizes (64/128). Using TileLang, we develop a highly optimized sparse decoding kernel that achieves near-theoretical speedups of up to 9x over FlashAttention-3 on an H100 GPU at 90% sparsity. Code is available at: https://github.com/microsoft/SeerAttention.
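To make the idea concrete, here is a minimal NumPy sketch of one auto-regressive decoding step with block-sparse attention: a gate scores each K/V block and only the top-scoring blocks within the token budget enter the attention computation. The mean-pooled-key gate below is an illustrative stand-in, not SeerAttention-R's learned self-distilled gating, and all names are hypothetical.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def block_sparse_decode_attention(q, K, V, block_size=4, budget_blocks=2):
    """One decoding step with block-sparse attention (illustrative sketch).

    A gate scores each K/V block; here the score is q . mean(K_block),
    a stand-in for a learned plug-in gate. Only the top `budget_blocks`
    blocks participate in the attention computation.
    """
    n, d = K.shape
    num_blocks = n // block_size
    # Gate: score each block by the dot product of q with its pooled keys.
    pooled = K[: num_blocks * block_size].reshape(num_blocks, block_size, d).mean(axis=1)
    gate_scores = pooled @ q
    selected = np.sort(np.argsort(gate_scores)[-budget_blocks:])
    # Gather token indices of the selected blocks, then run dense
    # attention over that subset only.
    idx = np.concatenate(
        [np.arange(b * block_size, (b + 1) * block_size) for b in selected]
    )
    scores = (K[idx] @ q) / np.sqrt(d)
    return softmax(scores) @ V[idx]

# Usage: with the budget covering all blocks, the sparse path reduces
# to ordinary dense attention over the full K/V cache.
rng = np.random.default_rng(0)
q = rng.standard_normal(8)
K = rng.standard_normal((16, 8))
V = rng.standard_normal((16, 8))
sparse_out = block_sparse_decode_attention(q, K, V, block_size=4, budget_blocks=2)
full_out = block_sparse_decode_attention(q, K, V, block_size=4, budget_blocks=4)
dense_out = softmax((K @ q) / np.sqrt(8)) @ V
```

Because the gate works at block granularity (64/128 tokens in the paper), the kernel can skip whole K/V tiles, which is what yields near-theoretical speedups at high sparsity.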

Code Repositories

microsoft/seerattention (official, PyTorch)

