Every Attention Matters: An Efficient Hybrid Architecture for Long-Context Reasoning

Abstract

In this technical report, we present the Ring-linear model series, specifically including Ring-mini-linear-2.0 and Ring-flash-linear-2.0. Ring-mini-linear-2.0 comprises 16B parameters and 957M activations, while Ring-flash-linear-2.0 contains 104B parameters and 6.1B activations. Both models adopt a hybrid architecture that effectively integrates linear attention and softmax attention, significantly reducing I/O and computational overhead in long-context inference scenarios. Compared to a 32 billion parameter dense model, this series reduces inference cost to 1/10, and compared to the original Ring series, the cost is also reduced by over 50%. Furthermore, through systematic exploration of the ratio between different attention mechanisms in the hybrid architecture, we have identified the currently optimal model structure. Additionally, by leveraging our self-developed high-performance FP8 operator library, linghe, overall training efficiency has been improved by 50%. Benefiting from the high alignment between the training and inference engine operators, the models can undergo long-term, stable, and highly efficient optimization during the reinforcement learning phase, consistently maintaining SOTA performance across multiple challenging complex reasoning benchmarks.
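The abstract describes a hybrid stack that interleaves linear (kernel-based, O(n)) attention with standard softmax attention at a tuned ratio. The PyTorch sketch below illustrates one way such interleaving can be expressed; it is not the authors' implementation. The names (HybridStack, AttentionLayer, linear_attention, softmax_every), the elu(x)+1 feature map, and the 1:3 layer ratio are illustrative assumptions, and causal masking is omitted for brevity.

```python
# Minimal sketch of a hybrid linear/softmax attention stack (assumed design,
# not the Ring-linear implementation). Hypothetical names throughout.
import torch
import torch.nn as nn
import torch.nn.functional as F


def linear_attention(q, k, v):
    """Kernelized linear attention, O(n) in sequence length.

    Uses the common elu(x)+1 feature map; the paper's exact linear
    attention variant is not specified in the abstract. Non-causal.
    """
    q = F.elu(q) + 1.0                                   # (B, H, N, D)
    k = F.elu(k) + 1.0
    kv = torch.einsum("bhnd,bhne->bhde", k, v)           # accumulate K^T V
    z = 1.0 / (torch.einsum("bhnd,bhd->bhn", q, k.sum(dim=2)) + 1e-6)
    return torch.einsum("bhnd,bhde,bhn->bhne", q, kv, z)


def softmax_attention(q, k, v):
    """Standard scaled-dot-product (softmax) attention, O(n^2)."""
    return F.scaled_dot_product_attention(q, k, v)


class AttentionLayer(nn.Module):
    def __init__(self, dim, n_heads, use_linear):
        super().__init__()
        self.n_heads, self.head_dim = n_heads, dim // n_heads
        self.qkv = nn.Linear(dim, 3 * dim)
        self.proj = nn.Linear(dim, dim)
        self.attn_fn = linear_attention if use_linear else softmax_attention

    def forward(self, x):                                # x: (B, N, dim)
        B, N, _ = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        shape = (B, N, self.n_heads, self.head_dim)
        q, k, v = (t.view(shape).transpose(1, 2) for t in (q, k, v))
        out = self.attn_fn(q, k, v).transpose(1, 2).reshape(B, N, -1)
        return x + self.proj(out)                        # residual connection


class HybridStack(nn.Module):
    """Interleave linear and softmax attention at a fixed ratio,
    e.g. softmax_every=4 gives 1 softmax layer per 3 linear layers."""
    def __init__(self, dim=256, n_heads=8, n_layers=8, softmax_every=4):
        super().__init__()
        self.layers = nn.ModuleList(
            AttentionLayer(dim, n_heads,
                           use_linear=(i % softmax_every != softmax_every - 1))
            for i in range(n_layers)
        )

    def forward(self, x):
        for layer in self.layers:
            x = layer(x)
        return x


if __name__ == "__main__":
    model = HybridStack()
    x = torch.randn(2, 1024, 256)
    print(model(x).shape)  # torch.Size([2, 1024, 256])
```

Because most layers avoid the quadratic softmax computation and its KV-cache I/O, a stack of this shape is what enables the long-context inference-cost reductions the abstract reports; the optimal linear-to-softmax ratio is the quantity the authors say they explored systematically.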
