Beyond Homogeneous Attention: Memory-Efficient LLMs via Fourier-Approximated KV Cache

Abstract

Large Language Models struggle with the memory demands of the growing Key-Value (KV) cache as context lengths increase. Existing compression methods homogenize head dimensions or rely on attention-guided token pruning, often sacrificing accuracy or introducing computational overhead. We propose FourierAttention, a training-free framework that exploits the heterogeneous roles of transformer head dimensions: lower dimensions prioritize local context, while upper ones capture long-range dependencies. By projecting the long-context-insensitive dimensions onto orthogonal Fourier bases, FourierAttention approximates their temporal evolution with fixed-length spectral coefficients. Evaluations on LLaMA models show that FourierAttention achieves the best long-context accuracy on LongBench and Needle-In-A-Haystack (NIAH). In addition, a custom Triton kernel, FlashFourierAttention, is designed to optimize memory via streamlined read-write operations, enabling efficient deployment without compromising performance.
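
The core mechanism described in the abstract is projecting the time axis of the long-context-insensitive KV dimensions onto a fixed number of orthogonal Fourier bases, so the cache for those dimensions keeps a constant size regardless of context length. The sketch below illustrates that idea only, assuming an orthonormal DCT-style cosine basis and hypothetical helper names (`dct_basis`, `fourier_compress`, `fourier_decompress`); the paper's actual basis construction, dimension-selection rule, and FlashFourierAttention Triton kernel are not reproduced here.

```python
import math
import torch

def dct_basis(num_coeffs: int, seq_len: int, dtype=torch.float32) -> torch.Tensor:
    """First `num_coeffs` rows of an orthonormal DCT-II basis over `seq_len` steps."""
    t = torch.arange(seq_len, dtype=dtype)
    k = torch.arange(num_coeffs, dtype=dtype)
    basis = torch.cos(math.pi * (t[None, :] + 0.5) * k[:, None] / seq_len)
    basis = basis * math.sqrt(2.0 / seq_len)
    basis[0] = basis[0] / math.sqrt(2.0)  # k = 0 row gets the usual orthonormal scaling
    return basis                          # shape: (num_coeffs, seq_len)

def fourier_compress(kv_slice: torch.Tensor, num_coeffs: int) -> torch.Tensor:
    """Project a (seq_len, dim) KV-cache slice onto fixed-length spectral coefficients."""
    basis = dct_basis(num_coeffs, kv_slice.shape[0], kv_slice.dtype)
    return basis @ kv_slice               # shape: (num_coeffs, dim)

def fourier_decompress(coeffs: torch.Tensor, seq_len: int) -> torch.Tensor:
    """Reconstruct an approximation of the original (seq_len, dim) slice."""
    basis = dct_basis(coeffs.shape[0], seq_len, coeffs.dtype)
    return basis.T @ coeffs               # shape: (seq_len, dim)

# Toy usage: 4096 cached timesteps of 64 "long-context-insensitive" dimensions
# are stored as 128 spectral coefficients, a 32x reduction along the time axis.
cache = torch.randn(4096, 64)
spectral = fourier_compress(cache, num_coeffs=128)
approx = fourier_decompress(spectral, seq_len=4096)
print(spectral.shape, approx.shape)  # torch.Size([128, 64]) torch.Size([4096, 64])
```

Because the assumed basis is orthonormal, truncating it to the first `num_coeffs` rows gives the least-squares approximation of the slice's temporal evolution within that subspace, which is what lets the compressed representation stay fixed-length as the context grows.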
