Beyond Homogeneous Attention: Memory-Efficient LLMs via Fourier-Approximated KV Cache

Abstract

Large Language Models struggle with the memory demands of the growing Key-Value (KV) cache as context lengths increase. Existing compression methods homogenize head dimensions or rely on attention-guided token pruning, often sacrificing accuracy or introducing computational overhead. We propose FourierAttention, a training-free framework that exploits the heterogeneous roles of transformer head dimensions: lower dimensions prioritize local context, while upper ones capture long-range dependencies. By projecting the long-context-insensitive dimensions onto orthogonal Fourier bases, FourierAttention approximates their temporal evolution with fixed-length spectral coefficients. Evaluations on LLaMA models show that FourierAttention achieves the best long-context accuracy on LongBench and Needle-In-A-Haystack (NIAH). In addition, a custom Triton kernel, FlashFourierAttention, is designed to optimize memory via streamlined read-write operations, enabling efficient deployment without compromising performance.
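
The core mechanism described in the abstract is projecting the time axis of the long-context-insensitive KV dimensions onto a fixed number of orthogonal Fourier bases, so the cache for those dimensions keeps a constant size regardless of context length. The sketch below illustrates that idea only, assuming an orthonormal DCT-style cosine basis and hypothetical helper names (`dct_basis`, `fourier_compress`, `fourier_decompress`); the paper's actual basis construction, dimension-selection rule, and FlashFourierAttention Triton kernel are not reproduced here.

```python
import math
import torch

def dct_basis(num_coeffs: int, seq_len: int, dtype=torch.float32) -> torch.Tensor:
    """First `num_coeffs` rows of an orthonormal DCT-II basis over `seq_len` steps."""
    t = torch.arange(seq_len, dtype=dtype)
    k = torch.arange(num_coeffs, dtype=dtype)
    basis = torch.cos(math.pi * (t[None, :] + 0.5) * k[:, None] / seq_len)
    basis = basis * math.sqrt(2.0 / seq_len)
    basis[0] = basis[0] / math.sqrt(2.0)  # k = 0 row gets the usual orthonormal scaling
    return basis                          # shape: (num_coeffs, seq_len)

def fourier_compress(kv_slice: torch.Tensor, num_coeffs: int) -> torch.Tensor:
    """Project a (seq_len, dim) KV-cache slice onto fixed-length spectral coefficients."""
    basis = dct_basis(num_coeffs, kv_slice.shape[0], kv_slice.dtype)
    return basis @ kv_slice               # shape: (num_coeffs, dim)

def fourier_decompress(coeffs: torch.Tensor, seq_len: int) -> torch.Tensor:
    """Reconstruct an approximation of the original (seq_len, dim) slice."""
    basis = dct_basis(coeffs.shape[0], seq_len, coeffs.dtype)
    return basis.T @ coeffs               # shape: (seq_len, dim)

# Toy usage: 4096 cached timesteps of 64 "long-context-insensitive" dimensions
# are stored as 128 spectral coefficients, a 32x reduction along the time axis.
cache = torch.randn(4096, 64)
spectral = fourier_compress(cache, num_coeffs=128)
approx = fourier_decompress(spectral, seq_len=4096)
print(spectral.shape, approx.shape)  # torch.Size([128, 64]) torch.Size([4096, 64])
```

Because the assumed basis is orthonormal, truncating it to the first `num_coeffs` rows gives the least-squares approximation of the slice's temporal evolution within that subspace, which is what lets the compressed representation stay fixed-length as the context grows.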
