
SparseMM, a Strategy for KV-Cache Optimization Using Visual Head Sparsity


SparseMM is a key-value (KV) cache optimization strategy that exploits the sparsity of visual heads in multimodal large language models (MLLMs). It was proposed by the Intelligent Vision Laboratory of Tsinghua University and the Tencent Hunyuan X Group on June 5, 2025. SparseMM allocates an asymmetric computing budget to each attention head in the large language model according to its visual score. The work is described in the paper "SparseMM: Head Sparsity Emerges from Visual Concept Responses in MLLMs".
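As a rough illustration of the core idea, and not the authors' actual implementation, the sketch below assumes each attention head comes with a precomputed visual score and splits a fixed total KV-cache budget across heads in proportion to those scores, with a small guaranteed minimum per head. The function name, the minimum-slot rule, and the proportional split are all illustrative assumptions.

```python
import numpy as np

def allocate_kv_budget(visual_scores, total_budget, min_per_head=8):
    """Distribute a total KV-cache budget across attention heads.

    Heads with higher visual scores receive proportionally more cache
    slots; every head keeps at least `min_per_head` slots. The names
    and the allocation rule here are illustrative assumptions, not the
    paper's exact formulation.
    """
    scores = np.asarray(visual_scores, dtype=np.float64)
    num_heads = scores.size

    # Reserve the guaranteed minimum, then split the remainder by score.
    remaining = total_budget - min_per_head * num_heads
    if remaining < 0:
        raise ValueError("total_budget too small for the per-head minimum")

    if scores.sum() > 0:
        weights = scores / scores.sum()
    else:
        weights = np.full(num_heads, 1.0 / num_heads)
    budgets = min_per_head + np.floor(weights * remaining).astype(int)

    # Hand any rounding leftovers to the highest-scoring heads.
    leftover = total_budget - int(budgets.sum())
    for idx in np.argsort(-scores)[:leftover]:
        budgets[idx] += 1
    return budgets

# Example: four heads, one strongly visual head, 64 cache slots in total.
print(allocate_kv_budget([0.9, 0.1, 0.05, 0.05], total_budget=64))
```

With these inputs the strongly visual head ends up with roughly half the budget while the weakly visual heads stay near the minimum, which is the asymmetry the method relies on.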

Compared with previous methods, SparseMM prioritizes and preserves visual semantics during decoding. Extensive evaluations on mainstream multimodal benchmarks show that SparseMM achieves a better accuracy-efficiency trade-off: in efficiency tests it delivers up to a 1.38x real-time speedup and a 52% reduction in memory usage while maintaining comparable performance.
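To show how such per-head budgets might be applied at decode time, here is a hypothetical sketch that trims each head's KV cache independently to its allocated length. Keeping the most recent entries is an illustrative eviction choice; SparseMM's actual selection criterion may differ.

```python
import torch

def trim_kv_cache(keys, values, budgets):
    """Trim each head's KV cache to its allocated budget.

    keys/values: tensors of shape (num_heads, seq_len, head_dim).
    budgets: per-head cache sizes, e.g. from a visual-score-based
    allocator. Retaining the most recent entries is an illustrative
    policy, not necessarily the paper's eviction criterion.
    """
    trimmed = []
    for h, budget in enumerate(budgets):
        k = keys[h, -budget:]   # keep the `budget` most recent keys
        v = values[h, -budget:]
        trimmed.append((k, v))
    return trimmed

# Example: 4 heads, a 128-token cache, uneven per-head budgets.
keys = torch.randn(4, 128, 64)
values = torch.randn(4, 128, 64)
for h, (k, v) in enumerate(trim_kv_cache(keys, values, [35, 11, 9, 9])):
    print(f"head {h}: cache length {k.shape[0]}")
```

Because each head's cache shrinks to its own budget rather than a uniform length, memory savings concentrate on heads that contribute little to visual understanding, consistent with the reported accuracy-efficiency trade-off.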
