arxiv:2602.03152

FASA: Frequency-aware Sparse Attention

Published on Feb 3 · Submitted by xiaochonglinghu on Feb 5
Abstract

FASA is a novel framework that uses query-aware token eviction and functional sparsity in RoPE to reduce KV cache memory usage while maintaining high performance in long-context LLM tasks.

AI-generated summary

The deployment of Large Language Models (LLMs) faces a critical bottleneck when handling lengthy inputs: the prohibitive memory footprint of the Key-Value (KV) cache. To address this bottleneck, the token-pruning paradigm leverages attention sparsity to selectively retain a small, critical subset of tokens. However, existing approaches fall short: static methods risk irreversible information loss, while dynamic strategies rely on heuristics that insufficiently capture the query-dependent nature of token importance. We propose FASA, a novel framework that achieves query-aware token eviction by dynamically predicting token importance. FASA stems from a novel insight into RoPE: the discovery of functional sparsity at the frequency-chunk (FC) level. Our key finding is that a small, identifiable subset of "dominant" FCs consistently exhibits high contextual agreement with the full attention head, providing a robust and computationally free proxy for identifying salient tokens. Building on this insight, FASA first identifies a critical set of tokens using the dominant FCs, and then performs focused attention computation solely on this pruned subset. By accessing only a small fraction of the KV cache, FASA drastically lowers memory-bandwidth requirements and computational cost. Across a spectrum of long-context tasks, from sequence modeling to complex CoT reasoning, FASA consistently outperforms all token-eviction baselines and achieves near-oracle accuracy, demonstrating remarkable robustness even under constrained budgets. Notably, on LongBench-V1, FASA reaches nearly 100% of full-KV performance while keeping only 256 tokens, and achieves a 2.56x speedup using just 18.9% of the cache on AIME24.
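As a rough illustration of the idea, the sketch below shows what a FASA-style decoding step could look like: proxy attention scores are computed from a few dominant frequency chunks of the RoPE-rotated query and keys, the top-scoring tokens are kept within a fixed budget, and full attention is then evaluated only over that subset. This is a minimal sketch under assumptions, not the authors' implementation; the function name `fasa_sparse_attention`, the `dominant_fcs` indices (assumed to be chosen offline per head), and the single-head, single-query layout are hypothetical simplifications.

```python
# Minimal sketch of FASA-style query-aware token selection (not the authors' code).
# Assumes q and k already have RoPE applied and the head dimension is split into
# contiguous frequency chunks (FCs) of size `chunk_size`; `dominant_fcs` lists the
# chunk indices used as a cheap proxy for full attention scores.
import torch

def fasa_sparse_attention(q, k, v, dominant_fcs, chunk_size=8, budget=256):
    # q: (1, d) current decoding query; k, v: (T, d) cached keys/values.
    d = q.shape[-1]

    # Gather only the dominant frequency chunks of the query and keys.
    idx = torch.cat([torch.arange(c * chunk_size, (c + 1) * chunk_size)
                     for c in dominant_fcs])
    q_fc, k_fc = q[..., idx], k[..., idx]

    # Proxy importance scores computed from the dominant FCs alone: (T,).
    proxy = (q_fc @ k_fc.transpose(-1, -2)).squeeze(0)

    # Keep only the top-`budget` tokens under the proxy ranking.
    keep = torch.topk(proxy, k=min(budget, k.shape[0])).indices

    # Focused attention over the pruned subset, using the full head dimension.
    scores = (q @ k[keep].transpose(-1, -2)) / d ** 0.5   # (1, budget)
    attn = torch.softmax(scores, dim=-1)
    return attn @ v[keep]                                  # (1, d)

# Example usage: 4096 cached tokens, head dim 128, a 256-token budget,
# and three arbitrarily chosen dominant chunks.
T, d = 4096, 128
q, k, v = torch.randn(1, d), torch.randn(T, d), torch.randn(T, d)
out = fasa_sparse_attention(q, k, v, dominant_fcs=[0, 3, 7], budget=256)
```

Because only the dominant-FC slices of the keys are read during scoring and only the selected subset of the KV cache is read during the focused pass, a kernel built along these lines would touch a small fraction of the cache per decoding step, which is where the memory-bandwidth savings described in the abstract would come from.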

Community

[ICLR26] A very interesting and effective work to speed up the inference of large models!

This work precisely targets the most critical and practical bottleneck in long-context inference—the KV-cache memory-bandwidth pressure during decoding. Compared to optimizations that focus solely on FLOPs, its memory-traffic-oriented approach is more aligned with real-world deployment and end-to-end speedups. Overall, this is an excellent piece of research and I enjoyed reading it.

