RLKV

Which Heads Matter for Reasoning?
RL-Guided KV Cache Compression

¹Westlake University, ²McGill University, ³Mila, ⁴Zhejiang University, and ⁵MBZUAI

*Corresponding author: wanghuan [at] westlake [dot] edu [dot] cn

Abstract

Reasoning large language models exhibit complex reasoning behaviors through extended chain-of-thought generation, which creates unprecedented Key-Value (KV) cache overhead during the decoding phase. Existing KV cache compression methods underperform on reasoning models: token-dropping methods break reasoning integrity by discarding critical information, while head-reallocating methods mistakenly compress reasoning-critical heads because they are designed for retrieval tasks, resulting in significant performance degradation as compression rates increase. We hypothesize that KV heads exhibit functional heterogeneity in reasoning models: some heads are critical for chain-of-thought consistency, while others are compressible. To validate and exploit this insight, we propose RLKV, a novel reasoning-critical head identification framework that uses reinforcement learning to directly optimize the relationship between each head's cache usage and reasoning quality. Because RLKV produces rewards from actual generated samples during training, it naturally identifies heads relevant to reasoning behaviors. We then allocate a full KV cache to these heads and apply a compressed, constant-size KV cache to the others for efficient inference. Our experiments reveal that only a small fraction of attention heads is essential for reasoning, enabling our KV compression approach to outperform baseline methods while achieving 20-50% cache reduction with near-lossless performance compared to uncompressed results.

Overview of RLKV

Our method uses RL to identify reasoning heads. The RL pipeline naturally captures reasoning behaviors because it samples the current model's generations to produce reward signals; the reward function evaluates these samples to assess reasoning quality. We employ L × H learnable gating adapters, one per head, to mix full attention and local attention, quantifying each head's reliance on full versus local KV cache access. We apply an L1 penalty to encourage adapter sparsity, while RL optimizes the adapters to preserve reasoning behaviors. After training, we identify the heads with high adapter values as reasoning heads and allocate a full KV cache to them, while applying a compressed KV cache to the others for efficient inference.
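The gating idea described above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names (`attend`, `gated_head_output`, `l1_penalty`) are hypothetical, "local attention" is assumed here to mean attending only to a few initial tokens plus a recent window, and the sketch handles a single query for a single head. The RL reward that drives the gates during training is not modeled.

```python
import math

def softmax(scores):
    # Numerically stable softmax over a list of floats.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, keys, values, visible):
    # Scaled dot-product attention for one query over the positions in `visible`.
    d = len(query)
    scores = [sum(q * k for q, k in zip(query, keys[i])) / math.sqrt(d)
              for i in visible]
    weights = softmax(scores)
    dim = len(values[0])
    return [sum(w * values[i][j] for w, i in zip(weights, visible))
            for j in range(dim)]

def gated_head_output(query, keys, values, gate, sink=1, window=2):
    # Mix full attention with local attention for one head.
    # gate = 1.0 -> the head relies entirely on the full KV cache;
    # gate = 0.0 -> the head relies entirely on the local (compressed) cache.
    # ASSUMPTION: "local" = first `sink` tokens + last `window` tokens.
    n = len(keys)
    full = attend(query, keys, values, list(range(n)))
    local_idx = sorted(set(range(min(sink, n))) | set(range(max(0, n - window), n)))
    local = attend(query, keys, values, local_idx)
    return [gate * f + (1.0 - gate) * l for f, l in zip(full, local)]

def l1_penalty(gates, coeff=0.01):
    # L1 regularizer over the L x H grid of gates; pushes gates toward zero
    # so that only heads that need full cache access keep a high gate value.
    return coeff * sum(abs(g) for row in gates for g in row)
```

In this sketch, a head whose gate stays high despite the L1 penalty is one whose full-cache access the training signal could not give up, which matches the description of reasoning heads as those with high adapter values.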

Main Results

Performance comparison of RLKV against KV cache compression baselines across reasoning benchmarks. We evaluate RLKV (Ours) and existing methods on two reasoning models (Llama-3.1-8B-R1 and Qwen-2.5-7B-R1) across four benchmarks (GSM8K, MATH, AIME24, MBPP) at sparsity levels of 0.2, 0.4, 0.6, and 0.8. RLKV consistently outperforms all baselines across different sparsity levels, demonstrating particularly strong advantages at high sparsity levels (0.4 or 0.6) where competing methods suffer significant performance degradation.
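To make the sparsity levels concrete, the following sketch shows one way head selection could work once gate values are learned. It rests on assumptions not stated on this page: sparsity is taken to mean the fraction of heads that receive the compressed cache, the compressed cache is modeled as a fixed number of initial plus recent tokens, and the names `select_reasoning_heads` and `cached_tokens` and the default `sink` and `window` sizes are hypothetical.

```python
def select_reasoning_heads(gates, sparsity):
    # gates: L x H nested list of learned gate values.
    # Returns an L x H grid of booleans: True = keep the full KV cache.
    # ASSUMPTION: `sparsity` is the fraction of heads that get compressed.
    flat = [(g, layer, head)
            for layer, row in enumerate(gates)
            for head, g in enumerate(row)]
    n_compress = round(sparsity * len(flat))
    # Lowest-gate heads are the ones compressed first.
    flat.sort(key=lambda t: t[0])
    keep = [[True] * len(row) for row in gates]
    for _, layer, head in flat[:n_compress]:
        keep[layer][head] = False
    return keep

def cached_tokens(keep, seq_len, sink=16, window=64):
    # Total cached token slots summed over heads: full-cache heads store
    # every token, compressed heads store at most sink + window tokens.
    local = min(seq_len, sink + window)
    return sum(seq_len if k else local for row in keep for k in row)

# Toy example: 3 layers x 2 heads, half of the heads compressed.
gates = [[0.9, 0.1], [0.05, 0.7], [0.3, 0.8]]
keep = select_reasoning_heads(gates, sparsity=0.5)
print(keep)
print(cached_tokens(keep, seq_len=1000))
```

Because compressed heads hold a constant number of tokens, their share of the cache stops growing with sequence length, which is why the savings matter most for long chain-of-thought generations.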

RLKV achieves near-lossless performance relative to the full KV cache up to the sparsity thresholds shown for Llama-3.1-8B-R1 (a) and Qwen-2.5-7B-R1 (b) across four benchmarks. A red background denotes performance below the full-KV-cache baseline, whereas a green background denotes performance above it. RLKV exhibits the smallest performance degradation among all methods and, on some benchmarks, even improves over the full-KV-cache baseline. For all values, higher is better. The best result on each benchmark is in bold. All values are reported as percentages.

The importance of the identified heads is illustrated by giving the top-ranked fraction of them a compressed KV cache in place of the full one. Compared to retrieval heads and random heads, reasoning heads identified by RLKV are more crucial to model performance and are sensitive to compressed KV cache access.

The analysis reveals distinct error modes when reasoning heads, as opposed to retrieval heads, operate with a compressed KV cache on the Math500 benchmark. Reasoning heads tend toward repetitive generation errors as compression increases, while retrieval heads exhibit more varied error modes across different settings.

BibTeX

@article{du2025whichheads,
  title={Which Heads Matter for Reasoning? RL-Guided KV Cache Compression},
  author={Du, Wenjie and Jiang, Li and Tao, Keda and Liu, Xue and Wang, Huan},
  journal={arXiv preprint arXiv:2510.08525},
  year={2025}
}
