Which Heads Matter for Reasoning?
RL-Guided KV Cache Compression

¹Westlake University, ²McGill University, ³Mila, ⁴Zhejiang University, and ⁵MBZUAI
*Corresponding author: wanghuan [at] westlake [dot] edu [dot] cn


News

  • 2026/04: RLKV is supported in SGLang inference (see code).
  • 2026/03: RLKV is presented at the LIT @ ICLR 2026 workshop (Latent & Implicit Thinking — Going Beyond CoT Reasoning).
  • 2025/10: arXiv preprint is released.

Abstract

Reasoning large language models generate extended chains of thought whose complex reasoning behaviors are highly fragile to information loss during decoding, creating critical challenges for KV cache compression. Existing token-dropping methods directly disrupt reasoning chains by removing intermediate steps, while head-reallocation methods, designed for retrieval tasks, fail to preserve the heads essential for generative reasoning. In short, no existing method can identify which attention heads genuinely maintain reasoning consistency and control generation termination. To address this, we propose RLKV, which uses reinforcement learning as a probe to discover which heads contribute to reasoning quality by directly optimizing their cache usage against actual generation outcomes. This discovery naturally yields an efficient compression strategy: we allocate full KV cache to reasoning-critical heads while aggressively compressing the others to a constant-size KV cache. Experiments reveal that only a fraction of heads is essential for reasoning, enabling 20-50% cache reduction with near-lossless performance across diverse tasks and models.

Overview of RLKV

Our method uses RL as a probe to identify reasoning-critical heads. The RL pipeline naturally captures reasoning behaviors, since it samples the current model's generations to produce reward signals, and the reward function evaluates those samples to assess reasoning quality. We employ L × H learnable gating adapters to mix full attention and local attention for each head, quantifying each head's reliance on full versus local KV cache access. We apply an L1 penalty to encourage adapter sparsity, while RL optimizes the adapters to preserve reasoning behaviors. After training, we identify reasoning-critical heads with high adapter values and allocate full KV cache to them while applying compressed KV cache to others for efficient inference.
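The per-head gating described above can be sketched as follows. This is a minimal illustrative implementation, not the paper's actual kernel: the function names (`gated_head_attention`, `select_critical_heads`), the sliding-window form of local attention, the gate range, and the threshold `tau` are all assumptions made for clarity. A gate near 1 means the head relies on the full KV cache; a gate near 0 means a constant-size local window suffices, and the L1 penalty pushes non-critical gates toward 0.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def gated_head_attention(q, k, v, gate, window=4):
    """Mix full attention and local (sliding-window) attention for one head.

    `gate` in [0, 1] is the learnable adapter value for this head:
    1 -> the head depends on the full KV cache,
    0 -> a constant-size local window is enough.
    (Illustrative sketch; the local-window form is an assumption.)
    """
    T, d = q.shape
    scores = q @ k.T / np.sqrt(d)
    causal = np.tril(np.ones((T, T), dtype=bool))
    full = np.where(causal, scores, -np.inf)
    # local mask: attend only to the last `window` positions
    # (what a constant-size compressed cache would retain)
    idx = np.arange(T)
    local_mask = causal & (idx[None, :] > idx[:, None] - window)
    local = np.where(local_mask, scores, -np.inf)
    out_full = softmax(full) @ v
    out_local = softmax(local) @ v
    return gate * out_full + (1.0 - gate) * out_local

def l1_penalty(gates, lam=1e-2):
    # sparsity pressure on the L x H adapter values,
    # added to the RL objective during training
    return lam * np.abs(gates).sum()

def select_critical_heads(gates, tau=0.5):
    # after training: heads whose gate exceeds tau keep the full
    # KV cache; the rest get a compressed, constant-size cache
    return [i for i, g in enumerate(gates) if g > tau]
```

In the full method the gates are trained by RL against generation rewards; here they are just inputs, which is enough to show how the mixed attention output and the head-selection rule fit together.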

Method Overview

Main Results

BibTeX

@article{du2025whichheads,
  title={Which Heads Matter for Reasoning? RL-Guided KV Cache Compression},
  author={Du, Wenjie and Jiang, Li and Tao, Keda and Liu, Xue and Wang, Huan},
  journal={arXiv preprint arXiv:2510.08525},
  year={2025}
}