Which Heads Matter for Reasoning?
RL-Guided KV Cache Compression

¹Westlake University, ²McGill University, ³Mila, ⁴Zhejiang University, and ⁵MBZUAI
*Corresponding author: wanghuan [at] westlake [dot] edu [dot] cn


News

  • 2026/04: RLKV is supported in SGLang inference (see code).
  • 2026/03: RLKV is presented at the LIT @ ICLR 2026 workshop (Latent & Implicit Thinking — Going Beyond CoT Reasoning).
  • 2025/10: arXiv preprint is released.

Abstract

Reasoning large language models generate extended chains of thought whose complex reasoning behaviors are highly fragile to information loss during decoding, creating critical challenges for KV cache compression. Existing token-dropping methods directly disrupt reasoning chains by removing intermediate steps, while head-reallocation methods, designed for retrieval tasks, fail to preserve the heads essential for generative reasoning. In short, no existing method can identify which attention heads genuinely maintain reasoning consistency and control generation termination. To address this, we propose RLKV, which uses reinforcement learning as a probe to discover which heads contribute to reasoning quality by directly optimizing their cache usage against actual generation outcomes. This discovery naturally yields an efficient compression strategy: we allocate full KV cache to reasoning-critical heads while aggressively compressing the others to a constant-size KV cache. Experiments reveal that only a fraction of heads is essential for reasoning, enabling 20-50% cache reduction with near-lossless performance across diverse tasks and models.

Overview of RLKV

Our method uses RL as a probe to identify reasoning-critical heads. The RL pipeline naturally captures reasoning behaviors, since it samples the current model's generations to produce reward signals, and the reward function evaluates those samples to assess reasoning quality. We employ L × H learnable gating adapters to mix full attention and local attention for each head, quantifying each head's reliance on full versus local KV cache access. We apply an L1 penalty to encourage adapter sparsity, while RL optimizes the adapters to preserve reasoning behaviors. After training, we identify reasoning-critical heads with high adapter values and allocate full KV cache to them while applying compressed KV cache to others for efficient inference.
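The per-head gating described above can be sketched as follows. This is a minimal illustrative implementation, not the paper's actual kernel: the function names (`gated_head_attention`, `select_critical_heads`), the sliding-window form of local attention, the gate range, and the threshold `tau` are all assumptions made for clarity. A gate near 1 means the head relies on the full KV cache; a gate near 0 means a constant-size local window suffices, and the L1 penalty pushes non-critical gates toward 0.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def gated_head_attention(q, k, v, gate, window=4):
    """Mix full attention and local (sliding-window) attention for one head.

    `gate` in [0, 1] is the learnable adapter value for this head:
    1 -> the head depends on the full KV cache,
    0 -> a constant-size local window is enough.
    (Illustrative sketch; the local-window form is an assumption.)
    """
    T, d = q.shape
    scores = q @ k.T / np.sqrt(d)
    causal = np.tril(np.ones((T, T), dtype=bool))
    full = np.where(causal, scores, -np.inf)
    # local mask: attend only to the last `window` positions
    # (what a constant-size compressed cache would retain)
    idx = np.arange(T)
    local_mask = causal & (idx[None, :] > idx[:, None] - window)
    local = np.where(local_mask, scores, -np.inf)
    out_full = softmax(full) @ v
    out_local = softmax(local) @ v
    return gate * out_full + (1.0 - gate) * out_local

def l1_penalty(gates, lam=1e-2):
    # sparsity pressure on the L x H adapter values,
    # added to the RL objective during training
    return lam * np.abs(gates).sum()

def select_critical_heads(gates, tau=0.5):
    # after training: heads whose gate exceeds tau keep the full
    # KV cache; the rest get a compressed, constant-size cache
    return [i for i, g in enumerate(gates) if g > tau]
```

In the full method the gates are trained by RL against generation rewards; here they are just inputs, which is enough to show how the mixed attention output and the head-selection rule fit together.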

Method Overview

Main Results

BibTeX

@article{du2025whichheads,
  title={Which Heads Matter for Reasoning? RL-Guided KV Cache Compression},
  author={Du, Wenjie and Jiang, Li and Tao, Keda and Liu, Xue and Wang, Huan},
  journal={arXiv preprint arXiv:2510.08525},
  year={2025}
}