What Is KV Cache in LLM Inference?
In large language model (LLM) inference serving engines, the KV cache is a region of memory that stores the attention keys and values already computed for previous tokens. “KV” stands for key-value, referring to the key and value tensors used inside the transformer attention mechanism.
When a language model generates text, it produces one token at a time. Without a KV cache, the model would need to recompute the attention keys and values for the entire prompt and all previously generated tokens at every decoding step. With a KV cache, the model reuses those stored tensors, computes the keys and values only for the newest token, and appends them to the cache.
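The difference between the two decoding strategies can be sketched in a few lines of Python. This is a minimal illustration under simplifying assumptions, not how any particular engine implements it: a single attention head, a single layer, random weights, and "tokens" that are already embedding vectors. The names `decode_without_cache`, `decode_with_cache`, `k_cache`, and `v_cache` are hypothetical.

```python
import math
import random

def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def attend(q, keys, values):
    """Scaled dot-product attention of one query over all keys and values."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
    top = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) / total
            for i in range(dim)]

def decode_without_cache(tokens, Wq, Wk, Wv):
    """At every step, recompute keys and values for all tokens seen so far."""
    outputs = []
    for t in range(1, len(tokens) + 1):
        keys = [matvec(Wk, x) for x in tokens[:t]]
        values = [matvec(Wv, x) for x in tokens[:t]]
        q = matvec(Wq, tokens[t - 1])
        outputs.append(attend(q, keys, values))
    return outputs

def decode_with_cache(tokens, Wq, Wk, Wv):
    """At every step, project only the newest token and append to the cache."""
    k_cache, v_cache = [], []
    outputs = []
    for x in tokens:
        k_cache.append(matvec(Wk, x))
        v_cache.append(matvec(Wv, x))
        q = matvec(Wq, x)
        outputs.append(attend(q, k_cache, v_cache))
    return outputs

def random_matrix(rng, rows, cols):
    return [[rng.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]

rng = random.Random(0)
d = 4  # toy embedding size (assumption)
Wq, Wk, Wv = (random_matrix(rng, d, d) for _ in range(3))
tokens = [[rng.uniform(-1, 1) for _ in range(d)] for _ in range(6)]

slow = decode_without_cache(tokens, Wq, Wk, Wv)
fast = decode_with_cache(tokens, Wq, Wk, Wv)
max_diff = max(abs(a - b) for s, f in zip(slow, fast) for a, b in zip(s, f))
```

Both functions produce the same attention outputs. The cached version performs one key projection and one value projection per step, while the uncached version repeats them for every earlier token. Note that the attention step itself still reads every cached entry, so the cache removes the redundant projections but not the cost of attending over a long sequence.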
This makes autoregressive generation much faster, especially for long prompts or long outputs, because each decoding step no longer repeats the key and value projections for earlier tokens. The tradeoff is memory: the KV cache grows linearly with sequence length, batch size, number of layers, number of key-value heads, and per-head dimension. For serving systems, KV cache management is often one of the main constraints on throughput, latency, and maximum context length.
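To get a feel for the memory cost, the cache size can be estimated by multiplying the factors that it scales with. The sketch below assumes standard multi-head attention with one key tensor and one value tensor per layer; the function name `kv_cache_bytes` and the model configuration (32 layers, 32 key-value heads, head dimension 128, 16-bit values) are hypothetical and chosen only for illustration.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim,
                   seq_len, batch_size, bytes_per_elem=2):
    """Estimate KV cache size in bytes.

    The leading factor of 2 accounts for storing both a key and a value
    tensor in every layer. bytes_per_elem=2 corresponds to 16-bit storage.
    """
    return (2 * num_layers * num_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_elem)

# Hypothetical configuration, not taken from any specific model card.
per_token = kv_cache_bytes(num_layers=32, num_kv_heads=32, head_dim=128,
                           seq_len=1, batch_size=1)
total = kv_cache_bytes(num_layers=32, num_kv_heads=32, head_dim=128,
                       seq_len=4096, batch_size=8)

print(per_token / 2**10, "KiB per token")    # 512.0 KiB per token
print(total / 2**30, "GiB for the batch")    # 16.0 GiB for the batch
```

Under these assumptions, a batch of eight 4096-token sequences needs 16 GiB for the cache alone, which is why serving engines treat KV cache capacity as a first-class resource.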
In practice, LLM inference engines optimize KV cache usage through techniques such as paged attention, prefix caching, KV cache eviction, quantized cache storage, and distributed cache placement across GPUs.
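As an illustration of the first technique, the core idea behind paged attention is to store each sequence's cache in fixed-size blocks and to track them through a per-sequence block table, rather than reserving one contiguous region per sequence. The sketch below shows only that bookkeeping. It is a toy model, not the implementation used by any real engine; the class `PagedKVAllocator` and its methods are hypothetical names.

```python
class PagedKVAllocator:
    """Toy block allocator: maps each sequence to a list of fixed-size blocks."""

    def __init__(self, num_blocks, block_size):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))  # physical block ids
        self.block_tables = {}  # seq_id -> list of physical block ids
        self.seq_lens = {}      # seq_id -> number of tokens stored

    def append_token(self, seq_id):
        """Reserve a slot for one more token and return (block_id, offset)."""
        table = self.block_tables.setdefault(seq_id, [])
        length = self.seq_lens.get(seq_id, 0)
        # Allocate a new block only when all current blocks are full.
        if length == len(table) * self.block_size:
            if not self.free_blocks:
                raise MemoryError("no free KV blocks")
            table.append(self.free_blocks.pop())
        self.seq_lens[seq_id] = length + 1
        return table[length // self.block_size], length % self.block_size

    def free(self, seq_id):
        """Return all blocks of a finished sequence to the free pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.seq_lens.pop(seq_id, None)


alloc = PagedKVAllocator(num_blocks=4, block_size=2)
slots_a = [alloc.append_token("a") for _ in range(3)]  # spans two blocks
slots_b = [alloc.append_token("b") for _ in range(2)]  # fills one block
free_before = len(alloc.free_blocks)
alloc.free("a")  # sequence "a" finishes; its blocks become reusable
free_after = len(alloc.free_blocks)
```

Because blocks are allocated on demand, a sequence wastes at most one partially filled block, and blocks released by a finished sequence can be handed to any other sequence immediately.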