hierarchical radix cache

A hierarchical radix cache is a prefix KV cache that combines:

a radix tree for sharing common prompt prefixes, and
a memory hierarchy, typically GPU memory as the fast first level and CPU memory as the larger, slower second level.

what is a radix cache?

Suppose several prompts begin with overlapping token sequences:

Prompt 1: You are an expert programmer. Write Python code.
Prompt 2: You are an expert programmer. Review Python code.
Prompt 3: You are an expert mathematician. Solve this equation.

A radix tree stores their shared prefixes only once:

You are an expert
├── programmer
│   ├── Write Python code
│   └── Review Python code
└── mathematician
    └── Solve this equation

Each tree node represents a segment of tokens and stores the corresponding KV tensors. Therefore, the KV tensors for:

You are an expert

do not need to be duplicated for every prompt. When a new request arrives, the serving system walks down the tree and finds the longest matching prefix.

Boyang Yan

Explorer

hierarchical radix cache

what is a radix cache?

Graph View

Backlinks