IndexCache, a new sparse attention optimizer, delivers 1.82x faster inference on long-context AI models



Processing 200,000 tokens through a large language model is expensive and slow: the longer the context, the faster costs climb. Researchers from Tsinghua University and Z.ai have built a technique called IndexCache that eliminates up to 75% of redundant computation in sparse attention models, delivering up to 1.82x faster time to first token and 1.48x faster generation at that context length.

The technique applies to models that use the DeepSeek Sparse Attention architecture, including the latest DeepSeek and GLM families. It can help companies deliver faster user experiences for long-context, production-scale models, a capability already demonstrated in early testing on the 744-billion-parameter GLM-5 model.

The DSA bottleneck

Large language models rely on the self-attention mechanism, a process in which the model calculates the relationship between each token in its context and all previous ones to predict the next token.

However, self-attention has a serious limitation: its computational complexity grows quadratically with sequence length. For applications that require extended context windows (e.g., large-document processing, multi-step agent workflows, or long chain-of-thought reasoning), this quadratic scaling results in slow inference and significant compute and memory costs.
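The quadratic scaling can be seen with a back-of-the-envelope count (a toy sketch, not from the paper): each of the n tokens in a causal model attends to itself and every earlier token, so the number of query-key scores grows with n squared.

```python
# Toy illustration of quadratic self-attention cost: each of the n tokens
# attends to itself and all preceding tokens.

def attention_pairs(n_tokens: int) -> int:
    """Number of query-key pairs full causal attention must score."""
    return n_tokens * (n_tokens + 1) // 2

for n in (1_000, 10_000, 100_000, 200_000):
    print(f"{n:>7} tokens -> {attention_pairs(n):,} score computations")
```

Doubling the context from 100K to 200K tokens roughly quadruples the attention work, which is why long-context inference gets disproportionately expensive.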

Sparse attention offers a principled solution to this scaling problem. Instead of computing the relationship between each token and all previous ones, sparse attention has each query attend to only the most relevant subset of tokens.

DeepSeek Sparse Attention (DSA) is a highly efficient implementation of this concept, first introduced in DeepSeek-V3.2. To determine which tokens matter most, DSA introduces a lightweight "lightning indexer" module in each layer of the model. This indexer scores all preceding tokens and selects a small subset to be processed by the main attention mechanism. By doing this, DSA reduces the heavy core attention computation from quadratic to near-linear, dramatically speeding up the model while preserving output quality.
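The selection step can be sketched as a top-k filter over indexer scores (a minimal illustration with hypothetical names; the budget of 2,048 tokens is an assumption, not a figure from the article):

```python
import numpy as np

def select_tokens(index_scores: np.ndarray, k: int) -> np.ndarray:
    """Sketch of DSA-style selection: the indexer has assigned a relevance
    score to every preceding token, and only the top-k highest-scoring
    tokens are passed to the main attention mechanism."""
    k = min(k, index_scores.shape[-1])
    # argpartition finds the k largest scores without a full sort.
    return np.argpartition(index_scores, -k)[-k:]

rng = np.random.default_rng(0)
scores = rng.normal(size=200_000)      # one indexer score per preceding token
selected = select_tokens(scores, k=2_048)
print(selected.shape)                  # only 2,048 tokens reach main attention
```

Main attention then runs over roughly 2K tokens instead of 200K, which is where the quadratic-to-near-linear savings come from.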

But the researchers identified a persistent flaw: the DSA indexer itself still operates with quadratic complexity at every layer. Although the indexer is computationally cheaper than the main attention process, as context length grows, the time the model spends running these indexers balloons. This severely slows down the model, especially during the initial "prefill" stage, in which the prompt is first processed.

Attention caching with IndexCache

To solve the indexer bottleneck, the research team discovered a crucial feature of how DSA models process data: the subset of important tokens that an indexer selects remains remarkably stable as data moves through consecutive transformer layers. Empirical tests on DSA models showed that adjacent layers share between 70% and 100% of their selected tokens.
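That overlap is straightforward to quantify. A minimal sketch with made-up selections (the 70-100% range is the paper's empirical finding; these toy sets merely show how such a measurement could work):

```python
def index_overlap(sel_a: set, sel_b: set) -> float:
    """Fraction of layer A's selected token positions also selected by layer B."""
    return len(sel_a & sel_b) / len(sel_a)

# Hypothetical selections from two adjacent layers: 5 of 6 positions agree.
layer_3 = {1, 5, 9, 12, 40, 77}
layer_4 = {1, 5, 9, 12, 40, 80}
print(f"{index_overlap(layer_3, layer_4):.0%}")  # 83%
```

When this fraction is consistently high, most per-layer indexer runs are recomputing nearly the same answer, which is the redundancy IndexCache removes.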

To exploit this cross-layer redundancy, the researchers developed IndexCache. The technique divides the model's layers into two categories. A small number of full (F) layers keep their indexers, actively scoring tokens, choosing the most important ones, and caching the result. The remaining shared (S) layers perform no indexing and reuse the cached indexes of the nearest preceding F layer.

During inference, the model simply checks the layer type: if it reaches an F layer, it computes and caches fresh indexes; if it is an S layer, it skips the computation and reuses the cached indexes.
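The control flow just described can be sketched in a few lines (function and variable names here are hypothetical, not from the paper's implementation):

```python
# Minimal sketch of IndexCache's per-layer dispatch: "F" layers refresh the
# index cache, "S" layers reuse the most recent F layer's selection.

def run_layers(layer_types, compute_indexes, attend):
    """layer_types: list of "F"/"S" markers, one per transformer layer.
    compute_indexes(layer) -> selected token indexes (the expensive step).
    attend(layer, indexes) -> runs main attention over the selected tokens."""
    cached = None
    for layer, kind in enumerate(layer_types):
        if kind == "F":
            cached = compute_indexes(layer)  # score tokens, refresh the cache
        # "S" layers skip indexing entirely and reuse the cached selection.
        attend(layer, cached)

calls = []
run_layers(
    ["F", "S", "S", "F", "S"],
    compute_indexes=lambda l: f"idx@{l}",
    attend=lambda l, idx: calls.append((l, idx)),
)
print(calls)  # [(0, 'idx@0'), (1, 'idx@0'), (2, 'idx@0'), (3, 'idx@3'), (4, 'idx@3')]
```

With a 1:3 ratio of F to S layers, three quarters of the indexer invocations disappear, matching the 75% reduction the researchers report.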

There is a wide range of optimization techniques that attempt to address the attention bottleneck by compressing the KV cache, where computed attention keys and values are stored. Rather than shrinking the memory footprint like standard KV cache compression, IndexCache attacks the compute bottleneck.

“IndexCache is not a traditional KV cache compression or sharing technique,” Yushi Bai, co-author of the paper, told VentureBeat. “It eliminates this redundancy by reusing indexes between layers, thereby reducing computation rather than just memory footprint. It is complementary to existing approaches and can be combined with them.”

Researchers developed two implementation approaches for IndexCache. (It’s worth noting that IndexCache only applies to models that use the DSA architecture, such as the latest DeepSeek models and the latest GLM family of models.)

For developers working with commercially available DSA models where retraining is infeasible or too expensive, they created a training-free method that relies on a “greedy layer selection” algorithm. By running a small set of calibration data through the model, this algorithm automatically determines the optimal placement of the F and S layers without any weight updates. Empirical evidence shows that the greedy algorithm can safely eliminate 75% of the indexers while matching the downstream performance of the original model.
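One way such a greedy search could work (a hypothetical sketch under the assumption that layers are converted to Shared one at a time, each step dropping the indexer whose removal hurts a calibration metric least; the paper's exact procedure may differ):

```python
def greedy_layer_selection(n_layers, budget, quality):
    """Hypothetical greedy F-layer placement: start with every layer Full,
    repeatedly convert to Shared the layer whose indexer removal least
    degrades `quality(kept_indexers)` on calibration data, until only
    `budget` indexers remain."""
    kept = set(range(n_layers))
    while len(kept) > budget:
        # Layer 0 must stay Full: an S layer needs an earlier F layer to reuse.
        drop = max((l for l in kept if l != 0), key=lambda l: quality(kept - {l}))
        kept.remove(drop)
    return sorted(kept)

# Toy calibration metric: pretend the indexers at layers 0, 4, and 8 matter
# most, and the rest contribute almost nothing.
important = {0: 3.0, 4: 2.0, 8: 1.0}
q = lambda kept: sum(important.get(l, 0.1) for l in kept)
print(greedy_layer_selection(12, budget=3, quality=q))  # [0, 4, 8]
```

Keeping 3 indexers out of 12 corresponds to the 75% removal rate the researchers report; in practice `quality` would be a score on domain-specific calibration data rather than a toy lookup table.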

For teams that pre-train or heavily fine-tune their own base models, the researchers propose a training-aware version that optimizes network parameters to natively support cross-layer sharing. This approach introduces a “multi-layer distillation loss” during training, which forces each retained indexer to learn to select a consensus subset of tokens that remains highly relevant to all subsequent layers it serves.

Real-world accelerations in production models

To test the impact of IndexCache, the researchers applied it to the 30-billion-parameter GLM-4.7-Flash model and compared it against the standard baseline.

With a 200,000-token context, removing 75% of the indexers cut prefill latency from 19.5 seconds to just 10.7 seconds, a 1.82x speedup. The researchers note that these gains are expected to be even larger at longer contexts.

During the decoding phase, in which the model generates its response, IndexCache raised per-request throughput from 58 tokens per second to 86 tokens per second at the 200K context mark, a 1.48x speedup. When server memory is fully saturated with requests, total decoding throughput increased by up to 51%.
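The headline multipliers follow directly from the reported measurements:

```python
# Sanity-checking the reported speedups from the measurements above.
prefill_speedup = 19.5 / 10.7   # baseline vs. IndexCache prefill latency (s)
decode_speedup = 86 / 58        # tokens/sec with vs. without IndexCache
print(f"prefill: {prefill_speedup:.2f}x, decode: {decode_speedup:.2f}x")
```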

For business teams, these efficiency gains translate directly into cost savings. “In terms of ROI, IndexCache provides consistent benefits in all scenarios, but the gains are most notable in long-context workloads such as RAG, document analysis, and agent pipelines,” Bai said. “In these cases, we see at least a roughly 20% reduction in deployment cost and similar improvements in user-perceived latency.” He added that for very short-context tasks, the benefits are around 5%.

Surprisingly, these efficiency gains did not compromise reasoning abilities. Using the training-free approach to remove 75% of the indexers, the 30B model matched the original baseline’s average score on long-context benchmarks, scoring 49.9 versus the original 50.2. On the challenging AIME 2025 mathematical reasoning benchmark, the optimized model actually outperformed the baseline, scoring 92.6 versus 91.0.

The team also conducted preliminary experiments on the 744-billion-parameter GLM-5 model at production scale. They found that removing 75% of its indexers with the training-free method produced at least a 1.3x speedup at contexts longer than 100,000 tokens, while the model maintained nearly identical average quality on long-context tasks.

Putting IndexCache into production

For development teams looking to adopt the training-free approach today, the process is straightforward but requires careful setup. While the greedy search algorithm automatically finds the optimal layer configuration, the quality of that configuration depends on the calibration data it processes.

“We recommend using domain-specific data as a calibration set so that the discovered layer-sharing pattern aligns with actual workloads,” Bai said.

Once calibrated, the optimization is readily deployable in production environments: open-source patches for major serving engines are now available on GitHub. “Integration is relatively simple: developers can apply the patch to existing inference stacks, such as vLLM or SGLang, and enable IndexCache with minimal configuration changes,” Bai said.

While IndexCache provides an immediate fix for today’s compute bottlenecks, its underlying philosophy points to a broader shift in how the AI industry will approach model design.

“Future base models will likely be designed with downstream inference constraints in mind from the beginning,” Bai concluded. “This means designs that are not only scalable in terms of model size, but are also optimized for real-world performance and latency, rather than treating them as post hoc concerns.”
