Context compression finally works in production: New research reduces LLM input by 16 times without affecting accuracy

Context windows are becoming a computational bottleneck. The longer an agent runs, the more tokens are accumulated from retrieved documents, reasoning traces, and conversation history, and the more memory and computation the growing context demands. Most existing solutions degrade model accuracy, require the full context to be loaded before compression begins, or produce memory savings that do not translate into real speedups on standard serving infrastructure.

A research team from New York University, Columbia, Princeton, the University of Maryland, Harvard and Lawrence Livermore National Laboratory published a paper this week that proposes a new solution. The researchers introduce latent context language models, or LCLMs, a family of encoder-decoder models that compress the input context before it reaches the decoder. The models are open source on Hugging Face.

Unlike KV cache compression methods (the dominant approach in the field, which still materializes the entire KV cache before evicting entries), LCLMs compress the input token sequence before decoder prefill, so higher compression ratios directly reduce computation and memory on the decoder side. The paper reports that LCLMs with 16x compression produced results 8.8 times faster than KV cache baselines on the RULER long-context benchmark.

"These growing contexts consume memory and computation, and are becoming a computational bottleneck for LLMs," Micah Goldblum, co-lead advisor on the project and a researcher at Columbia University, told VentureBeat. "Our goal was to train end-to-end language models that can handle very long contexts efficiently and accurately. If you can create a language model like that, everything will be cheaper and faster."

What LCLMs can do

LCLMs allow models to process much longer contexts than would otherwise be practical, at a fraction of the memory and compute cost, without the accuracy degradation that makes most compression methods a poor trade-off in production.

With 4x compression, the paper reports an accuracy of 91.76% on the RULER benchmark, compared to 94.41% without any compression. That’s a drop of less than 3 points in exchange for cutting the context to a quarter of its original size. At 16x compression, where 93.75% of the input tokens are removed, the accuracy dropped to 75.06%. Every KV cache method tested at the same compression ratio scored lower.
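The arithmetic behind these figures is easy to check. The short sketch below recomputes the share of tokens removed at each ratio and the accuracy gap from the numbers quoted above. The function and constant names are illustrative and do not come from the paper.

```python
def fraction_removed(ratio: float) -> float:
    """Share of input tokens that no longer reach the decoder at a given compression ratio."""
    return 1.0 - 1.0 / ratio


def accuracy_drop(baseline: float, compressed: float) -> float:
    """Drop in percentage points relative to the uncompressed baseline."""
    return round(baseline - compressed, 2)


# RULER accuracies quoted in the article (percent).
RULER_UNCOMPRESSED = 94.41
RULER_BY_RATIO = {4: 91.76, 16: 75.06}

for ratio, acc in RULER_BY_RATIO.items():
    print(
        f"{ratio}x: {fraction_removed(ratio):.2%} of tokens removed, "
        f"{accuracy_drop(RULER_UNCOMPRESSED, acc)} points below baseline"
    )
```

Run as written, this puts the 4x gap at 2.65 points and the 16x gap at 19.35 points, which is the trade-off a team would weigh against the memory and speed gains.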

The gains also hold on shorter inputs. On GSM8K math word problems, where the entire prompt is compressed rather than just the retrieved documents, LCLMs outperformed all other methods tested, regardless of compression ratio.

How it was built

The architecture combines a 0.6B encoder with a 4B decoder. The encoder compresses blocks of input tokens into shorter sequences of latent embeddings. The decoder processes them instead of the original tokens. The training covered over 350 billion tokens.
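To make the data flow concrete, here is a toy sketch of the shape change the decoder sees. It is not the paper's encoder: mean pooling stands in for the learned 0.6B model, and the names `mean_pool` and `compress_sequence`, along with the one-vector-per-group layout, are assumptions made purely for illustration.

```python
from typing import List

Vector = List[float]


def mean_pool(vectors: List[Vector]) -> Vector:
    """Average a group of equal-length vectors (a stand-in for a learned encoder)."""
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def compress_sequence(embeddings: List[Vector], ratio: int) -> List[Vector]:
    """Replace every `ratio` consecutive token embeddings with one latent vector.

    A trailing group shorter than `ratio` is pooled on its own.
    """
    if ratio < 1:
        raise ValueError("ratio must be at least 1")
    return [
        mean_pool(embeddings[start:start + ratio])
        for start in range(0, len(embeddings), ratio)
    ]


# 64 fake token embeddings of dimension 4.
tokens = [[float(t)] * 4 for t in range(64)]
latents = compress_sequence(tokens, ratio=16)
print(len(tokens), "->", len(latents))  # 64 -> 4
```

The point of the sketch is only that the decoder's sequence length, and therefore its attention and KV cache cost, shrinks by the compression ratio before any decoding begins.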

The training recipe combines three types of data:

Continuous pre-training data with compressed and uncompressed sections interspersed
Supervised fine-tuning data covering reasoning and long-context tasks
An auxiliary reconstruction task that forces the encoder to retain fine details

The combination addresses a trade-off that limited previous compression work, where preserving reconstruction accuracy came at the cost of overall task performance.

An architecture search identified the optimal configuration. The paper found that scaling the decoder matters more than scaling the encoder.

Where does it fit in an agent stack?

An LCLM is not an abstract research concept. It is designed to work with an existing stack. "You can simply swap in LCLMs for any existing LLM," Goldblum said. "Whenever you retrieve data, such as documents, and want to dump them into your model context, simply run those documents through the LCLM compressor first."
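In pipeline terms, the change Goldblum describes amounts to one extra step between retrieval and generation. The sketch below is an assumption-heavy illustration, not the LCLM API: `Compressor`, `make_stub_compressor`, and `build_context` are hypothetical names, and the stand-in compressor merely keeps every Nth word so the example runs without a model.

```python
from typing import Callable, List

# Hypothetical interface: takes raw document text, returns a compact representation.
# A real LCLM encoder would emit latent embeddings rather than text.
Compressor = Callable[[str], str]


def make_stub_compressor(ratio: int) -> Compressor:
    """Build a placeholder compressor that keeps every `ratio`-th word."""
    def compress(text: str) -> str:
        return " ".join(text.split()[::ratio])
    return compress


def build_context(question: str, documents: List[str], compress: Compressor) -> str:
    """Compress each retrieved document, then assemble the decoder input."""
    compressed = [compress(doc) for doc in documents]
    return "\n\n".join(compressed + [f"Question: {question}"])


docs = ["alpha beta gamma delta epsilon zeta eta theta", "one two three four"]
context = build_context("What comes first?", docs, make_stub_compressor(ratio=4))
print(context)
```

Keeping the compressor behind a plain callable means the retrieval code does not need to know whether compression is enabled, which makes it easier to compare compressed and uncompressed runs side by side.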

He noted that the paper also demonstrates how to build agents that selectively decompress useful text.
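Selective decompression can be pictured as a two-pass loop: rank everything by its compressed form, then read only the best candidates in full. The sketch below is a hypothetical illustration rather than the agent from the paper. Keyword overlap stands in for whatever relevance signal a real agent would use, and `overlap_score`, `skim_then_zoom`, and `top_k` are invented names.

```python
from typing import Dict, List


def overlap_score(query: str, summary: str) -> int:
    """Count query words that appear in a compressed summary (toy relevance signal)."""
    return len(set(query.lower().split()) & set(summary.lower().split()))


def skim_then_zoom(
    query: str,
    summaries: Dict[str, str],
    full_texts: Dict[str, str],
    top_k: int = 1,
) -> List[str]:
    """Rank documents by their compressed form, then return full text for the best ones."""
    ranked = sorted(
        summaries,
        key=lambda doc_id: overlap_score(query, summaries[doc_id]),
        reverse=True,
    )
    return [full_texts[doc_id] for doc_id in ranked[:top_k]]


summaries = {"a": "gpu memory limits", "b": "survey of retrieval spending"}
full_texts = {"a": "Full report on GPU memory limits.", "b": "Full survey text."}
print(skim_then_zoom("gpu memory", summaries, full_texts))
```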

"Think of this as a human skimming through the content before zooming in on relevant details." Goldblum said.

Goldblum also cautioned that teams integrating the approach into existing agent pipelines will need to adjust their RAG systems accordingly.

"We have also not worked on online compression of reasoning traces," he said. "The naive approach of occasionally compressing the trace as it is generated might work, but that remains to be determined."

What this means for businesses

Context windows are growing faster than inference infrastructure can keep up, and companies are already investing to address it. VB Pulse Q1 2026 survey data from organizations with more than 100 employees shows that intent to adopt hybrid retrieval tripled from 10.3% in January to 33.3% in March. Retrieval optimization overtook evaluation as the top investment priority in March, reaching 28.9% of qualified respondents.

Three things stand out for teams evaluating whether the approach fits in production:

The cost of inference increases with the length of the context. At 1 million tokens, uncompressed inference with standard KV cache methods runs out of memory on a single H200 GPU. The paper reports that LCLMs with 16x compression stay within memory limits at that context length.
RAG pipeline integration requires adjustments. Teams with existing RAG pipelines will need to validate compression behavior against their retrieval quality metrics before deploying at scale.
Reasoning trace compression is not resolved. For agents executing long chains of reasoning, context growth from reasoning traces is a problem independent of document retrieval. Goldblum acknowledged the gap directly: the naive approach of periodic trace compression could work, but it has not been tested.

The models are available at huggingface.co/latent-context and the code at github.com/LeonLixyz/LCLM.

"The most important thing our architectures do is give your model access to much larger contexts, but they also unlock multi-scale approaches where your model can read large amounts of text or code super fast and then only zoom in and fully read a small portion of the most useful text," Goldblum said.
