Google’s TurboQuant compresses AI memory sixfold and shakes up chip stocks


Google published a research blog post on Tuesday about a new compression algorithm for AI models. Within hours, memory stocks were falling. Micron fell 3 percent, Western Digital lost 4.7 percent, and SanDisk fell 5.7 percent, as investors recalculated how much physical memory the AI industry might actually need.

The algorithm is called TurboQuant, and it addresses one of the most expensive bottlenecks in running large language models: the key-value cache, a high-speed data store that contains context information so the model doesn’t have to recompute it with each new token it generates. As models process longer inputs, the cache grows rapidly, consuming GPU memory that could otherwise be used to serve more users or run larger models. TurboQuant compresses the cache to just 3 bits per value, down from the standard 16, reducing its memory usage by at least six times without, according to Google benchmarks, any measurable loss in accuracy.
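To see why the cache dominates GPU memory at long context lengths, a rough back-of-the-envelope calculation helps. The model dimensions below are illustrative (a hypothetical 7B-class decoder), not figures from the paper:

```python
# Rough key-value cache size estimate for a transformer decoder.
# All model dimensions are illustrative, not taken from the paper.

def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, bits_per_value):
    """Bytes needed to cache keys and values for one sequence."""
    values = 2 * n_layers * n_kv_heads * head_dim * seq_len  # keys + values
    return values * bits_per_value / 8

# A hypothetical 7B-class model serving a 128k-token context:
args = dict(n_layers=32, n_kv_heads=8, head_dim=128, seq_len=128_000)

fp16 = kv_cache_bytes(**args, bits_per_value=16)  # standard 16-bit cache
q3   = kv_cache_bytes(**args, bits_per_value=3)   # TurboQuant-style 3-bit

print(f"16-bit cache: {fp16 / 2**30:.1f} GiB")  # prints "16-bit cache: 15.6 GiB"
print(f" 3-bit cache: {q3 / 2**30:.1f} GiB")    # prints " 3-bit cache: 2.9 GiB"
```

At these dimensions a single long-context sequence ties up roughly 15 GiB of a GPU's memory at 16 bits; at 3 bits the same cache fits in under 3 GiB, which is the headroom that lets one GPU serve more concurrent users.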

The paper, which will be presented at ICLR 2026, was written by Amir Zandieh, a research scientist at Google, and Vahab Mirrokni, vice president and fellow at Google, along with collaborators from Google DeepMind, KAIST, and New York University. It builds on two previous papers from the same group: QJL, published at AAAI 2025, and PolarQuant, which will appear at AISTATS 2026.

How it works

TurboQuant’s main innovation is eliminating the overhead that makes most compression techniques less effective than their headline numbers suggest. Traditional quantization methods shrink data vectors, but must also store additional constants: normalization values that the system needs to decompress the data accurately. These constants typically add one or two extra bits per number, partially undoing the compression.
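That overhead is easy to quantify. A sketch, assuming a typical blockwise scheme that stores a 16-bit scale and a 16-bit zero-point per block of values (common choices in practice, not figures from the paper):

```python
def effective_bits(bits_per_value, block_size, metadata_bits=32):
    """Real storage cost per value once per-block constants are counted.

    metadata_bits: e.g. a 16-bit scale + 16-bit zero-point per block
    (typical choices for blockwise quantization; not from the paper).
    """
    return bits_per_value + metadata_bits / block_size

# A nominal 3-bit quantizer with 32-value blocks actually costs:
print(effective_bits(3, block_size=32))  # prints 4.0 bits per value
# Smaller blocks improve accuracy but make the overhead worse:
print(effective_bits(3, block_size=16))  # prints 5.0 bits per value
```

This is the "one or two extra bits per number" the article describes: a scheme advertised as 3-bit can easily cost 4 to 5 bits in practice, which is precisely the gap TurboQuant is designed to close.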


TurboQuant avoids this through a two-stage process. The first stage, called PolarQuant, converts data vectors from standard Cartesian coordinates to polar coordinates, separating each vector into a magnitude and a set of angles. Because angular distributions follow concentrated, predictable patterns, the system can skip the costly block normalization step entirely. The second stage applies QJL, a technique based on the Johnson-Lindenstrauss transform, which reduces the small residual error of the first stage to a single sign bit per dimension. The combined result is a representation that uses most of its compression budget to capture the meaning of the original data and a minimal residual budget for error correction, without wasting overhead on normalization constants.
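A toy numerical sketch of this two-stage idea (a heavy simplification for illustration, not Google's implementation): stage 1 represents coordinate pairs in polar form and quantizes only the angle, skipping any per-block normalization; stage 2 keeps one sign bit per dimension of a random projection of the residual, in the spirit of a Johnson-Lindenstrauss transform.

```python
import numpy as np

rng = np.random.default_rng(0)

def polar_quantize(x, angle_bits=3):
    """Toy stage 1: split x into 2D pairs, keep each pair's magnitude
    (unquantized here for simplicity) and uniformly quantize its angle.
    A simplification of the polar-coordinate idea, not PolarQuant itself."""
    pairs = x.reshape(-1, 2)
    r = np.linalg.norm(pairs, axis=1)
    theta = np.arctan2(pairs[:, 1], pairs[:, 0])          # angle in (-pi, pi]
    levels = 2 ** angle_bits
    code = np.round((theta + np.pi) / (2 * np.pi) * levels) % levels
    theta_hat = code / levels * 2 * np.pi - np.pi         # dequantized angle
    approx = np.stack([r * np.cos(theta_hat), r * np.sin(theta_hat)], axis=1)
    return approx.reshape(x.shape)

def sign_bit_residual(residual, proj):
    """Toy stage 2: randomly project the stage-1 residual and keep only
    one sign bit per output dimension (the JL-style trick)."""
    return np.sign(proj @ residual)

x = rng.standard_normal(64)
x_hat = polar_quantize(x)                     # stage-1 reconstruction
residual = x - x_hat                          # small error left over
proj = rng.standard_normal((64, 64)) / np.sqrt(64)
bits = sign_bit_residual(residual, proj)      # 1 bit per dimension

print("relative error after stage 1:",
      np.linalg.norm(residual) / np.linalg.norm(x))
```

Because angles are bounded and concentrated, uniform angle codes need no per-block scale constants, and the stage-1 error is small enough that a single sign bit per dimension suffices to describe the residual. Reconstructing values from those sign bits is where the actual QJL machinery comes in, which this sketch omits.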

Google tested TurboQuant on five standard benchmarks for long-context language models, including LongBench, Needle in a Haystack, and ZeroSCROLLS, using open source models from the Gemma, Mistral, and Llama families. At 3 bits, TurboQuant matched or surpassed KIVI, the current standard baseline for key-value cache quantization, which was published at ICML 2024. On needle-in-a-haystack retrieval tasks, which test whether a model can locate a single piece of information buried in a long passage, TurboQuant achieved perfect scores while compressing the cache by a factor of six. At 4-bit precision, the algorithm accelerated attention computation on Nvidia H100 GPUs by up to eight times compared to the uncompressed 32-bit baseline.

What the market heard

The stock reaction was swift and, in the opinion of several analysts, disproportionate. Wells Fargo analyst Andrew Rocha noted that TurboQuant directly attacks the memory cost curve in AI systems. If widely adopted, he said, the question quickly becomes how much memory capacity the industry really needs. But Rocha and others also cautioned that demand for AI memory remains strong and that compression algorithms have existed for years without fundamentally altering procurement volumes.

However, the concern is not unfounded. AI infrastructure spending is growing at an extraordinary rate: Meta alone committed up to $27 billion in a recent deal with Nebius for dedicated computing capacity, and Google, Microsoft, and Amazon collectively plan hundreds of billions of dollars in data center capital spending through 2026. A technology that reduces memory requirements six-fold does not reduce spending six-fold, because memory is only one component of a data center's cost. But it changes the ratio, and in an industry spending at this scale, even marginal efficiency gains add up quickly.

The question of efficiency

TurboQuant comes at a time when the AI industry is being forced to confront the economics of inference. Training a model is a one-time cost, however enormous. Running it, serving millions of queries per day with acceptable latency and accuracy, is the recurring expense that determines whether AI products are financially viable at scale. The key-value cache is critical to this calculation: it is the bottleneck that limits how many simultaneous users a single GPU can serve and how long a context window a model can practically support.

Compression techniques like TurboQuant are part of a broader push to make inference cheaper, alongside hardware improvements like Nvidia’s Vera Rubin architecture and Google’s Ironwood TPUs. The question is whether these efficiency gains will reduce the total amount of hardware the industry purchases, or simply enable more ambitious deployments at roughly the same cost. The history of computing suggests the latter: when storage becomes cheaper, people store more; when bandwidth increases, applications consume it.

For Google, TurboQuant also has a direct commercial application beyond language models. The blog post notes that the algorithm improves vector search, the technology that powers semantic similarity searches across billions of elements. Google tested it against existing methods on the GloVe benchmark dataset and found that it achieved superior recall rates without requiring the large codebooks or dataset-specific tuning that competing approaches demand. This matters because vector search underpins everything from Google Search to YouTube recommendations to ad targeting. In other words, it powers Google’s revenue.

The paper’s contribution is real: a training-free compression method that achieves significantly better results than the existing state of the art, with solid theoretical foundations and a practical implementation on production hardware. Whether it reshapes the economics of AI infrastructure or simply becomes one more optimization absorbed by the industry’s insatiable appetite for computing is a question the market will answer in months, not hours.


