Google's TurboQuant compression cuts LLM memory needs by 6x, sparks memory chip stock selloff
Google unveiled TurboQuant, a compression technique that it says cuts the memory required to run large language models by up to six times by optimizing how the key-value cache is stored. Shares of memory chipmakers Samsung, SK Hynix, and Micron fell 5-6% on concern that the efficiency gain could reduce future chip demand. Analysts say the decline reflects profit-taking rather than a fundamental shift, arguing that more powerful models will eventually require more advanced hardware.
On Tuesday, Google unveiled TurboQuant, a compression technique targeting the key-value (KV) cache, the component that stores the keys and values computed for earlier tokens so that a model does not have to recompute them for each new token. The company claims the method reduces total memory footprint by up to six times, which bears directly on the cost and efficiency of inference.
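To give a sense of scale, the arithmetic behind a KV cache can be sketched in a few lines. This is a back-of-the-envelope estimate for a standard transformer decoder, not a description of how TurboQuant works; the model shape below (32 layers, 8 KV heads, head dimension 128, 32,768-token context, 16-bit values) is a hypothetical chosen for illustration, and the division by six simply applies the "up to six times" figure Google claims.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   batch_size=1, bytes_per_value=2):
    """Approximate KV-cache size for a standard transformer decoder.

    Each layer stores one key vector and one value vector per KV head
    for every token in the sequence, hence the leading factor of 2.
    """
    return (2 * num_layers * num_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_value)


# Hypothetical model shape, chosen for illustration only.
baseline = kv_cache_bytes(num_layers=32, num_kv_heads=8, head_dim=128,
                          seq_len=32_768)   # 16-bit (2-byte) values
compressed = baseline / 6                   # the claimed "up to 6x" reduction

GIB = 1024 ** 3
print(f"baseline:   {baseline / GIB:.2f} GiB")    # 4.00 GiB
print(f"compressed: {compressed / GIB:.2f} GiB")  # 0.67 GiB
```

Because the cache grows linearly with both context length and batch size, it can rival the model weights in memory use on long-context or high-concurrency workloads, which is why it attracts compression work.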
The announcement prompted a swift market reaction: shares of SK Hynix and Samsung dropped 6% and nearly 5%, respectively, in South Korean trading on Thursday. Kioxia, Japan's third-largest memory maker, fell nearly 6%. In the U.S., SanDisk and Micron declined on Wednesday and continued lower in premarket trading Thursday.
Market Context
Memory stocks had experienced extraordinary gains prior to the announcement. Samsung shares rose nearly 200% over the preceding year, while Micron and SK Hynix gained more than 300%—driven by sustained demand for AI training and inference infrastructure alongside constrained supply.
Matthew Prince, CEO of Cloudflare, characterized the development as "Google's DeepSeek," referencing Chinese AI firm DeepSeek's efficiency breakthroughs last year that triggered a broader tech market correction. Prince noted significant optimization potential across "speed, memory usage, power consumption, and multi-tenant utilization."
Analyst Pushback
However, skepticism tempered immediate concerns. Ray Wang, memory analyst at SemiAnalysis, argued that removing the key-value cache bottleneck would make hardware and models more capable rather than reduce the need for hardware. "When you address a bottleneck, you help AI hardware be more capable. When the model becomes more powerful, you require better hardware to support it," Wang told CNBC.
Ben Barringer, head of technology research at Quilter Cheviot, characterized the selloff as profit-taking in a sector already primed to de-risk. "Memory stocks have had a very strong run and this is a highly cyclical sector. The Google TurboQuant innovation has added to the pressure, but this is evolutionary, not revolutionary. It does not alter the industry's long-term demand picture."
Analysts noted that the key-value cache had become a recognized bottleneck for model performance and hardware efficiency, which made it a natural target for researchers to optimize.
What This Means
TurboQuant represents genuine progress on AI efficiency, but it is more likely to accelerate memory demand than to constrain it. Each efficiency gain frees capacity for more complex models, longer context windows, and larger inference deployments, all of which are memory-intensive. The near-term market reaction looks more like profit-taking in overheated memory stocks than fundamental demand destruction. Over the long term, supply constraints and successive model improvements will likely dominate memory demand.
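The argument that efficiency gains get spent on longer contexts can be made concrete with a small calculation: hold the memory budget fixed and ask how many tokens of KV cache fit before and after compression. This is a rough sketch under stated assumptions; the model shape (32 layers, 8 KV heads, head dimension 128, 16-bit values), the 4 GiB budget, and the helper name `max_context_tokens` are all hypothetical, and the ratio of 6 is simply the upper bound Google claims.

```python
def max_context_tokens(budget_bytes, num_layers, num_kv_heads, head_dim,
                       bytes_per_value=2, compression_ratio=1):
    """Tokens of KV cache that fit in a fixed memory budget.

    compression_ratio=1 means an uncompressed cache; a ratio of 6 means
    each token's keys and values take one sixth of the space.
    """
    # Uncompressed bytes per token: keys + values across all layers and heads.
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_value
    # Integer arithmetic avoids floating-point rounding at the boundary.
    return (budget_bytes * compression_ratio) // per_token


BUDGET = 4 * 1024 ** 3  # hypothetical 4 GiB set aside for the KV cache

before = max_context_tokens(BUDGET, 32, 8, 128)
after = max_context_tokens(BUDGET, 32, 8, 128, compression_ratio=6)

print(before)  # 32768
print(after)   # 196608
```

The same chips hold six times as much context in this toy case, so an operator can serve longer prompts or more concurrent users instead of buying less memory, which is the dynamic the analysts describe.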