Google's TurboQuant compression cuts LLM memory needs by 6x, sparks memory chip stock selloff
Google unveiled TurboQuant, a compression technique that cuts the memory needed to run large language models by a factor of six by optimizing key-value cache storage. Shares of memory chipmakers Samsung and SK Hynix fell 5-6%, with Micron also declining, on concern the efficiency gain could reduce future chip demand. Analysts say the drop reflects profit-taking rather than a fundamental shift, since more powerful models will eventually require more advanced hardware.
Google's TurboQuant Compression Cuts LLM Memory Needs by 6x, Roils Memory Chip Markets
Google's new compression method claims a six-fold reduction in memory requirements for large language models, triggering sharp selloffs in major memory chip manufacturers on concerns about reduced demand.
On Tuesday, Google unveiled TurboQuant, a compression technique targeting the key-value (KV) cache, the store of intermediate attention results that lets a model avoid recomputing earlier tokens. The company claims the method reduces total memory footprint by up to six times, a direct lever on inference cost because the cache grows with every token of context.
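To see why compressing the KV cache matters, consider a minimal sketch of naive low-bit cache quantization. This is an illustrative toy example, not Google's TurboQuant method (which is reportedly lossless and built on vector quantization); it only shows where the savings come from when 16-bit cache entries are replaced by small integer codes. The tensor shapes and helper names below are hypothetical.

```python
# Toy illustration only: naive per-head 4-bit quantization of a KV cache.
# Not Google's TurboQuant; this just shows how shrinking cache entries
# reduces the memory that grows with context length.
import numpy as np

def quantize_kv(cache: np.ndarray, bits: int = 4):
    """Quantize a (layers, heads, seq_len, head_dim) cache per head."""
    c = cache.astype(np.float32)
    levels = 2 ** bits - 1
    lo = c.min(axis=(-2, -1), keepdims=True)   # per-head minimum
    hi = c.max(axis=(-2, -1), keepdims=True)   # per-head maximum
    scale = np.maximum(hi - lo, 1e-8) / levels
    codes = np.round((c - lo) / scale).astype(np.uint8)  # values 0..15 for 4-bit
    return codes, scale, lo

def dequantize_kv(codes, scale, lo):
    """Recover an approximate float16 cache from the integer codes."""
    return (codes.astype(np.float32) * scale + lo).astype(np.float16)

# Tiny shapes for demonstration; a real multi-thousand-token cache for a
# large model runs to gigabytes in float16, so bits per entry matter.
kv = np.random.randn(2, 4, 16, 8).astype(np.float16)
codes, scale, lo = quantize_kv(kv)
approx = dequantize_kv(codes, scale, lo)
print("max reconstruction error:", float(np.abs(kv - approx).max()))
```

Packing two 4-bit codes per byte (plus per-head scale and offset) would cut a float16 cache roughly fourfold; the six-fold figure Google claims presumably comes from the more sophisticated, lossless vector-quantization machinery its researchers describe.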
The announcement prompted immediate market reaction: shares of SK Hynix and Samsung dropped 6% and nearly 5% respectively in South Korean trading on Thursday. Kioxia, Japan's third-largest memory maker, fell nearly 6%. In the U.S., SanDisk and Micron declined on Wednesday and continued lower in premarket trading Thursday.
Market Context
Memory stocks had experienced extraordinary gains prior to the announcement. Samsung shares rose nearly 200% over the preceding year, while Micron and SK Hynix gained more than 300%—driven by sustained demand for AI training and inference infrastructure alongside constrained supply.
Matthew Prince, CEO of Cloudflare, characterized the development as "Google's DeepSeek," referencing Chinese AI firm DeepSeek's efficiency breakthroughs last year that triggered a broader tech market correction. Prince noted significant optimization potential across "speed, memory usage, power consumption, and multi-tenant utilization."
Analyst Pushback
However, skepticism tempered immediate concerns. Ray Wang, memory analyst at SemiAnalysis, argued that eliminating key-value cache bottlenecks would enable more capable hardware and models, not less. "When you address a bottleneck, you help AI hardware be more capable. When the model becomes more powerful, you require better hardware to support it," Wang told CNBC.
Ben Barringer, head of technology research at Quilter Cheviot, characterized the selloff as profit-taking in a sector already primed to de-risk. "Memory stocks have had a very strong run and this is a highly cyclical sector. The Google TurboQuant innovation has added to the pressure, but this is evolutionary, not revolutionary. It does not alter the industry's long-term demand picture."
Analysts noted that the key-value cache had become a recognized bottleneck for model performance and hardware efficiency, making TurboQuant's optimization a natural engineering problem for researchers to tackle.
What This Means
TurboQuant represents genuine progress on AI efficiency but likely accelerates rather than constrains memory demand. Each efficiency improvement creates headroom for more complex models, longer context windows, and scaled inference deployments, all of which are memory-intensive. The near-term market reaction looks like profit-taking in overheated memory stocks rather than fundamental demand destruction. Over the long term, supply constraints and successive model improvements will likely dominate memory demand dynamics.
Related Articles
Google's TurboQuant cuts AI inference memory by 6x using lossless compression
Google Research unveiled TurboQuant, a lossless memory compression algorithm that reduces AI inference working memory (the KV cache) by at least 6x without impacting model performance. The technology uses a vector quantization method called PolarQuant and an optimization technique called QJL. The findings will be presented at ICLR 2026.
ByteDance study: reasoning models know when to stop, but sampling methods force continued thinking
A new ByteDance study finds that large reasoning models actually know when they have reached the correct answer, but common sampling methods prevent them from stopping. The models engage in unnecessary cross-checking and reformulation despite having already solved the problem correctly.
Google DeepMind argues chatbot ethics require same rigor as coding benchmarks
Google DeepMind is pushing for moral behavior in large language models to be evaluated with the same technical rigor applied to coding and math benchmarks. As LLMs take on roles like companions, therapists, and medical advisors, the research group argues current evaluation standards are insufficient.
Apple's RubiCap model generates better image captions with 3-7B parameters than 72B competitors
Apple researchers developed RubiCap, a framework for training dense image captioning models that achieve state-of-the-art results at 2B, 3B, and 7B parameter scales. The 7B model outperforms models up to 72 billion parameters on multiple benchmarks including CapArena and CaptionQA, while the 3B variant matches larger 32B models, suggesting efficient dense captioning doesn't require massive scale.