quantization

24 articles tagged with quantization

June 17, 2026
model releaseGoogle DeepMind

NVIDIA Releases Quantized DiffusionGemma 26B: 1,100+ Tokens/Second with 256K Context Window

NVIDIA released a quantized version of Google DeepMind's DiffusionGemma 26B A4B IT, a multimodal model with 25.2B total parameters (3.8B active) that processes text, image, and video inputs. The NVFP4-quantized model achieves generation speeds exceeding 1,100 tokens per second on NVIDIA H100 GPUs while supporting a 256K token context window.

June 9, 2026

Google DeepMind Releases Quantization-Aware Training Versions of Gemma 4 Models in GGUF Format

Google DeepMind has released quantization-aware training (QAT) optimized versions of its Gemma 4 model family in GGUF Q4_0 format. The QAT versions preserve similar quality to bfloat16 while dramatically reducing memory requirements, with models available across the entire Gemma 4 lineup: E2B, E4B, 12B, 26B A4B, and 31B.

June 3, 2026
analysis

Ideogram AI releases FP8-quantized image generation model on Hugging Face alongside Google's Gemma-4-12B text models

Three new models appeared on Hugging Face: Ideogram AI's FP8-quantized version of its Ideogram-4 image generation model and Google's Gemma-4-12B text models in both base and instruction-tuned variants. The releases mark continued expansion of model availability through Hugging Face's platform.

June 2, 2026
model release

H Company Ships Holo3.1 with Local Inference, Mobile Support, and 79.3% AndroidWorld Score

H Company released Holo3.1, a computer-use agent model family ranging from 0.8B to 35B parameters. The 35B-A3B variant scores 79.3% on AndroidWorld, up from 67% in Holo3. For the first time, H Company ships quantized checkpoints (FP8, Q4 GGUF, NVFP4) enabling local inference with 1.74× throughput gains and sub-4-second agent step times.

June 1, 2026
model releaseStepFun+1

StepFun Releases Step-3.7-Flash: 198B-Parameter Sparse MoE Model With 256K Context in GGUF Format

StepFun has released Step-3.7-Flash, a 198B-parameter sparse Mixture-of-Experts vision-language model that activates approximately 11B parameters per token. The model supports a 256K context window, native image understanding via a 1.8B-parameter vision encoder, and offers three selectable reasoning levels.

May 29, 2026
model releaseLiquid Ai

Liquid AI Releases LFM2.5-8B: 8-Billion Parameter Hybrid Model Optimized for Edge Deployment

Liquid AI has released LFM2.5-8B-A1B, an 8-billion parameter hybrid model designed specifically for edge AI and on-device deployment. The model is available in multiple GGUF quantized formats ranging from 4-bit (4.84 GB) to 16-bit (16.9 GB), optimized for memory efficiency.

May 23, 2026
model releaseTencent

Tencent Releases Hy-MT2 Translation Models: 1.8B, 7B, and 30B-A3B Support 33 Languages

Tencent released Hy-MT2, a family of multilingual translation models available in 1.8B, 7B, and 30B-A3B (MoE) sizes. All models support translation among 33 languages and follow translation instructions in multiple languages. The 1.8B model can be compressed to 440MB using 1.25-bit AngelSlim quantization.

May 22, 2026
model releaseTencent

Tencent Releases Hy-MT2: 1.8B Translation Model Compressed to 440MB With 1.25-Bit Quantization

Tencent has open-sourced Hy-MT2, a family of multilingual translation models available in 1.8B, 7B, and 30B-A3B parameter sizes. The models support translation across 33 languages and include extreme quantization down to 1.25-bit, reducing the 1.8B model to 440MB storage while increasing inference speed by 1.5x.

May 20, 2026
model releaseCohere

Cohere Releases Command A+: 218B-Parameter MoE Model With 4-Bit Quantization Runs on Single B200 GPU

Cohere has released Command A+, an open-source sparse mixture-of-experts model with 218 billion total parameters and 25 billion active parameters. The model features W4A4 quantization allowing deployment on a single Nvidia B200 GPU, supports 128K input context, and includes built-in chain-of-thought reasoning with vision capabilities.

May 5, 2026
model releaseIbm

IBM releases Apache 2.0 Granite 4.1 LLMs in 3B, 8B, and 30B sizes

IBM has released the Granite 4.1 family of language models under Apache 2.0 license. The models come in 3B, 8B, and 30B parameter sizes. Unsloth has released 21 GGUF quantized variants of the 3B model ranging from 1.2GB to 6.34GB.

April 22, 2026
analysis

Qwen 3.6 27B Released With FP8 Quantization, OpenAI Deploys Privacy Filter Model

Alibaba Cloud released Qwen 3.6 27B, a 27-billion parameter language model, alongside an FP8 quantized version for deployment efficiency. Separately, OpenAI published a privacy filter model on Hugging Face, marking a rare public model release from the company.

model release

Alibaba Qwen Releases 27B Parameter Model That Claims to Match 397B Performance on Coding Tasks

Alibaba Qwen released Qwen3.6-27B, a 27B parameter dense model that claims flagship-level coding performance surpassing their previous 397B MoE model across major coding benchmarks. The full model is 55.6GB compared to 807GB for the predecessor.

model release

Gemma 4 VLA runs locally on NVIDIA Jetson Orin Nano Super with 8GB RAM, autonomous webcam tool-calling

NVIDIA engineer Asier Arranz demonstrated Gemma 4 running as a vision-language agent (VLA) on a Jetson Orin Nano Super with 8GB RAM. The model autonomously decides when to access a webcam based on user queries, with no hardcoded triggers—performing speech-to-text, vision analysis, and text-to-speech entirely locally.

April 16, 2026
benchmark

Qwen3.6-35B-A3B Outperforms Claude Opus 4.7 on SVG Generation Test

In an informal SVG generation benchmark, Alibaba's Qwen3.6-35B-A3B model running locally via a 20.9GB quantized version outperformed Anthropic's newly released Claude Opus 4.7. The test, which asked models to generate SVG illustrations of pelicans and flamingos on bicycles, showed the smaller local model producing more accurate bicycle frames and more creative outputs.

April 9, 2026
model releaseZhipu AI

GLM-5.1 released: 754B agentic model outperforms Claude on coding benchmarks

Zhipu AI released GLM-5.1, a 754-parameter model optimized for agentic engineering tasks. The model scores 58.4% on SWE-Bench Pro, outperforming Claude 3.5 Sonnet (57.3%), and demonstrates sustained reasoning capability over hundreds of iterations.

April 4, 2026
analysis

Tencent releases HY-OmniWeaving multimodal model as Gemma-4 variants emerge

Tencent has released HY-OmniWeaving, a new multimodal model available on Hugging Face. Concurrently, NVIDIA and Unsloth have published optimized variants of Gemma-4, including a 31B instruction-tuned version and quantized GGUF format.

model release

PrismML releases 1-bit Bonsai 8B model, claims 14x smaller and 5x more energy efficient than full-precision peers

PrismML, a Caltech-founded startup, has released Bonsai 8B, a 1-bit quantized large language model that the company claims is 14x smaller and 5x more energy efficient than full-precision counterparts while remaining competitive with standard 8B models. The model fits into 1.15GB of memory and uses a novel 1-bit weight representation (binary signs with shared scale factors per weight group) instead of traditional 16-bit or 32-bit precision.

model releaseGoogle DeepMind

NVIDIA releases Gemma 4 31B quantized model with 256K context, multimodal capabilities

NVIDIA has released a quantized version of Google DeepMind's Gemma 4 31B IT model, compressed to NVFP4 format for efficient inference on consumer GPUs. The 30.7B-parameter multimodal model supports 256K token context windows, handles text and image inputs with video frame processing, and maintains near-baseline performance across reasoning and coding benchmarks.

April 1, 2026
research

Google's TurboQuant compresses AI memory use by 6x, but won't ease DRAM shortage

Google has unveiled TurboQuant, a KV cache quantization technology that claims to reduce memory consumption during AI inference by up to 6x by compressing data from 16-bit precision to as low as 2.5 bits. While the compression technique delivers meaningful efficiency gains for inference providers, it is unlikely to resolve the DRAM shortage that has driven memory prices to record highs, as expanding context windows offset memory savings.

March 25, 2026
research

Google's TurboQuant cuts AI inference memory by 6x using lossless compression

Google Research unveiled TurboQuant, a lossless memory compression algorithm that reduces AI inference working memory (KV cache) by at least 6x without impacting model performance. The technology uses vector quantization methods called PolarQuant and an optimization technique called QJL. Findings will be presented at ICLR 2026.

March 24, 2026
changelogStability AI

Stable Diffusion 3.5 TensorRT optimization delivers 2x faster generation, 40% less VRAM on RTX GPUs

Stability AI has released TensorRT-optimized versions of the Stable Diffusion 3.5 model family in collaboration with NVIDIA. The optimization uses FP8 quantization to achieve 2x faster generation speed and 40% lower VRAM requirements on supported RTX GPUs.

March 12, 2026
model releaseNVIDIA

NVIDIA releases Nemotron-3-Super-120B, a 120B parameter model with latent MoE architecture

NVIDIA has released Nemotron-3-Super-120B-A12B-NVFP4, a 120-billion parameter text generation model featuring a latent Mixture-of-Experts (MoE) architecture. The model supports 8 languages including English, French, Spanish, Italian, German, Japanese, and Chinese, and is available on Hugging Face with 8-bit quantization support through NVIDIA's ModelOpt toolkit.

March 1, 2026
model release

Alibaba releases Qwen3.5-35B-A3B-FP8, a quantized multimodal model for efficient deployment

Alibaba's Qwen team released Qwen3.5-35B-A3B-FP8 on Hugging Face, a quantized version of their 35-billion parameter multimodal model. The FP8 quantization reduces model size and memory requirements while maintaining the base model's image-text-to-text capabilities. The model is compatible with standard Transformers endpoints and Azure deployment.

February 20, 2026
product update

Taalas serves Llama 3.1 8B at 17,000 tokens/second with custom silicon

Taalas, a new Canadian hardware startup, announced its first product: a custom silicon implementation of Meta's Llama 3.1 8B model running at 17,000 tokens/second. The startup uses aggressive quantization combining 3-bit and 6-bit parameters. The system is accessible via chatjimmy.ai.