quantization

15 articles tagged with quantization

May 5, 2026
model release · IBM

IBM releases Apache 2.0 Granite 4.1 LLMs in 3B, 8B, and 30B sizes

IBM has released the Granite 4.1 family of language models under Apache 2.0 license. The models come in 3B, 8B, and 30B parameter sizes. Unsloth has released 21 GGUF quantized variants of the 3B model ranging from 1.2GB to 6.34GB.

April 22, 2026
analysis

Qwen 3.6 27B Released With FP8 Quantization, OpenAI Deploys Privacy Filter Model

Alibaba Cloud released Qwen 3.6 27B, a 27-billion parameter language model, alongside an FP8 quantized version for deployment efficiency. Separately, OpenAI published a privacy filter model on Hugging Face, marking a rare public model release from the company.

model release

Alibaba Qwen Releases 27B Parameter Model That Claims to Match 397B Performance on Coding Tasks

Alibaba Qwen released Qwen3.6-27B, a 27B parameter dense model that claims flagship-level coding performance surpassing their previous 397B MoE model across major coding benchmarks. The full model is 55.6GB compared to 807GB for the predecessor.

model release

Gemma 4 VLA runs locally on NVIDIA Jetson Orin Nano Super with 8GB RAM, autonomous webcam tool-calling

NVIDIA engineer Asier Arranz demonstrated Gemma 4 running as a vision-language agent (VLA) on a Jetson Orin Nano Super with 8GB RAM. The model autonomously decides when to access a webcam based on user queries, with no hardcoded triggers, and performs speech-to-text, vision analysis, and text-to-speech entirely locally.

April 16, 2026
benchmark

Qwen3.6-35B-A3B Outperforms Claude Opus 4.7 on SVG Generation Test

In an informal SVG generation benchmark, Alibaba's Qwen3.6-35B-A3B model running locally via a 20.9GB quantized version outperformed Anthropic's newly released Claude Opus 4.7. The test, which asked models to generate SVG illustrations of pelicans and flamingos on bicycles, showed the smaller local model producing more accurate bicycle frames and more creative outputs.

April 9, 2026
model release · Zhipu AI

GLM-5.1 released: 754B agentic model outperforms Claude on coding benchmarks

Zhipu AI released GLM-5.1, a 754B-parameter model optimized for agentic engineering tasks. The model scores 58.4% on SWE-Bench Pro, outperforming Claude 3.5 Sonnet (57.3%), and demonstrates sustained reasoning capability over hundreds of iterations.

April 4, 2026
analysis

Tencent releases HY-OmniWeaving multimodal model as Gemma-4 variants emerge

Tencent has released HY-OmniWeaving, a new multimodal model available on Hugging Face. Concurrently, NVIDIA and Unsloth have published optimized variants of Gemma-4, including a 31B instruction-tuned version and quantized GGUF format.

model release

PrismML releases 1-bit Bonsai 8B model, claims 14x smaller and 5x more energy efficient than full-precision peers

PrismML, a Caltech-founded startup, has released Bonsai 8B, a 1-bit quantized large language model that the company claims is 14x smaller and 5x more energy efficient than full-precision counterparts while remaining competitive with standard 8B models. The model fits into 1.15GB of memory and uses a novel 1-bit weight representation (binary signs with shared scale factors per weight group) instead of traditional 16-bit or 32-bit precision.
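The scheme described above (binary signs plus a shared scale factor per weight group) can be sketched in a few lines. The group size, scale choice (mean absolute value), and function names below are illustrative assumptions, not PrismML's published implementation. Note the arithmetic checks out: with 128-weight groups and 16-bit scales, the effective rate is 1 + 16/128 ≈ 1.13 bits per weight, about 14x below fp16, consistent with the size claim.

```python
import numpy as np

def quantize_1bit(weights: np.ndarray, group_size: int = 128):
    """1 bit per weight: keep only each weight's sign, plus one shared
    scale (mean absolute value) per group of `group_size` weights."""
    groups = weights.reshape(-1, group_size)
    scales = np.abs(groups).mean(axis=1, keepdims=True)  # one scale per group
    signs = np.sign(groups)
    signs[signs == 0] = 1  # map exact zeros to +1 so every weight is 1 bit
    return signs.astype(np.int8), scales

def dequantize_1bit(signs: np.ndarray, scales: np.ndarray) -> np.ndarray:
    """Reconstruct approximate weights as sign * group scale."""
    return (signs * scales).reshape(-1)

rng = np.random.default_rng(0)
w = rng.standard_normal(1024).astype(np.float32)
signs, scales = quantize_1bit(w)
w_hat = dequantize_1bit(signs, scales)
```

Because every weight in a group collapses to ±scale, accuracy depends heavily on group size: smaller groups track the weight distribution more closely but spend more bits on scales.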

model release · Google DeepMind

NVIDIA releases Gemma 4 31B quantized model with 256K context, multimodal capabilities

NVIDIA has released a quantized version of Google DeepMind's Gemma 4 31B IT model, compressed to NVFP4 format for efficient inference on consumer GPUs. The 30.7B-parameter multimodal model supports 256K token context windows, handles text and image inputs with video frame processing, and maintains near-baseline performance across reasoning and coding benchmarks.

April 1, 2026
research

Google's TurboQuant compresses AI memory use by 6x, but won't ease DRAM shortage

Google has unveiled TurboQuant, a KV cache quantization technology that claims to reduce memory consumption during AI inference by up to 6x by compressing data from 16-bit precision to as low as 2.5 bits. While the compression technique delivers meaningful efficiency gains for inference providers, it is unlikely to resolve the DRAM shortage that has driven memory prices to record highs, as expanding context windows offset memory savings.
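The headline 6x figure follows from the bit widths alone: compressing 16-bit values to 2.5 bits gives 16 / 2.5 = 6.4x. A rough KV cache footprint calculation illustrates this; the layer, head, and context sizes below are assumed for illustration, not taken from the article.

```python
def kv_cache_bytes(layers: int, heads: int, head_dim: int,
                   seq_len: int, bits: float) -> float:
    """Approximate KV cache footprint: keys + values stored for every
    layer, head, and token position, at `bits` bits per element."""
    elements = 2 * layers * heads * head_dim * seq_len  # factor 2: K and V
    return elements * bits / 8

# Illustrative 30B-class config with a 128K context (assumed values).
cfg = dict(layers=48, heads=8, head_dim=128, seq_len=128_000)
base = kv_cache_bytes(**cfg, bits=16)
compressed = kv_cache_bytes(**cfg, bits=2.5)
print(f"{base / 2**30:.1f} GiB -> {compressed / 2**30:.1f} GiB "
      f"({base / compressed:.1f}x)")
```

The savings ratio is independent of model size, which is also why longer context windows eat into the absolute savings: the cache still grows linearly with sequence length.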

March 25, 2026
research

Google's TurboQuant cuts AI inference memory by 6x using lossless compression

Google Research unveiled TurboQuant, a lossless memory compression algorithm that reduces AI inference working memory (KV cache) by at least 6x without impacting model performance. The technology uses vector quantization methods called PolarQuant and an optimization technique called QJL. Findings will be presented at ICLR 2026.

March 24, 2026
changelog · Stability AI

Stable Diffusion 3.5 TensorRT optimization delivers 2x faster generation, 40% less VRAM on RTX GPUs

Stability AI has released TensorRT-optimized versions of the Stable Diffusion 3.5 model family in collaboration with NVIDIA. The optimization uses FP8 quantization to achieve 2x faster generation speed and 40% lower VRAM requirements on supported RTX GPUs.

March 12, 2026
model release · NVIDIA

NVIDIA releases Nemotron-3-Super-120B, a 120B parameter model with latent MoE architecture

NVIDIA has released Nemotron-3-Super-120B-A12B-NVFP4, a 120-billion parameter text generation model featuring a latent Mixture-of-Experts (MoE) architecture. The model supports 8 languages including English, French, Spanish, Italian, German, Japanese, and Chinese, and is available on Hugging Face with 8-bit quantization support through NVIDIA's ModelOpt toolkit.

March 1, 2026
model release

Alibaba releases Qwen3.5-35B-A3B-FP8, a quantized multimodal model for efficient deployment

Alibaba's Qwen team released Qwen3.5-35B-A3B-FP8 on Hugging Face, a quantized version of their 35-billion parameter multimodal model. The FP8 quantization reduces model size and memory requirements while maintaining the base model's image-text-to-text capabilities. The model is compatible with standard Transformers endpoints and Azure deployment.

February 20, 2026
product update

Taalas serves Llama 3.1 8B at 17,000 tokens/second with custom silicon

Taalas, a new Canadian hardware startup, announced its first product: a custom silicon implementation of Meta's Llama 3.1 8B model running at 17,000 tokens/second. The startup uses aggressive quantization combining 3-bit and 6-bit parameters. The system is accessible via chatjimmy.ai.
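Taalas has not published its quantization scheme, but mixing 3-bit and 6-bit parameters usually means spending the extra bits on the most sensitive weights. A generic sketch of that idea follows; the group size, sensitivity heuristic (per-group dynamic range), and 10% high-precision fraction are all illustrative assumptions, not Taalas's method.

```python
import numpy as np

def mixed_precision_quantize(w: np.ndarray, group_size: int = 64,
                             high_bits: int = 6, low_bits: int = 3,
                             high_frac: float = 0.1) -> np.ndarray:
    """Sketch of mixed-precision group quantization: the fraction of
    groups with the widest dynamic range keeps `high_bits`; the rest
    are quantized more aggressively at `low_bits`. Returns the
    dequantized approximation for error inspection."""
    groups = w.reshape(-1, group_size)
    spread = groups.max(axis=1) - groups.min(axis=1)
    cutoff = np.quantile(spread, 1 - high_frac)  # top groups get more bits
    out = np.empty_like(groups)
    for i, g in enumerate(groups):
        bits = high_bits if spread[i] >= cutoff else low_bits
        levels = 2**bits - 1
        lo, hi = g.min(), g.max()
        scale = (hi - lo) / levels if hi > lo else 1.0
        codes = np.round((g - lo) / scale)  # integer codes in [0, levels]
        out[i] = codes * scale + lo         # dequantize
    return out.reshape(w.shape)

rng = np.random.default_rng(0)
w = rng.standard_normal(4096).astype(np.float32)
w_hat = mixed_precision_quantize(w)
```

The average bit rate here is 0.9 × 3 + 0.1 × 6 = 3.3 bits per weight, so most of the model pays the 3-bit price while outlier-heavy groups keep enough resolution to limit quantization error.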