model release

Alibaba releases Qwen3.5-35B-A3B-FP8, a quantized multimodal model for efficient deployment

TL;DR

Alibaba's Qwen team released Qwen3.5-35B-A3B-FP8 on Hugging Face, a quantized version of their 35-billion parameter multimodal model. The FP8 quantization reduces model size and memory requirements while maintaining the base model's image-text-to-text capabilities. The model is compatible with standard Transformers endpoints and Azure deployment.

1 min read
0

Alibaba Releases FP8-Quantized Qwen3.5-35B Multimodal Model

Alibaba's Qwen team has released Qwen3.5-35B-A3B-FP8, an FP8-quantized variant of their 35-billion parameter multimodal model, now available on Hugging Face.

Key Specifications

Qwen3.5-35B-A3B-FP8 is a quantized version of the base Qwen3.5-35B-A3B model, applying 8-bit floating-point quantization to reduce memory footprint and enable faster inference. The model maintains the multimodal capabilities of its parent, supporting image-text-to-text tasks including image understanding and conversational interactions combining visual and textual inputs.

The quantized variant is built on Qwen's Mixture-of-Experts (MoE) architecture, as indicated by the qwen3_5_moe tag. Specific parameter counts for the active model during inference and total MoE parameters are not publicly disclosed.

Deployment and Compatibility

The model is compatible with Hugging Face Transformers pipelines and standard endpoints. Alibaba explicitly lists Azure deployment support, indicating enterprise readiness. The model uses SafeTensors format for efficient loading and distributed across regions including US deployment endpoints.

The release is licensed under Apache 2.0, permitting commercial and research use with standard attribution requirements.

Community Adoption

As of the release date, the model had accumulated 157,725 downloads and 60 community likes on Hugging Face, indicating active interest from developers and researchers building with quantized multimodal systems.

What This Means

Qwen3.5-35B-A3B-FP8 addresses a practical constraint in deploying large multimodal models: memory and compute efficiency. FP8 quantization typically reduces model size by 50% compared to FP16 with minimal accuracy loss, making this variant accessible for deployment on consumer GPUs and cost-constrained cloud infrastructure. The explicit Azure compatibility signals Alibaba's push into enterprise deployment markets where Microsoft partnerships matter. For teams evaluating multimodal models between 30-40B parameters, this quantized release offers a memory-efficient option alongside full-precision variants without requiring specialized quantization expertise.

Related Articles

model release

Alibaba's Qwen Releases Qwen3.7 Plus: 1M Context Window at $0.40 Per Million Input Tokens

Alibaba's Qwen has released Qwen3.7 Plus, a multimodal model with a 1 million token context window. The model accepts text and image input with text output, priced at $0.40 per million input tokens and $1.60 per million output tokens through OpenRouter's API.

model release

Ideogram 4: 9.3B parameter open-weight text-to-image model with native 2K resolution and structured JSON prompting

Ideogram has released Ideogram 4, its first open-weight text-to-image model with 9.3 billion parameters. The model supports native 2K resolution, structured JSON prompting with bounding-box layout controls, and is available in nf4 and fp8 quantizations under a non-commercial license.

model release

Microsoft releases MAI-Thinking-1, its first reasoning AI model trained without third-party distillation

Microsoft announced MAI-Thinking-1, its first advanced reasoning AI model, at Build 2026. The company claims it's a medium-sized model matching leading models on key software engineering benchmarks, trained from scratch without distillation from third-party models.

model release

Nvidia releases Nemotron 3 Ultra: 550B-parameter MoE model with 1M context window for agentic workflows

Nvidia has released Nemotron 3 Ultra, a 550-billion parameter mixture-of-experts model with 55 billion active parameters and support for up to 1 million token context windows. The model uses a hybrid Transformer-Mamba architecture and is designed specifically for long-running agentic workflows including agent orchestration, coding agents, and complex enterprise tasks.

Comments

Loading...