
Alibaba releases Qwen3.5-35B-A3B-FP8, a quantized multimodal model for efficient deployment

Alibaba's Qwen team released Qwen3.5-35B-A3B-FP8 on Hugging Face, a quantized version of their 35-billion parameter multimodal model. The FP8 quantization reduces model size and memory requirements while maintaining the base model's image-text-to-text capabilities. The model is compatible with standard Transformers endpoints and Azure deployment.


Alibaba Releases FP8-Quantized Qwen3.5-35B Multimodal Model

Alibaba's Qwen team has released Qwen3.5-35B-A3B-FP8, an FP8-quantized variant of their 35-billion parameter multimodal model, now available on Hugging Face.

Key Specifications

Qwen3.5-35B-A3B-FP8 is a quantized version of the base Qwen3.5-35B-A3B model, applying 8-bit floating-point quantization to reduce memory footprint and enable faster inference. The model maintains the multimodal capabilities of its parent, supporting image-text-to-text tasks including image understanding and conversational interactions combining visual and textual inputs.

The quantized variant is built on Qwen's Mixture-of-Experts (MoE) architecture, as indicated by the qwen3_5_moe tag. Specific figures for the parameters active per token and the total MoE parameter count have not been publicly disclosed.

Deployment and Compatibility

The model is compatible with Hugging Face Transformers pipelines and standard inference endpoints. Alibaba explicitly lists Azure deployment support, indicating enterprise readiness. The model uses the SafeTensors format for efficient loading and is distributed across regions, including US deployment endpoints.
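Since the model is advertised as compatible with standard Transformers pipelines, loading it would presumably follow the usual image-text-to-text pattern. The sketch below is an assumption based on the standard Hugging Face API, not code from Alibaba's release notes; the repository id is inferred from the model name, and the image URL and generation arguments are illustrative placeholders.

```python
# Hypothetical sketch: loading the model via the Hugging Face
# "image-text-to-text" pipeline. Repo id inferred from the model name;
# generation arguments are illustrative, not from official docs.
from transformers import pipeline

MODEL_ID = "Qwen/Qwen3.5-35B-A3B-FP8"

# Chat-style message combining an image with a text question, the
# standard input format for image-text-to-text pipelines.
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://example.com/photo.jpg"},
            {"type": "text", "text": "Describe this image."},
        ],
    }
]

def describe_image():
    # Loads a multi-gigabyte checkpoint; requires sufficient GPU memory.
    pipe = pipeline("image-text-to-text", model=MODEL_ID, device_map="auto")
    return pipe(text=messages, max_new_tokens=128)

if __name__ == "__main__":
    print(describe_image())
```

Because FP8 weights ship in SafeTensors, no separate quantization step should be needed at load time on hardware with native FP8 support.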

The release is licensed under Apache 2.0, permitting commercial and research use with standard attribution requirements.

Community Adoption

As of the release date, the model had accumulated 157,725 downloads and 60 community likes on Hugging Face, indicating active interest from developers and researchers building with quantized multimodal systems.

What This Means

Qwen3.5-35B-A3B-FP8 addresses a practical constraint in deploying large multimodal models: memory and compute efficiency. FP8 quantization typically halves model size relative to FP16 with minimal accuracy loss, making this variant viable on consumer GPUs and cost-constrained cloud infrastructure. The explicit Azure compatibility signals Alibaba's push into enterprise deployment markets where Microsoft partnerships matter. For teams evaluating multimodal models in the 30-40B parameter range, this quantized release offers a memory-efficient option alongside full-precision variants, without requiring specialized quantization expertise.
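The 50% figure follows directly from the byte widths of the two formats: FP16 stores each parameter in 2 bytes, FP8 in 1. A back-of-the-envelope calculation for a 35B-parameter model (weights only, ignoring KV cache and activation overhead):

```python
# Rough weight-memory footprint of a 35B-parameter model at different
# precisions. Figures cover weights only; KV cache, activations, and
# runtime overhead are ignored, so real deployments need headroom.
PARAMS = 35e9

def weight_gb(bytes_per_param: float) -> float:
    """Weight memory in GB (1 GB = 1e9 bytes)."""
    return PARAMS * bytes_per_param / 1e9

fp16_gb = weight_gb(2.0)  # FP16: 2 bytes per parameter
fp8_gb = weight_gb(1.0)   # FP8: 1 byte per parameter

print(f"FP16 weights: {fp16_gb:.0f} GB, FP8 weights: {fp8_gb:.0f} GB")
```

At FP8 the weights fit in roughly 35 GB rather than 70 GB, which is the difference between needing multiple data-center GPUs and fitting on a single 40-48 GB card.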
