Alibaba Qwen Releases 35B Sparse MoE Model with 262K Context and Multimodal Support
Alibaba Cloud has released Qwen3.6-35B-A3B, an open-weight sparse mixture-of-experts model with 35 billion total parameters but only 3 billion active parameters per token. The model features a 262K native context window (expandable to 1M tokens), multimodal input support, and integrated reasoning mode with preserved thinking traces.
Qwen3.6 35B A3B — Quick Specs
Alibaba Qwen Releases 35B Sparse MoE Model with 262K Context and Multimodal Support
Alibaba Cloud has released Qwen3.6-35B-A3B, an open-weight sparse mixture-of-experts model with 35 billion total parameters but only 3 billion active parameters per token.
Architecture and Specifications
The model uses a hybrid sparse MoE architecture that combines Gated DeltaNet linear attention with standard gated attention layers, according to Alibaba. This design reduces computational requirements by activating only 3 billion parameters per token while maintaining the capacity of the full 35 billion parameter model.
Qwen3.6-35B-A3B supports a native context window of 262,144 tokens, extensible to 1 million tokens using YaRN (Yet another RoPE extensioN method). The model accepts text, image, and video inputs, making it a multimodal system.
Key Capabilities
The model includes:
- Reasoning mode: Integrated thinking capability with reasoning traces preserved across multi-turn conversations
- Function calling: Native support for tool use and function execution
- Structured output: Ability to generate formatted responses
- Multimodal processing: Handles text, images, and video inputs
Pricing and Availability
The model is available through OpenRouter at $0.1612 per million input tokens and $0.9653 per million output tokens. Alibaba has released it under the Apache 2.0 license, making the model weights freely available for commercial and research use.
The sparse MoE architecture positions Qwen3.6-35B-A3B as a cost-efficient alternative to dense models, as only 8.6% of parameters are active during inference.
What This Means
The sparse MoE approach with only 3B active parameters per token makes this 35B model competitive on inference cost with much smaller dense models while potentially retaining more knowledge capacity. The 262K native context window and multimodal capabilities make it suitable for document analysis and video understanding tasks. However, benchmark scores are not yet publicly available, making it difficult to assess performance relative to other models in its class. The Apache 2.0 license and availability through OpenRouter lower the barrier to adoption for developers seeking open-weight alternatives to proprietary models.
Related Articles
Nex AGI Releases Nex-N2-Pro: 17B Active Parameter MoE Model with 262K Context Window
Nex AGI has released Nex-N2-Pro, a mixture-of-experts model with 17 billion active parameters from a total of 397 billion parameters. Built on the Qwen3.5 architecture, the model offers a 262,144 token context window and is available for free through OpenRouter.
Nex AGI Releases Nex-N2-Pro: 397B Parameter MoE Model With 262K Context, Available Free
Nex AGI has released Nex-N2-Pro, an agentic mixture-of-experts model with 397B total parameters and 17B active parameters. The model features a 262K token context window and is available free via OpenRouter's API.
NVIDIA Releases Nemotron 3.5 Content Safety: 4B-Parameter Multimodal Model with Custom Policy Enforcement and 140-Langua
NVIDIA has released Nemotron 3.5 Content Safety, a 4B-parameter model built on Google Gemma 3 4B IT that provides multimodal safety classification across approximately 140 languages. The model includes a 128K context window, custom enterprise policy enforcement, auditable reasoning traces, and is releasing its training dataset.
Google DeepMind releases DiffusionGemma, a 26B parameter model generating 15-20 tokens per forward pass via discrete dif
Google DeepMind released DiffusionGemma, a 26B parameter mixture-of-experts model that generates text using discrete diffusion instead of autoregression. The model processes blocks of 256 tokens in parallel, achieving generation speeds exceeding 1100 tokens per second on H100 GPUs in low-batch settings.
Comments
Loading...