Alibaba Releases Qwen3.6-35B-A3B: 35B Parameter MoE Model with 262K Context Window
Alibaba has released Qwen3.6-35B-A3B, the first open-weight model in the Qwen3.6 series. The model features 35B total parameters with 3B activated (256 experts, 8 routed per token), a native 262K context window extensible to 1.01M tokens, and achieves 73.4% on SWE-bench Verified.
Alibaba has released Qwen3.6-35B-A3B, the first open-weight variant in the Qwen3.6 series. The model features 35 billion total parameters with 3 billion activated per forward pass, using a mixture-of-experts architecture with 256 experts.
Architecture Specifications
The model employs a distinctive architecture combining Gated DeltaNet and Gated Attention layers across 40 layers with a 2048 hidden dimension. The MoE configuration activates 8 experts plus 1 shared expert per token, with each expert having a 512 intermediate dimension.
Key specifications:
- Context window: 262,144 tokens natively, extensible to 1,010,000 tokens
- Token embedding: 248,320 (padded)
- Training: multi-token prediction (MTP)
- Architecture: 10 × (3 × (Gated DeltaNet → MoE) → 1 × (Gated Attention → MoE))
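The routing side of the configuration above (256 experts, 8 routed per token, plus 1 shared expert applied to every token) can be sketched as standard top-k gating. This is an illustrative reconstruction from the published numbers, not Alibaba's implementation; the gate weights and token batch here are random placeholders.

```python
import numpy as np

def route_tokens(hidden, gate_w, top_k=8):
    """Top-k MoE routing sketch: score each token against all experts,
    keep the k highest-scoring experts, and renormalize their weights.
    (A shared expert, not shown, would additionally process every token.)"""
    logits = hidden @ gate_w                                 # (tokens, num_experts)
    top_idx = np.argsort(logits, axis=-1)[:, -top_k:]        # indices of top-k experts
    top_logits = np.take_along_axis(logits, top_idx, axis=-1)
    # softmax over only the selected experts
    w = np.exp(top_logits - top_logits.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return top_idx, w

rng = np.random.default_rng(0)
hidden = rng.standard_normal((4, 2048))       # 4 tokens, hidden dim 2048 (per spec)
gate_w = rng.standard_normal((2048, 256))     # router over 256 experts (per spec)
idx, weights = route_tokens(hidden, gate_w)
assert idx.shape == (4, 8)                    # 8 routed experts per token
assert np.allclose(weights.sum(axis=-1), 1.0)
```

Because each expert has only a 512 intermediate dimension, activating 8 of 256 experts keeps the per-token compute near the stated 3B activated parameters despite the 35B total.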
Benchmark Performance
According to Alibaba, Qwen3.6-35B-A3B achieves substantial improvements in coding benchmarks:
Coding Agent Tasks:
- SWE-bench Verified: 73.4%
- SWE-bench Multilingual: 67.2%
- SWE-bench Pro: 49.5%
- Terminal-Bench 2.0: 51.5%
- Claw-Eval Average: 68.7%
Knowledge Benchmarks:
- MMLU-Pro: 85.2%
- MMLU-Redux: 93.3%
- C-Eval: 90.0%
STEM & Reasoning:
- GPQA: 86.0%
- LiveCodeBench v6: 80.4%
- AIME 2026: 92.7%
Vision Language:
- MMMU: 81.7%
- MathVista (mini): 86.4%
- RealWorldQA: 85.3%
- VideoMMMU: 83.7%
All benchmarks were run on the company's internal evaluation harness; the temperature and context-window settings used are disclosed in its documentation.
Technical Features
The model introduces "thinking preservation," which retains reasoning context from historical messages to reduce computational overhead during iterative development. Alibaba claims this enhances the model's performance on repository-level reasoning and frontend workflows.
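In multi-turn pipelines, assistant reasoning traces are conventionally stripped from earlier turns before the next request. A minimal sketch of what "thinking preservation" implies instead, keeping that reasoning in the history so it need not be re-derived; the `reasoning_content` field name and message shapes here are illustrative assumptions, not a documented API:

```python
def build_history(turns, preserve_thinking=True):
    """Assemble chat history for the next request.
    With preserve_thinking=False (the conventional behavior), assistant
    reasoning from earlier turns is dropped; with True, it is retained so
    the model can reuse intermediate conclusions across iterations.
    (Field names are illustrative, not a documented API.)"""
    history = []
    for msg in turns:
        msg = dict(msg)
        if msg["role"] == "assistant" and not preserve_thinking:
            msg.pop("reasoning_content", None)   # strip prior-turn reasoning
        history.append(msg)
    return history

turns = [
    {"role": "user", "content": "Refactor utils.py"},
    {"role": "assistant", "content": "Done.",
     "reasoning_content": "utils.py is imported by three test modules..."},
    {"role": "user", "content": "Now update the tests"},
]
kept = build_history(turns, preserve_thinking=True)
assert "reasoning_content" in kept[1]          # reasoning survives into the next turn
```

The trade-off is longer prompts in exchange for not recomputing repository-level analysis on every turn, which is where the claimed overhead reduction would come from.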
The architecture uses:
- Gated DeltaNet: 32 linear attention heads for V, 16 for QK with 128 head dimension
- Gated Attention: 16 attention heads for Q, 2 for KV with 256 head dimension
- Rotary Position Embedding: 64 dimensions
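The Gated Attention configuration (16 query heads, 2 KV heads, head dimension 256) is a grouped-query layout in which each KV head serves 8 query heads. The sketch below shows only the shape mechanics under those numbers, with random placeholder weights; the gating itself and rotary embeddings are omitted, and this is not Alibaba's implementation.

```python
import numpy as np

def grouped_query_attention(x, wq, wk, wv,
                            n_q_heads=16, n_kv_heads=2, head_dim=256):
    """Grouped-query attention shape sketch: 16 query heads share 2 KV
    heads, so each KV head is repeated for its group of 8 query heads."""
    seq = x.shape[0]
    q = (x @ wq).reshape(seq, n_q_heads, head_dim)
    k = (x @ wk).reshape(seq, n_kv_heads, head_dim)
    v = (x @ wv).reshape(seq, n_kv_heads, head_dim)
    group = n_q_heads // n_kv_heads                  # 8 query heads per KV head
    k = np.repeat(k, group, axis=1)                  # (seq, 16, 256)
    v = np.repeat(v, group, axis=1)
    scores = np.einsum("qhd,khd->hqk", q, k) / np.sqrt(head_dim)
    probs = np.exp(scores - scores.max(-1, keepdims=True))
    probs /= probs.sum(-1, keepdims=True)            # softmax over keys
    out = np.einsum("hqk,khd->qhd", probs, v)
    return out.reshape(seq, n_q_heads * head_dim)

rng = np.random.default_rng(1)
x = rng.standard_normal((4, 2048))                   # 4 tokens, hidden dim 2048
wq = rng.standard_normal((2048, 16 * 256))
wk = rng.standard_normal((2048, 2 * 256))
wv = rng.standard_normal((2048, 2 * 256))
out = grouped_query_attention(x, wq, wk, wv)
assert out.shape == (4, 4096)
```

Using 2 KV heads instead of 16 shrinks the KV cache eightfold, which matters at the 262K-token native context length.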
Deployment
The model is compatible with SGLang (version 0.5.10+), vLLM (version 0.19+), and KTransformers. Alibaba recommends a context length of at least 128K tokens for optimal thinking capabilities, though this can be reduced under memory constraints.
For serving, the company recommends tensor parallelism across 8 GPUs with 0.8 memory fraction for the full 262K context window. The model supports tool use and multi-token prediction modes.
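The recommended serving setup might look like the following vLLM invocation; the flags shown are standard vLLM engine arguments, but the Hugging Face repo name and the tool-call parser choice are assumptions, not confirmed by the release.

```shell
# Illustrative vLLM launch for the recommended settings:
# 8-way tensor parallelism, 0.8 GPU memory fraction, full 262K window.
vllm serve Qwen/Qwen3.6-35B-A3B \
  --tensor-parallel-size 8 \
  --gpu-memory-utilization 0.8 \
  --max-model-len 262144 \
  --enable-auto-tool-choice \
  --tool-call-parser hermes    # parser choice is an assumption
```

Lowering `--max-model-len` is the usual first lever if the 262K window does not fit in available GPU memory.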
Pricing
Pricing has not been disclosed. The model weights are available on Hugging Face under an open-weight license.
What This Means
Qwen3.6-35B-A3B demonstrates that MoE architectures with high expert counts (256) can achieve competitive performance on coding tasks while keeping activation cost relatively low (3B parameters). Its 73.4% SWE-bench Verified score lands below Qwen3.5-27B (75.0%) but above Qwen3.5-35B-A3B (70.0%), suggesting architectural refinements beyond pure parameter scaling. The extended context capability to 1M tokens addresses a key limitation for repository-level code understanding, though real-world performance at maximum context length remains to be independently verified.