model release

Alibaba Releases Qwen3.6-35B-A3B: 35B Parameter MoE Model with 262K Context Window

TL;DR

Alibaba has released Qwen3.6-35B-A3B, the first open-weight model in the Qwen3.6 series. The model pairs 35B total parameters with 3B activated per token in a mixture-of-experts design (256 experts, 8 activated per token), offers a native 262K context window extensible to 1.01M tokens, and is reported to score 73.4% on SWE-bench Verified.


Alibaba has released Qwen3.6-35B-A3B, the first open-weight variant in the Qwen3.6 series. The model features 35 billion total parameters with 3 billion activated per forward pass, using a mixture-of-experts architecture with 256 experts.

Architecture Specifications

The model employs a distinctive architecture combining Gated DeltaNet and Gated Attention layers across 40 layers with a 2048 hidden dimension. The MoE configuration activates 8 experts plus 1 shared expert per token, with each expert having a 512 intermediate dimension.

Key specifications:

  • Context window: 262,144 tokens natively, extensible to 1,010,000 tokens
  • Vocabulary: 248,320 tokens (padded embedding)
  • Training: Multi-token prediction (MTP)
  • Architecture: 10 × (3 × (Gated DeltaNet → MoE) → 1 × (Gated Attention → MoE))
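The repeating pattern above can be sketched in code. This is a minimal illustration of the layer schedule only; the function and layer names are made up, not the model's actual module names:

```python
# Minimal sketch of the 40-layer schedule: 10 blocks, each stacking
# three Gated DeltaNet layers and one Gated Attention layer, with
# every layer followed by an MoE feed-forward step.
def layer_schedule(num_blocks=10, deltanet_per_block=3):
    layers = []
    for _ in range(num_blocks):
        for _ in range(deltanet_per_block):
            layers.append(("gated_deltanet", "moe"))
        layers.append(("gated_attention", "moe"))
    return layers

schedule = layer_schedule()
assert len(schedule) == 40  # matches the stated 40-layer depth
```

Expanding the schedule confirms the stated depth: 30 Gated DeltaNet layers and 10 Gated Attention layers, 40 in total.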

Benchmark Performance

According to Alibaba, Qwen3.6-35B-A3B achieves substantial improvements in coding benchmarks:

Coding Agent Tasks:

  • SWE-bench Verified: 73.4%
  • SWE-bench Multilingual: 67.2%
  • SWE-bench Pro: 49.5%
  • Terminal-Bench 2.0: 51.5%
  • Claw-Eval Average: 68.7%

Knowledge Benchmarks:

  • MMLU-Pro: 85.2%
  • MMLU-Redux: 93.3%
  • C-Eval: 90.0%

STEM & Reasoning:

  • GPQA: 86.0%
  • LiveCodeBench v6: 80.4%
  • AIME 2026: 92.7%

Vision Language:

  • MMMU: 81.7%
  • MathVista (mini): 86.4%
  • RealWorldQA: 85.3%
  • VideoMMMU: 83.7%

All benchmarks were run with the company's internal evaluation harness; the temperature and context-window settings used are disclosed in Alibaba's documentation.

Technical Features

The model introduces "thinking preservation," which carries reasoning context forward from earlier messages in a session rather than discarding it, reducing the computational overhead of regenerating that reasoning during iterative development. Alibaba claims this improves performance on repository-level reasoning and frontend workflows.
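As a rough illustration of what thinking preservation implies on the client side, the sketch below keeps the model's reasoning in the message history instead of stripping it between turns. The `reasoning_content` field name is an assumption for illustration, not a documented API:

```python
# Hypothetical sketch: retain the model's reasoning across turns so it
# can be reused instead of recomputed. The "reasoning_content" key is
# an assumed field name, not confirmed by Alibaba's documentation.
def append_turn(history, user_msg, assistant_msg, preserve_thinking=True):
    history.append({"role": "user", "content": user_msg})
    entry = {"role": "assistant", "content": assistant_msg["content"]}
    if preserve_thinking and "reasoning_content" in assistant_msg:
        # Keep the prior reasoning in context for later turns.
        entry["reasoning_content"] = assistant_msg["reasoning_content"]
    history.append(entry)
    return history
```

With `preserve_thinking=False` the history degrades to the conventional behavior of dropping reasoning between turns.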

The architecture uses:

  • Gated DeltaNet: 32 linear attention heads for V, 16 for QK with 128 head dimension
  • Gated Attention: 16 attention heads for Q, 2 for KV with 256 head dimension
  • Rotary Position Embedding: 64 dimensions

Deployment

The model is compatible with SGLang (version 0.5.10+), vLLM (version 0.19+), and KTransformers. Alibaba recommends a context length of at least 128K tokens for optimal thinking capability, though this can be reduced under memory constraints.

For serving, the company recommends tensor parallelism across 8 GPUs with a GPU memory fraction of 0.8 to accommodate the full 262K context window. The model supports tool use and multi-token prediction modes.
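The serving recommendation above maps onto a vLLM launch roughly as follows. The flags are standard vLLM serve options; the model ID and this model's exact flag support are assumptions:

```shell
# Hedged example: serve the full 262K context window across 8 GPUs.
# Model ID is assumed; flag values mirror Alibaba's recommendations.
vllm serve Qwen/Qwen3.6-35B-A3B \
  --tensor-parallel-size 8 \
  --gpu-memory-utilization 0.8 \
  --max-model-len 262144
```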

Pricing

Pricing has not been disclosed. The model weights are available on Hugging Face under an open-weight license.

What This Means

Qwen3.6-35B-A3B demonstrates that MoE architectures with high expert counts (256) can achieve competitive performance on coding tasks while keeping activation cost low (3B parameters per token). Its 73.4% SWE-bench Verified score sits above Qwen3.5-35B-A3B (70.0%) but below Qwen3.5-27B (75.0%), suggesting architectural refinements beyond pure parameter scaling. The extended context capability to 1M tokens addresses a key limitation for repository-level code understanding, though real-world performance at maximum context length remains to be independently verified.
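The low activation cost follows directly from the published dimensions. A back-of-envelope check, assuming a SwiGLU-style FFN with three projection matrices per expert (an assumption; Alibaba has not stated the FFN form):

```python
# Back-of-envelope expert parameter count from the published specs.
# Assumes a SwiGLU-style FFN (3 projection matrices per expert).
hidden, inter, layers = 2048, 512, 40
per_expert = 3 * hidden * inter                  # ~3.1M params per expert
total_expert = (256 + 1) * per_expert * layers   # 256 routed + 1 shared
active_expert = (8 + 1) * per_expert * layers    # 8 routed + 1 shared per token

print(f"expert params: {total_expert/1e9:.1f}B total, "
      f"{active_expert/1e9:.2f}B activated per token")
```

Roughly 32B of expert parameters with about 1.1B activated per token, which is broadly consistent with the headline 35B-total / 3B-active figures once attention and embedding parameters are added on top.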

