Cohere Releases Command A+: 218B-Parameter MoE Model With 4-Bit Quantization Runs on Single B200 GPU
Cohere has released Command A+, an open-source sparse mixture-of-experts model with 218 billion total parameters and 25 billion active parameters. The model features W4A4 quantization allowing deployment on a single Nvidia B200 GPU, supports 128K input context, and includes built-in chain-of-thought reasoning with vision capabilities.
Cohere Releases Command A+: 218B-Parameter MoE Model With 4-Bit Quantization Runs on Single B200 GPU
Cohere has released Command A+, an open-source sparse mixture-of-experts (MoE) model with 218 billion total parameters and 25 billion active parameters. The model's W4A4 quantization enables deployment on a single Nvidia B200 GPU, significantly reducing hardware requirements compared to full-precision variants.
Technical Specifications
Command A+ uses a decoder-only architecture with 128 experts, activating 8 per token plus one shared expert. The model supports:
- Context length: 128K input, 64K output
- Quantization options: BF16 (requires 4x B200 or 8x H100), FP8 (2x B200 or 4x H100), W4A4 (1x B200 or 2x H100)
- Languages: 48 languages including English, Chinese, Japanese, Korean, and major European and Asian languages
- Modalities: Text and image inputs
- License: Apache 2.0
Cohere states all three quantization levels show "negligible differences in benchmark quality," though specific benchmark scores were not disclosed in the model card.
Quantization Methodology
The W4A4 quantization applies NVFP4 4-bit precision to MoE experts only, while keeping attention layers (Q/K/V/O projections and KV cache) at full precision. According to Cohere, this selective approach addresses the "outsized quantization tax" that reasoning models typically face, where per-token errors compound during long decoding traces.
The company used Quantization-Aware Distillation (QAD) during post-training, training the quantized model to match the full-precision version's output distribution using fake quantization operators in forward passes and straight-through estimators on backward passes.
Architecture Details
The model interleaves sliding-window attention layers with rotational positional embeddings and global attention layers without positional embeddings in a 3:1 ratio, building on the architecture introduced in Command A. The sparse MoE layer uses a token-choice router with additive-bias-based load balancing and replaces the standard softmax activation with normalized sigmoid over top-k expert logits.
Command A+ includes native chain-of-thought reasoning, generating intermediate thinking steps between <START_THINKING> and <END_THINKING> tags. The model also supports conversational tool use through JSON schema-based function calling integrated into the chat template.
Deployment Requirements
The W4A4 variant requires vLLM version 0.21.0 or higher and Cohere's melody library (version 0.9.0+) for accurate response parsing. Cohere recommends sampling parameters of temperature=0.9, top_p=0.95, and repetition_penalty=1.04.
Pricing information for API access has not been disclosed.
What This Means
Command A+'s ability to run on a single B200 GPU represents a meaningful reduction in deployment costs for models in the 200B+ parameter range. The selective quantization approach—preserving full precision in attention while compressing experts—suggests a practical path for maintaining reasoning quality while reducing memory footprint. However, without published benchmark scores, it's unclear how Command A+ compares to frontier models like GPT-4, Claude 3.5 Sonnet, or DeepSeek V3 on standard reasoning tasks. The model's support for 48 languages and tool use capabilities positions it for enterprise agentic applications, though real-world performance validation remains to be seen.
Related Articles
Mistral releases Leanstral 1.5: 119B parameter open-source model for Lean 4 proof assistance
Mistral AI has released Leanstral 1.5, an open-source 119B parameter mixture-of-experts model designed specifically for Lean 4 proof assistance. The model features 128 experts with 4 active per token (6.5B activated parameters), a 256k token context window, and multimodal input capabilities.
DeepSeek Releases V4-Pro with 1.6T Parameters, 1M Token Context at 27% Inference Cost of V3
DeepSeek has released two Mixture-of-Experts models: V4-Pro with 1.6 trillion parameters (49B activated) and V4-Flash with 284B parameters (13B activated), both supporting 1 million token context windows. V4-Pro requires only 27% of inference FLOPs and 10% of KV cache compared to V3.2 at 1M token context, trained on over 32 trillion tokens.
Portugal releases Amália, open-source 9B parameter AI model trained on European Portuguese
Portugal has released Amália, its first national AI model trained specifically for European Portuguese. Built on EuroLLM-9B with 9 billion parameters, the model is fully open-source with weights, datasets, and code published under an open license. The government has committed €5.5m in initial funding through 2027.
DeepReinforce Releases Ornith-1.0, Open-Source Agentic Coding Model in 9B to 397B Sizes
DeepReinforce has released Ornith-1.0, an MIT-licensed model designed for agentic coding tasks with variants ranging from 9B to 397B parameters. Built on top of Apache 2.0-licensed Gemma 4 and Qwen 3.5 base models, the company claims it achieves state-of-the-art performance among open-source models of comparable size on coding benchmarks.
Comments
Loading...