Cohere Releases Command A+: 218B-Parameter MoE Model With 4-Bit Quantization Runs on Single B200 GPU
Cohere has released Command A+, an open-source sparse mixture-of-experts model with 218 billion total parameters and 25 billion active parameters. The model features W4A4 quantization allowing deployment on a single Nvidia B200 GPU, supports 128K input context, and includes built-in chain-of-thought reasoning with vision capabilities.
Cohere Releases Command A+: 218B-Parameter MoE Model With 4-Bit Quantization Runs on Single B200 GPU
Cohere has released Command A+, an open-source sparse mixture-of-experts (MoE) model with 218 billion total parameters and 25 billion active parameters. The model's W4A4 quantization enables deployment on a single Nvidia B200 GPU, significantly reducing hardware requirements compared to full-precision variants.
Technical Specifications
Command A+ uses a decoder-only architecture with 128 experts, activating 8 per token plus one shared expert. The model supports:
- Context length: 128K input, 64K output
- Quantization options: BF16 (requires 4x B200 or 8x H100), FP8 (2x B200 or 4x H100), W4A4 (1x B200 or 2x H100)
- Languages: 48 languages including English, Chinese, Japanese, Korean, and major European and Asian languages
- Modalities: Text and image inputs
- License: Apache 2.0
Cohere states all three quantization levels show "negligible differences in benchmark quality," though specific benchmark scores were not disclosed in the model card.
Quantization Methodology
The W4A4 quantization applies NVFP4 4-bit precision to MoE experts only, while keeping attention layers (Q/K/V/O projections and KV cache) at full precision. According to Cohere, this selective approach addresses the "outsized quantization tax" that reasoning models typically face, where per-token errors compound during long decoding traces.
The company used Quantization-Aware Distillation (QAD) during post-training, training the quantized model to match the full-precision version's output distribution using fake quantization operators in forward passes and straight-through estimators on backward passes.
Architecture Details
The model interleaves sliding-window attention layers with rotational positional embeddings and global attention layers without positional embeddings in a 3:1 ratio, building on the architecture introduced in Command A. The sparse MoE layer uses a token-choice router with additive-bias-based load balancing and replaces the standard softmax activation with normalized sigmoid over top-k expert logits.
Command A+ includes native chain-of-thought reasoning, generating intermediate thinking steps between <START_THINKING> and <END_THINKING> tags. The model also supports conversational tool use through JSON schema-based function calling integrated into the chat template.
Deployment Requirements
The W4A4 variant requires vLLM version 0.21.0 or higher and Cohere's melody library (version 0.9.0+) for accurate response parsing. Cohere recommends sampling parameters of temperature=0.9, top_p=0.95, and repetition_penalty=1.04.
Pricing information for API access has not been disclosed.
What This Means
Command A+'s ability to run on a single B200 GPU represents a meaningful reduction in deployment costs for models in the 200B+ parameter range. The selective quantization approach—preserving full precision in attention while compressing experts—suggests a practical path for maintaining reasoning quality while reducing memory footprint. However, without published benchmark scores, it's unclear how Command A+ compares to frontier models like GPT-4, Claude 3.5 Sonnet, or DeepSeek V3 on standard reasoning tasks. The model's support for 48 languages and tool use capabilities positions it for enterprise agentic applications, though real-world performance validation remains to be seen.
Related Articles
Cohere Releases Command A+ Open Source Model with 25B Active Parameters, 128K Context
Cohere has released Command A+ as an open source model under Apache 2.0 license. The sparse mixture-of-experts architecture features 25 billion active parameters out of 218B total parameters, supports 128K input context length, and includes vision capabilities alongside tool use and reasoning features.
Perceptron Launches Mk1 Vision-Language Model with Video Reasoning at $0.15/$1.50 per 1M Tokens
Perceptron has released Perceptron Mk1, a vision-language model designed for video understanding and embodied reasoning tasks. The model accepts image and video inputs with 33K context window, priced at $0.15 per 1M input tokens and $1.50 per 1M output tokens, and supports structured spatial annotations on demand.
Google releases Gemini Omni Flash video generation model with conversational editing, withholds speech synthesis
Google DeepMind released Gemini Omni Flash, the first model in its new Omni family that generates and edits video from image, audio, video, and text inputs. The model is rolling out to Gemini app subscribers and YouTube Shorts with a 10-second clip limit, while speech-editing capabilities remain withheld pending safety testing.
Google releases Gemini 3.5 Flash with 4x faster output and agentic capabilities, 3.5 Pro coming June
Google released Gemini 3.5 Flash today with 4x faster output token generation than competing frontier models while surpassing Gemini 3.1 Pro on coding, agentic, and multimodal benchmarks. The company announced Gemini 3.5 Pro will launch next month and introduced Gemini Omni, a new multimodal series that outputs video.
Comments
Loading...