Arcee AI releases Trinity-Large-Thinking: 398B sparse MoE model with chain-of-thought reasoning
Arcee AI released Trinity-Large-Thinking, a 398B-parameter sparse Mixture-of-Experts model with approximately 13B active parameters per token, post-trained with extended chain-of-thought reasoning for agentic workflows. The model achieves 94.7% on τ²-Bench, 91.9% on PinchBench, and 98.2% on LiveCodeBench, generating explicit reasoning traces in <think>...</think> blocks before producing responses.
Trinity Large Thinking — Quick Specs
Arcee AI Releases Trinity-Large-Thinking: 398B Sparse MoE Model for Agent Reasoning
Arcee AI released Trinity-Large-Thinking, a 398B-parameter sparse Mixture-of-Experts model with approximately 13B active parameters per token. The model is post-trained with extended chain-of-thought reasoning and agentic reinforcement learning, designed specifically for tool calling, multi-step planning, and agent workflows.
Architecture and Specifications
Trinity-Large-Thinking uses a sparse MoE architecture with 256 total experts (1 shared), of which 4 are active per token, resulting in 1.56% sparsity. The model was trained on 17 trillion tokens using 2,048 NVIDIA B300 GPUs with HSDP and Expert Parallelism, with data partnership from Datology and compute partnership from Prime Intellect.
The pretraining context length is 8,192 tokens, extended to 512k after context length extension. Post-training included instruction tuning and agentic RL with extended chain-of-thought, trained on tool-calling trajectories and multi-step reasoning chains.
Reasoning and Tool Calling
Trinity-Large-Thinking generates explicit reasoning traces wrapped in <think>...</think> blocks before producing final responses. When served via vLLM, reasoning is exposed in a dedicated reasoning_content field in the API response. For multi-turn agentic loops, the full thinking blocks must be preserved in conversation history for subsequent turns—stripping thinking tokens breaks the model's prior reasoning chain.
vLLM deployments require --enable-reasoning --reasoning-parser deepseek_r1 and --enable-auto-tool-choice --tool-call-parser qwen3_coder flags to fully expose reasoning content and structured tool calls in OpenAI-compatible format.
Benchmark Performance
Trinity-Large-Thinking reports the following benchmark scores:
- τ²-Airline: 88.0% (vs. Opus 4.6: 82.0%)
- τ²-Telecom: 94.7% (vs. Opus 4.6: 92.1%)
- PinchBench: 91.9% (vs. Opus 4.6: 93.3%)
- LiveCodeBench: 98.2%
- GPQA-Diamond: 76.3% (vs. Opus 4.6: 89.2%)
- AIME25: 96.3% (vs. Opus 4.6: 99.8%)
- MMLU-Pro: 83.4% (vs. Opus 4.6: 89.1%)
- SWE-bench Verified: 63.2% (evaluated in mini-swe-agent-v2; vs. Opus 4.6: 75.6%)
- BCFLv4: 70.1% (vs. Opus 4.6: 77.0%)
The model shows strong performance on agentic benchmarks (τ²-Telecom, PinchBench, LiveCodeBench) but trails Claude Opus on general reasoning benchmarks (GPQA-Diamond, AIME25).
Availability and Deployment
Trinity-Large-Thinking is available via:
- OpenRouter API: No setup required; full reasoning and tool-calling support
- vLLM: Recommended for agentic deployments (supported in vLLM 0.11.1+)
- Hugging Face: Direct model download with
trust_remote_code=True - Chat interface: chat.arcee.ai
The model works as a drop-in replacement for OpenClaw and Hermes Agent frameworks, with native tool-calling format compatible with agent execution loops.
Model Family
Trinity-Large-Thinking is one of four checkpoints in the Trinity-Large family:
- Trinity-Large-Thinking: Reasoning-optimized with agentic post-training (this release)
- Trinity-Large-Preview: Lightly post-trained, chat-ready instruct model without reasoning content
- Trinity-Large-TrueBase: 10T-token pre-anneal pretraining checkpoint
- Trinity-Large-Base: Full 17T-token pretrained foundation model with mid-training anneals
What This Means
Arcee AI's Trinity-Large-Thinking represents a specialized approach to reasoning models, prioritizing agentic performance over general-purpose capabilities. The model excels on task-oriented benchmarks (94.7% τ²-Telecom, 91.9% PinchBench) but underperforms on knowledge-heavy benchmarks compared to Claude Opus, suggesting it trades breadth for depth in agent-specific reasoning. The 512k context window and explicit reasoning traces make it technically suited for long-running agent loops, though real-world performance depends on proper context management—requiring users to preserve thinking tokens across multi-turn conversations. Availability on OpenRouter removes deployment friction for developers building agentic systems.
Related Articles
Cohere Releases Command A+ Open Source Model with 25B Active Parameters, 128K Context
Cohere has released Command A+ as an open source model under Apache 2.0 license. The sparse mixture-of-experts architecture features 25 billion active parameters out of 218B total parameters, supports 128K input context length, and includes vision capabilities alongside tool use and reasoning features.
Cohere Releases Command A+: 218B-Parameter MoE Model With 4-Bit Quantization Runs on Single B200 GPU
Cohere has released Command A+, an open-source sparse mixture-of-experts model with 218 billion total parameters and 25 billion active parameters. The model features W4A4 quantization allowing deployment on a single Nvidia B200 GPU, supports 128K input context, and includes built-in chain-of-thought reasoning with vision capabilities.
Tencent Releases Hy-MT2 Translation Models: 1.8B, 7B, and 30B-A3B Support 33 Languages
Tencent released Hy-MT2, a family of multilingual translation models available in 1.8B, 7B, and 30B-A3B (MoE) sizes. All models support translation among 33 languages and follow translation instructions in multiple languages. The 1.8B model can be compressed to 440MB using 1.25-bit AngelSlim quantization.
Tencent Releases Hy-MT2: 1.8B Translation Model Compressed to 440MB With 1.25-Bit Quantization
Tencent has open-sourced Hy-MT2, a family of multilingual translation models available in 1.8B, 7B, and 30B-A3B parameter sizes. The models support translation across 33 languages and include extreme quantization down to 1.25-bit, reducing the 1.8B model to 440MB storage while increasing inference speed by 1.5x.
Comments
Loading...