
Arcee AI releases Trinity-Large-Thinking: 398B sparse MoE model with chain-of-thought reasoning

TL;DR

Arcee AI released Trinity-Large-Thinking, a 398B-parameter sparse Mixture-of-Experts model with approximately 13B active parameters per token, post-trained with extended chain-of-thought reasoning for agentic workflows. The model achieves 94.7% on τ²-Bench (Telecom), 91.9% on PinchBench, and 98.2% on LiveCodeBench, generating explicit reasoning traces in <think>...</think> blocks before producing responses.



Arcee AI released Trinity-Large-Thinking, a 398B-parameter sparse Mixture-of-Experts model with approximately 13B active parameters per token. The model is post-trained with extended chain-of-thought reasoning and agentic reinforcement learning, designed specifically for tool calling, multi-step planning, and agent workflows.

Architecture and Specifications

Trinity-Large-Thinking uses a sparse MoE architecture with 256 experts in total (1 of them shared), of which 4 are active per token, an active-expert ratio of roughly 1.56% (4/256). The model was trained on 17 trillion tokens using 2,048 NVIDIA B300 GPUs with HSDP and Expert Parallelism, with data partnership from Datology and compute partnership from Prime Intellect.
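As a sanity check, the active-expert ratio falls straight out of the routing configuration, while the active-parameter share is larger because attention layers and the shared expert run for every token (quick arithmetic on the approximate figures quoted above):

```python
# Active-expert ratio for Trinity-Large-Thinking's MoE routing,
# using the configuration quoted in the release notes above.
total_experts = 256   # total experts (1 shared, rest routed)
active_experts = 4    # experts selected per token by the router

print(f"{active_experts / total_experts:.2%}")  # → 1.56%

# The active-parameter share is higher than the expert ratio,
# since attention and the shared expert fire on every token.
total_params_b = 398   # ~398B total parameters
active_params_b = 13   # ~13B active per token

print(f"{active_params_b / total_params_b:.2%}")  # → 3.27%
```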

The pretraining context length is 8,192 tokens, later extended to 512K through context-length extension. Post-training combined instruction tuning with agentic RL using extended chain-of-thought, drawing on tool-calling trajectories and multi-step reasoning chains.

Reasoning and Tool Calling

Trinity-Large-Thinking generates explicit reasoning traces wrapped in <think>...</think> blocks before producing final responses. When served via vLLM, reasoning is exposed in a dedicated reasoning_content field in the API response. For multi-turn agentic loops, the full thinking blocks must be preserved in conversation history for subsequent turns—stripping thinking tokens breaks the model's prior reasoning chain.
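In practice this means the serving layer's separated reasoning must be stitched back into the assistant message before it re-enters conversation history. A minimal sketch, assuming a vLLM-style response that returns reasoning in a separate reasoning_content field; the helper name is illustrative, not part of any official SDK:

```python
# Sketch: preserving <think> blocks across turns in an agentic loop.
# Assumes reasoning arrives separately (e.g. vLLM's reasoning_content);
# the helper name is hypothetical.

def assistant_turn_with_thinking(reasoning: str, content: str) -> dict:
    """Rebuild the assistant message with its <think> block intact.
    Stripping the thinking tokens before the next turn would break
    the model's prior reasoning chain."""
    return {
        "role": "assistant",
        "content": f"<think>{reasoning}</think>{content}",
    }

history = [{"role": "user", "content": "Book me a flight to SFO."}]

# Fields as they might come back from the serving layer (stubbed here):
reasoning = "The user wants a flight; I should call the search tool."
answer = "Searching flights to SFO now."

history.append(assistant_turn_with_thinking(reasoning, answer))
print(history[-1]["content"])
```

On the next request, `history` is sent as-is, so the model sees its own earlier reasoning rather than a truncated transcript.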

vLLM deployments require --enable-reasoning --reasoning-parser deepseek_r1 and --enable-auto-tool-choice --tool-call-parser qwen3_coder flags to fully expose reasoning content and structured tool calls in OpenAI-compatible format.

Benchmark Performance

Arcee AI reports the following benchmark scores for Trinity-Large-Thinking:

  • τ²-Airline: 88.0% (vs. Opus 4.6: 82.0%)
  • τ²-Telecom: 94.7% (vs. Opus 4.6: 92.1%)
  • PinchBench: 91.9% (vs. Opus 4.6: 93.3%)
  • LiveCodeBench: 98.2%
  • GPQA-Diamond: 76.3% (vs. Opus 4.6: 89.2%)
  • AIME25: 96.3% (vs. Opus 4.6: 99.8%)
  • MMLU-Pro: 83.4% (vs. Opus 4.6: 89.1%)
  • SWE-bench Verified: 63.2% (evaluated in mini-swe-agent-v2; vs. Opus 4.6: 75.6%)
  • BFCL v4: 70.1% (vs. Opus 4.6: 77.0%)

The model shows strong performance on agentic and coding benchmarks (τ²-Airline, τ²-Telecom, LiveCodeBench) but trails Claude Opus 4.6 on general reasoning and knowledge benchmarks (GPQA-Diamond, AIME25, MMLU-Pro).

Availability and Deployment

Trinity-Large-Thinking is available via:

  • OpenRouter API: No setup required; full reasoning and tool-calling support
  • vLLM: Recommended for agentic deployments (supported in vLLM 0.11.1+)
  • Hugging Face: Direct model download with trust_remote_code=True
  • Chat interface: chat.arcee.ai

The model works as a drop-in replacement in OpenClaw and Hermes Agent frameworks, with a native tool-calling format compatible with agent execution loops.
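An agent execution loop around that tool-calling format might look like the following sketch. The weather tool and the stubbed assistant turn are hypothetical; a real loop would call the model between iterations, appending both the tool-call message and the tool results to history:

```python
import json

# Sketch of an agent loop consuming OpenAI-format tool calls, the shape
# exposed when serving with the tool-call parser flags described above.
# The registered tool and the stubbed request are illustrative only.

TOOLS = {
    "get_weather": lambda city: {"city": city, "temp_c": 18},
}

def run_tool_calls(tool_calls: list) -> list:
    """Execute each requested tool and build the `tool`-role messages
    that get appended to conversation history for the next turn."""
    results = []
    for call in tool_calls:
        fn = TOOLS[call["function"]["name"]]
        args = json.loads(call["function"]["arguments"])
        results.append({
            "role": "tool",
            "tool_call_id": call["id"],
            "content": json.dumps(fn(**args)),
        })
    return results

# A stubbed assistant turn requesting one tool call:
stub_calls = [{
    "id": "call_0",
    "function": {
        "name": "get_weather",
        "arguments": json.dumps({"city": "Berlin"}),
    },
}]
messages = run_tool_calls(stub_calls)
print(messages[0]["content"])
```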

Model Family

Trinity-Large-Thinking is one of four checkpoints in the Trinity-Large family:

  • Trinity-Large-Thinking: Reasoning-optimized with agentic post-training (this release)
  • Trinity-Large-Preview: Lightly post-trained, chat-ready instruct model without reasoning content
  • Trinity-Large-TrueBase: 10T-token pre-anneal pretraining checkpoint
  • Trinity-Large-Base: Full 17T-token pretrained foundation model with mid-training anneals

What This Means

Arcee AI's Trinity-Large-Thinking represents a specialized approach to reasoning models, prioritizing agentic performance over general-purpose capabilities. The model excels on task-oriented benchmarks (94.7% τ²-Telecom, 91.9% PinchBench) but underperforms on knowledge-heavy benchmarks compared to Claude Opus, suggesting it trades breadth for depth in agent-specific reasoning. The 512k context window and explicit reasoning traces make it technically suited for long-running agent loops, though real-world performance depends on proper context management—requiring users to preserve thinking tokens across multi-turn conversations. Availability on OpenRouter removes deployment friction for developers building agentic systems.

