Holo3 Achieves New State-of-the-Art on OSWorld Computer Use Benchmark
H Company announced Holo3, a specialized model for autonomous computer use that scores 78.85% on OSWorld-Verified, the industry's leading desktop automation benchmark. The company claims this is a new state-of-the-art result on the benchmark.
Model Specifications and Availability
The flagship variant, Holo3-122B-A10B, uses a mixture-of-experts architecture with 122B total parameters but only 10B active parameters per inference step. This design aims to reduce computational cost compared to dense proprietary models.
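The active-parameter saving in a mixture-of-experts layer comes from routing each token through only a few experts. The sketch below is a minimal, illustrative top-k router in NumPy, not H Company's actual architecture; the function name, sizes, and routing details are assumptions for demonstration.

```python
import numpy as np

rng = np.random.default_rng(0)

def topk_moe_layer(x, expert_weights, gate_weights, k=2):
    """Route one token through only the top-k experts.

    x:              (d,) token activation
    expert_weights: (num_experts, d, d) one weight matrix per expert
    gate_weights:   (num_experts, d) router projection
    """
    logits = gate_weights @ x                  # one routing score per expert
    top = np.argsort(logits)[-k:]              # indices of the k highest-scoring experts
    gate = np.exp(logits[top]) / np.exp(logits[top]).sum()  # softmax over the chosen k
    # Only k expert matrices are multiplied: this is the "active parameter" saving.
    return sum(g * (expert_weights[e] @ x) for g, e in zip(gate, top))

num_experts, d, k = 16, 8, 2
experts = rng.normal(size=(num_experts, d, d))
gates = rng.normal(size=(num_experts, d))
y = topk_moe_layer(rng.normal(size=d), experts, gates, k)

total_params = experts.size     # all experts must be stored
active_params = k * d * d       # but only k experts run per token
print(f"active fraction: {active_params / total_params:.3f}")
```

At this toy scale, 2 of 16 experts are active per token (12.5% of expert weights); Holo3-122B-A10B's reported 10B-of-122B ratio reflects the same principle at production scale.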
H Company offers two tiers of access:
- Holo3-35B-A3B: Weights openly available on Hugging Face under Apache 2.0 license, with free tier access through H Company's inference API
- Holo3-122B-A10B: Available exclusively through H Company's Inference API
Pricing has not yet been disclosed. The company positions the models as lower-cost alternatives to GPT 5.4 and Opus 4.6, but offered no per-token figures.
Training Approach: The Agentic Learning Flywheel
Holo3 uses a three-stage training pipeline:
- Synthetic Navigation Data: Generated scenario-specific navigation examples from human and automated instructions
- Out-of-Domain Augmentation: Programmatic extension of scenarios to handle unexpected variations
- Curated Reinforcement Learning: Advanced data filtering and RL optimization applied to all training samples
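The three stages can be pictured as a data pipeline: generate trajectories, widen them with programmatic variants, then filter to high-quality samples before RL. The sketch below is a hypothetical skeleton of that flow, with a toy scorer standing in for real task verification; none of these function names come from H Company.

```python
from dataclasses import dataclass, field

@dataclass
class Sample:
    instruction: str
    actions: list = field(default_factory=list)
    reward: float = 0.0

def generate_navigation_data(instructions):
    # Stage 1: turn instructions into scenario-specific trajectories (stubbed).
    return [Sample(i, actions=[f"click:{i}"]) for i in instructions]

def augment_out_of_domain(samples):
    # Stage 2: programmatically vary scenarios to cover unexpected UI states.
    return samples + [Sample(s.instruction + " (variant)", list(s.actions)) for s in samples]

def curate_for_rl(samples, min_reward=0.5):
    # Stage 3: keep only high-quality trajectories before RL optimization.
    return [s for s in samples if s.reward >= min_reward]

def run_flywheel(instructions):
    data = generate_navigation_data(instructions)
    data = augment_out_of_domain(data)
    for s in data:  # toy scorer; a real pipeline would verify task completion
        s.reward = 1.0 if "login" in s.instruction else 0.3
    return curate_for_rl(data)

kept = run_flywheel(["login to portal", "export report"])
print([s.instruction for s in kept])
```

The "flywheel" framing suggests the curated output feeds back into generation for the next round, though the article does not describe that loop in detail.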
The company emphasizes that its training methodology—called the "agentic learning flywheel"—focuses on two core capabilities: perception (visual grounding on UI elements) and decision-making (action sequencing).
Internal Benchmarking: H Corporate Benchmarks
Beyond OSWorld validation, H Company developed proprietary H Corporate Benchmarks containing 486 multi-step tasks across four categories:
- E-commerce workflows
- Business software operations
- Collaboration tools
- Multi-application workflows requiring cross-system coordination
Tasks range from single-application focus to complex multi-app scenarios—such as retrieving equipment prices from PDFs, cross-referencing employee budgets, and sending personalized approval emails. According to the company, Holo3 outperforms larger base models (including Qwen 3.5 variants) on these single-application benchmarks despite having significantly fewer parameters.
H Company built these benchmarks using a "Synthetic Environment Factory" that automatically generates websites and enterprise applications via coding agents, then validates task completion with verification scripts.
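Verification scripts of this kind typically check the final environment state rather than the agent's action sequence. The example below is a hypothetical verifier in that spirit, loosely modeled on the multi-app task described above (PDF prices, budgets, approval email); the state keys and checks are invented for illustration, not H Company's actual scripts.

```python
def verify_task(env_state: dict, checks: list) -> bool:
    """A task passes only if every check on the final state holds."""
    return all(check(env_state) for check in checks)

# Hypothetical final state: "send one approval email with the correct total"
final_state = {
    "sent_emails": [{"to": "manager@example.com", "subject": "Approval", "total": 1450}],
    "pdf_prices": {"laptop": 1200, "dock": 250},
}

checks = [
    lambda s: len(s["sent_emails"]) == 1,
    lambda s: s["sent_emails"][0]["total"] == sum(s["pdf_prices"].values()),
]

print(verify_task(final_state, checks))  # True
```

Checking outcomes rather than action traces lets many different successful trajectories count as correct, which matters when UI layouts vary across generated environments.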
What This Means
Holo3 demonstrates that specialized training for computer use tasks can rival or exceed dense proprietary models at lower parameter counts. The 78.85% OSWorld score is competitive with publicly disclosed results from other vendors, though direct comparison requires reviewing their methodologies and benchmark versions.
The mixture-of-experts architecture with 10B active parameters is operationally significant—it suggests meaningful efficiency gains in production deployment compared to dense 122B models, which could translate to lower latency and infrastructure costs.
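Rough back-of-envelope arithmetic, using the commonly cited approximation of about 2 FLOPs per active weight per token, illustrates the trade-off: per-token compute scales with active parameters, while weight memory still scales with total parameters. These are ballpark figures, not vendor-reported numbers.

```python
total_params, active_params = 122e9, 10e9

# Per-token matmul compute scales with *active* parameters (~2 FLOPs/weight)...
flops_moe = 2 * active_params
flops_dense = 2 * total_params
print(f"compute ratio vs a dense 122B model: {flops_moe / flops_dense:.3f}")

# ...but all expert weights must still be resident (bf16 = 2 bytes/param).
print(f"weight memory at bf16: {total_params * 2 / 1e9:.0f} GB")
```

So the latency and cost savings come from compute, not memory: serving still requires hardware that can hold roughly 244 GB of weights at bf16, which is why MoE deployments often shard experts across accelerators.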
The open-sourcing of Holo3-35B-A3B under Apache 2.0 gives developers access to a computer use model without licensing restrictions, though performance on OSWorld for this smaller variant was not disclosed. H Company's investment in synthetic environments and internal benchmarking suggests confidence that the model generalizes beyond these controlled settings, but real-world enterprise performance data remains absent.
The stated next frontier—"Adaptive Agency" enabling models to autonomously learn new enterprise software in real-time—remains a claim rather than a demonstrated capability.
Related Articles
DeepSeek Releases V4 Flash: 284B-Parameter MoE Model with 1M Context Window, Free via OpenRouter
DeepSeek has released V4 Flash, a Mixture-of-Experts model with 284B total parameters and 13B activated parameters per forward pass. The model supports a 1M-token context window and is available free through OpenRouter, targeting high-throughput coding and chat applications.
Tencent Releases Hy3 Preview: Mixture-of-Experts Model with 262K Context and Configurable Reasoning
Tencent has released Hy3 preview, a Mixture-of-Experts model with a 262,144 token context window priced at $0.066 per million input tokens and $0.26 per million output tokens. The model features three configurable reasoning modes—disabled, low, and high—designed for agentic workflows and production environments.
Allen Institute releases EMO, 14B parameter MoE model with selective 12.5% expert use
Allen Institute for AI released EMO, a 1B-active, 14B-total-parameter mixture-of-experts model trained on 1 trillion tokens. The model uses 8 active experts per token from a pool of 128 total experts, and can maintain near full-model performance while using just 12.5% of its experts for specific tasks.
Zyphra Releases ZAYA1-8B: 8.4B Parameter MoE Model with 760M Active Parameters Matches 80B+ Models on Math Benchmarks
Zyphra has released ZAYA1-8B, a mixture-of-experts language model with 760M active parameters and 8.4B total parameters. The model scores 89.1% on AIME 2026, competitive with models exceeding 100B parameters, while maintaining efficiency for on-device deployment.