Holo3 achieves 78.85% on OSWorld benchmark with only 10B active parameters
H Company unveiled Holo3, a computer use model that scores 78.85% on OSWorld-Verified, a leading desktop automation benchmark, which the company says is the highest score reported to date. The model achieves this with only 10B active parameters (122B total), and H Company positions it as a lower-cost alternative to proprietary models such as GPT 5.4 and Opus 4.6.
Holo3 Achieves New State-of-the-Art on OSWorld Computer Use Benchmark
H Company announced Holo3, a specialized model for autonomous computer use that scores 78.85% on OSWorld-Verified, a leading desktop automation benchmark. The company claims this is a new state of the art on that benchmark.
Model Specifications and Availability
The flagship variant, Holo3-122B-A10B, uses a mixture-of-experts architecture with 122B total parameters, of which only 10B are active per inference step. This design aims to reduce computational cost compared with a dense model of the same total size.
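As a rough illustration of why the active-parameter count matters, the sketch below works through the arithmetic. Only the two parameter counts come from the announcement. The rule of thumb of about 2 FLOPs per active parameter per token for a forward pass, and the 2-bytes-per-weight (bf16) storage, are assumptions made here for illustration and are not figures from H Company.

```python
# Back-of-the-envelope comparison of a sparse MoE model with a dense model
# of the same total size. Only the parameter counts come from the article;
# the FLOPs rule of thumb and bf16 storage are illustrative assumptions.

TOTAL_PARAMS = 122e9   # Holo3-122B-A10B total parameters
ACTIVE_PARAMS = 10e9   # parameters active per inference step

FLOPS_PER_PARAM = 2    # assumed: ~2 forward-pass FLOPs per active parameter per token
BYTES_PER_PARAM = 2    # assumed: weights stored in bf16


def forward_flops_per_token(active_params: float) -> float:
    """Approximate forward-pass FLOPs needed to process one token."""
    return FLOPS_PER_PARAM * active_params


def weight_memory_gb(total_params: float) -> float:
    """Approximate memory needed to hold all weights, in gigabytes."""
    return total_params * BYTES_PER_PARAM / 1e9


active_fraction = ACTIVE_PARAMS / TOTAL_PARAMS
# A dense model of the same size activates every parameter for every token.
compute_ratio = forward_flops_per_token(TOTAL_PARAMS) / forward_flops_per_token(ACTIVE_PARAMS)

print(f"Active fraction: {active_fraction:.1%}")
print(f"Dense vs. MoE compute per token: {compute_ratio:.1f}x")
print(f"Weights to keep in memory: {weight_memory_gb(TOTAL_PARAMS):.0f} GB")
```

Under these assumptions, per-token compute drops by roughly an order of magnitude, while all 122B weights still have to be held in memory. Any savings would therefore show up mainly in compute and latency rather than in memory footprint.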
H Company offers two tiers of access:
- Holo3-35B-A3B: Weights openly available on Hugging Face under the Apache 2.0 license, with free-tier access through H Company's Inference API
- Holo3-122B-A10B: Available exclusively through H Company's Inference API
Pricing has not yet been disclosed. The company positions the models as lower-cost alternatives to GPT 5.4 and Opus 4.6 but has not provided per-token prices.
Training Approach: The Agentic Learning Flywheel
Holo3 uses a three-stage training pipeline:
- Synthetic Navigation Data: Generated scenario-specific navigation examples from human and automated instructions
- Out-of-Domain Augmentation: Programmatic extension of scenarios to handle unexpected variations
- Curated Reinforcement Learning: Advanced data filtering and RL optimization applied to all training samples
The company emphasizes that its training methodology—called the "agentic learning flywheel"—focuses on two core capabilities: perception (visual grounding on UI elements) and decision-making (action sequencing).
Internal Benchmarking: H Corporate Benchmarks
Beyond OSWorld validation, H Company developed proprietary H Corporate Benchmarks containing 486 multi-step tasks across four categories:
- E-commerce workflows
- Business software operations
- Collaboration tools
- Multi-application workflows requiring cross-system coordination
Tasks range from single-application workflows to complex multi-app scenarios, such as retrieving equipment prices from PDFs, cross-referencing employee budgets, and sending personalized approval emails. According to the company, Holo3 outperforms larger base models (including Qwen 3.5 variants) on these benchmarks despite having significantly fewer parameters.
H Company built these benchmarks using a "Synthetic Environment Factory" that automatically generates websites and enterprise applications via coding agents, then validates task completion with verification scripts.
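To make the idea of a verification script concrete, here is a minimal sketch of how one might check the multi-application example described above (prices from a PDF, employee budgets, approval emails). Everything in it is hypothetical: the `EnvState` and `Email` types, the field names, and the pass criteria are illustrative assumptions, not H Company's actual harness.

```python
from dataclasses import dataclass, field


@dataclass
class Email:
    to: str
    body: str


@dataclass
class EnvState:
    """Hypothetical snapshot of the synthetic environment after the agent runs."""
    equipment_prices: dict[str, float]   # item -> price, as listed in the PDF
    budgets: dict[str, float]            # employee email -> remaining budget
    requests: dict[str, str]             # employee email -> requested item
    sent_emails: list[Email] = field(default_factory=list)


def verify_approval_task(state: EnvState) -> bool:
    """Return True only if each within-budget request received an email naming
    the requested item, and no over-budget request received such an email."""
    approved = {email.to: email.body for email in state.sent_emails}
    for employee, item in state.requests.items():
        affordable = state.equipment_prices[item] <= state.budgets[employee]
        got_email = employee in approved and item in approved[employee]
        if affordable != got_email:
            return False
    return True


# Illustrative data: one request fits the budget, the other does not.
state = EnvState(
    equipment_prices={"monitor": 300.0, "laptop": 1500.0},
    budgets={"ana@example.com": 500.0, "bo@example.com": 1000.0},
    requests={"ana@example.com": "monitor", "bo@example.com": "laptop"},
    sent_emails=[Email("ana@example.com", "Your monitor request is approved.")],
)
print(verify_approval_task(state))
```

Checking the final environment state, rather than the agent's individual clicks, lets the same script accept any action sequence that reaches a correct outcome. That is one plausible reason to pair generated environments with scripted verifiers.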
What This Means
If the reported results hold up, Holo3 shows that specialized training for computer use can rival or exceed general-purpose proprietary models while activating only 10B parameters per inference step. The 78.85% OSWorld-Verified score is competitive with publicly disclosed results from other vendors, though a direct comparison requires checking that those results used the same methodology and benchmark version.
The mixture-of-experts architecture with 10B active parameters is operationally significant—it suggests meaningful efficiency gains in production deployment compared to dense 122B models, which could translate to lower latency and infrastructure costs.
The release of Holo3-35B-A3B weights under Apache 2.0 gives developers a permissively licensed computer use model, though OSWorld performance for this smaller variant was not disclosed. H Company's investment in synthetic environments and internal benchmarking suggests confidence that the model generalizes beyond these controlled settings, but real-world enterprise performance data remains absent.
The stated next frontier—"Adaptive Agency" enabling models to autonomously learn new enterprise software in real-time—remains a claim rather than a demonstrated capability.