Holo3 Achieves New State-of-the-Art on OSWorld Computer Use Benchmark
H Company announced Holo3, a specialized model for autonomous computer use that scores 78.85% on OSWorld-Verified, the industry's leading desktop automation benchmark. The company claims this is a new state-of-the-art result.
Model Specifications and Availability
The flagship variant, Holo3-122B-A10B, uses a mixture-of-experts architecture with 122B total parameters but only 10B active parameters per inference step. This design aims to reduce computational cost compared to dense proprietary models.
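The efficiency claim rests on top-k expert routing: each token activates only a few experts, so per-token compute tracks the active parameters rather than the total. The sketch below is a minimal, generic illustration of that mechanism; the expert count, top-k value, and hidden size are made-up toy numbers, and H Company has not published Holo3's actual routing configuration.

```python
import numpy as np

rng = np.random.default_rng(0)

N_EXPERTS = 64   # hypothetical expert count, not disclosed for Holo3
TOP_K = 4        # experts activated per token (illustrative)
D_MODEL = 512    # hidden size, scaled down for the sketch

# Each expert is a feed-forward block; here reduced to one weight matrix each.
experts = [rng.standard_normal((D_MODEL, D_MODEL)) * 0.02 for _ in range(N_EXPERTS)]
router = rng.standard_normal((D_MODEL, N_EXPERTS)) * 0.02

def moe_layer(x):
    """Route a token vector to its top-k experts and mix their outputs."""
    logits = x @ router                      # routing score per expert
    top = np.argsort(logits)[-TOP_K:]        # indices of the k best-scoring experts
    weights = np.exp(logits[top])
    weights /= weights.sum()                 # softmax over the selected experts only
    return sum(w * (x @ experts[i]) for w, i in zip(weights, top))

token = rng.standard_normal(D_MODEL)
out = moe_layer(token)

active_fraction = TOP_K / N_EXPERTS
print(f"active experts per token: {TOP_K}/{N_EXPERTS} ({active_fraction:.1%})")
```

Only the router and the k selected experts run per token; the remaining expert weights sit idle, which is what lets total parameter count grow far faster than per-token cost.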
H Company offers two tiers of access:
- Holo3-35B-A3B: Weights openly available on Hugging Face under Apache 2.0 license, with free tier access through H Company's inference API
- Holo3-122B-A10B: Available exclusively through H Company's Inference API
Pricing has not yet been disclosed; the company positions the models as lower-cost alternatives to GPT 5.4 and Opus 4.6 but has not provided per-token figures.
Training Approach: The Agentic Learning Flywheel
Holo3 uses a three-stage training pipeline:
- Synthetic Navigation Data: Scenario-specific navigation examples generated from human and automated instructions
- Out-of-Domain Augmentation: Programmatic extension of scenarios to handle unexpected variations
- Curated Reinforcement Learning: Advanced data filtering and RL optimization applied to all training samples
The company emphasizes that its training methodology—called the "agentic learning flywheel"—focuses on two core capabilities: perception (visual grounding on UI elements) and decision-making (action sequencing).
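The three stages above compose into a data pipeline: generate trajectories, broaden them beyond seen scenarios, then filter before RL. The following is a hypothetical sketch of that flow; the function names, sample schema, and filtering rule are assumptions for illustration, since H Company has not published implementation details.

```python
def generate_navigation_data(instructions):
    """Stage 1: turn human/automated instructions into navigation trajectories."""
    return [{"instruction": i, "trajectory": [f"click:{i}"], "domain": "seen"}
            for i in instructions]

def augment_out_of_domain(samples):
    """Stage 2: programmatically vary scenarios to cover unexpected UI states."""
    variants = [{**s, "domain": "unseen"} for s in samples]
    return samples + variants

def curate_and_filter(samples, min_len=1):
    """Stage 3: drop low-quality trajectories before RL optimization."""
    return [s for s in samples if len(s["trajectory"]) >= min_len]

instructions = ["open settings", "export report"]
data = curate_and_filter(augment_out_of_domain(generate_navigation_data(instructions)))
print(len(data))  # 2 original samples + 2 augmented variants
```

The "flywheel" framing suggests the output of one pass (agent rollouts, filtered successes) feeds the next round of data generation, though the article does not detail that loop.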
Internal Benchmarking: H Corporate Benchmarks
Beyond OSWorld validation, H Company developed proprietary H Corporate Benchmarks containing 486 multi-step tasks across four categories:
- E-commerce workflows
- Business software operations
- Collaboration tools
- Multi-application workflows requiring cross-system coordination
Tasks range from single-application focus to complex multi-app scenarios—such as retrieving equipment prices from PDFs, cross-referencing employee budgets, and sending personalized approval emails. According to the company, Holo3 outperforms larger base models (including Qwen 3.5 variants) on these single-application benchmarks despite having significantly fewer parameters.
H Company built these benchmarks using a "Synthetic Environment Factory" that automatically generates websites and enterprise applications via coding agents, then validates task completion with verification scripts.
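A verification script of the kind described typically inspects the environment's end state rather than the agent's actions. The checker below is an illustrative assumption modeled on the approval-email task mentioned earlier; the state schema and checks are invented, not H Company's actual harness.

```python
def verify_approval_email(env_state):
    """Return True if the agent sent a correctly personalized approval email.

    Hypothetical state layout: the synthetic environment exposes the employee's
    email, the approved budget, and an outbox of sent messages.
    """
    email = env_state.get("outbox", [{}])[-1]
    checks = [
        email.get("to") == env_state.get("employee_email"),   # right recipient
        "approved" in email.get("subject", "").lower(),       # approval stated
        str(env_state.get("budget")) in email.get("body", ""),  # correct figure
    ]
    return all(checks)

state = {
    "employee_email": "ana@example.com",
    "budget": 1200,
    "outbox": [{"to": "ana@example.com",
                "subject": "Budget approved",
                "body": "Your request for 1200 EUR is approved."}],
}
print(verify_approval_email(state))  # True
```

Because the factory generates both the application and the expected end state, each task ships with a deterministic pass/fail check, which is what makes large-scale automated benchmarking feasible.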
What This Means
Holo3 demonstrates that specialized training for computer use tasks can rival or exceed dense proprietary models at lower parameter counts. The 78.85% OSWorld score is competitive with publicly disclosed results from other vendors, though direct comparison requires checking each vendor's evaluation methodology and benchmark version.
The mixture-of-experts architecture with 10B active parameters is operationally significant—it suggests meaningful efficiency gains in production deployment compared to dense 122B models, which could translate to lower latency and infrastructure costs.
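The efficiency argument can be made concrete with back-of-the-envelope arithmetic: per-token compute scales roughly with active parameters, so a 10B-active model does about 8% of the forward-pass work of a dense 122B model. This ignores routing overhead and, importantly, memory, since all 122B parameters must still be held for serving.

```python
# Rough per-token compute comparison implied by the article's parameter counts.
total_params = 122e9   # all experts must be stored in memory
active_params = 10e9   # parameters actually used per forward pass

active_fraction = active_params / total_params
print(f"active fraction: {active_fraction:.1%}")                 # ~8.2%
print(f"roughly {1 / active_fraction:.1f}x fewer FLOPs per token than dense 122B")
```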
The open-sourcing of Holo3-35B-A3B under Apache 2.0 gives developers access to a computer use model without licensing restrictions, though performance on OSWorld for this smaller variant was not disclosed. H Company's investment in synthetic environments and internal benchmarking suggests confidence that the model generalizes beyond these controlled settings, but real-world enterprise performance data remains absent.
The stated next frontier—"Adaptive Agency" enabling models to autonomously learn new enterprise software in real-time—remains a claim rather than a demonstrated capability.