AI2 releases MolmoWeb, open web agent matching proprietary systems with 8B parameters
The Allen Institute for AI has released MolmoWeb, a fully open web agent that operates websites using only screenshots without access to source code. The 8B-parameter model achieves 78.2% success on WebVoyager—nearly matching OpenAI's o3 at 79.3%—while being trained on one of the largest public web task datasets ever released.
All training data, model weights, and evaluation tools are freely available under the Apache 2.0 license.
Model Specifications and Performance
MolmoWeb comes in two sizes: 4B and 8B parameters, built on the Molmo2 architecture with Qwen3 as the language model and SigLIP2 as the vision encoder. Despite their compact size, both models significantly outperform the previous best open-source web agent (Fara-7B) across all tested benchmarks.
On WebVoyager, which tests navigation across 15 popular sites including GitHub and Google Flights, the 8B model achieves 78.2% success rate—only 1.1 percentage points behind OpenAI's o3 at 79.3%. On DeepShop, MolmoWeb-8B trails GPT-5 by only 6 points. Both models beat several larger proprietary agents built on GPT-4o that had access to annotated screenshots and structured page data.
Performance improves substantially with inference-time compute: running tasks multiple times and selecting the best result (pass@4) increases WebVoyager success from 78.2% to 94.7%.
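The best-of-k procedure described above can be sketched as follows. This is a minimal illustration, not AI2's code: `run_task` and `score_run` are hypothetical stand-ins for one agent rollout and a success verifier.

```python
from typing import Any, Callable, Tuple

def best_of_k(run_task: Callable[[], Any],
              score_run: Callable[[Any], float],
              k: int = 4) -> Tuple[Any, float]:
    """Run the same task k times and keep the highest-scoring rollout
    (the pass@4 setting corresponds to k=4)."""
    best_result, best_score = None, float("-inf")
    for _ in range(k):
        result = run_task()        # one independent agent attempt
        score = score_run(result)  # verifier's success estimate
        if score > best_score:
            best_result, best_score = result, score
    return best_result, best_score
```

The gain from 78.2% to 94.7% reflects that independent attempts fail on different tasks, so selecting the best of four recovers many single-run failures.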
Training Approach and Dataset
The model was trained using supervised fine-tuning on 64 H100 GPUs with no reinforcement learning or distillation from proprietary systems. Training relied on a novel hybrid approach combining human demonstrations with automatically generated runs.
MolmoWeb's primary contribution may be MolmoWebMix, described as the largest public dataset of human web task execution available. The dataset includes:
- 36,000 complete task runs across 1,100+ websites from crowdworkers
- Over 2.2 million screenshot-question-answer pairs for web content understanding
- Over 7 million UI element localization examples
- Automatically generated synthetic runs using a three-role system: Gemini 2.5 Flash as planner, an operator for browser actions, and GPT-4o as verifier
A counterintuitive finding: synthetic browsing runs outperformed human demonstrations on identical tasks. Researchers attribute this to humans taking detours on unfamiliar sites while automated agents find more direct paths. Data ablations show that just 10% of the dataset delivers 85-90% of final performance.
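The three-role synthetic pipeline can be sketched as a simple generate-and-filter loop. The function and role names below are illustrative assumptions, not AI2's actual implementation:

```python
def generate_synthetic_runs(planner, operator, verifier, n_tasks):
    """Sketch of a planner/operator/verifier pipeline for synthetic runs.

    planner()                   -> task description (the planner role)
    operator(task)              -> trajectory of (screenshot, action) steps
    verifier(task, trajectory)  -> bool, did the run succeed (the verifier role)
    """
    accepted = []
    for _ in range(n_tasks):
        task = planner()
        trajectory = operator(task)
        if verifier(task, trajectory):  # keep only verified successes
            accepted.append((task, trajectory))
    return accepted
```

Filtering through a verifier is one plausible reason synthetic runs beat human ones: only direct, successful trajectories survive into the training set.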
How MolmoWeb Operates
Unlike proprietary web agents that access page source code or DOM structure, MolmoWeb works exclusively with what humans see on screen—receiving only screenshots, task descriptions, and action history. The agent formulates a thought, then performs the next action: clicking, scrolling, typing, switching tabs, or entering URLs. It then captures a new screenshot and repeats.
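The observe-think-act loop described above can be sketched in a few lines. All names here (`browser`, `model`, the action kinds) are illustrative assumptions about the interface, not MolmoWeb's actual API:

```python
def run_agent(task, browser, model, max_steps=50):
    """Screenshot-only agent loop: observe, reason, act, repeat."""
    history = []
    for _ in range(max_steps):
        screenshot = browser.screenshot()  # the only page observation
        thought, action = model.step(task, screenshot, history)
        if action.kind == "done":
            return action.result
        browser.execute(action)  # click / scroll / type / switch_tab / goto
        history.append((thought, action))
    return None  # step budget exhausted without finishing
```

Note that the model never sees the DOM: each step conditions only on the task, the latest screenshot, and the accumulated thought/action history.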
AI2 argues this screenshot-only approach is more robust since visual appearance changes less frequently than underlying code, and makes agent reasoning easier to audit and follow.
Limitations and Safety Measures
MolmoWeb cannot reliably read all text in screenshots, and its performance degrades with vague instructions or tasks combining multiple constraints. The team deliberately excluded tasks requiring logins or financial transactions from the training data.
The hosted demo enforces guardrails: blocking password and credit card fields, limiting access to certain websites, and using Google's interface for content screening. These restrictions apply to the demo only, not the model itself.
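A field-blocking guardrail of the kind the demo applies could look like the sketch below. The blocked tokens and function name are hypothetical examples, not the demo's actual rules:

```python
# Illustrative deny-list of sensitive input types (not AI2's actual list)
BLOCKED_FIELD_TYPES = {"password", "cc-number", "cc-csc"}

def allow_type_action(field_type: str, field_name: str) -> bool:
    """Demo-style guardrail: refuse to type into sensitive form fields."""
    if field_type in BLOCKED_FIELD_TYPES:
        return False
    name = field_name.lower()
    # Also catch fields whose names suggest credentials or payment data
    return not any(token in name for token in ("password", "credit", "cvv"))
```

Such checks sit outside the model, which is why they protect the hosted demo but not arbitrary self-hosted deployments of the weights.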
AI2 acknowledges unanswered questions remain: how web agents should handle terms of service, prevent access to illegal content, and avoid taking irreversible actions. The team argues full openness enables broader community collaboration on these safety challenges.
What This Means
MolmoWeb demonstrates that open-source web agents can nearly match proprietary performance using substantially smaller models and public training data. By releasing the dataset, model weights, and code, AI2 removes the primary bottleneck to open-source progress in web automation. An 8B model scoring 78.2% on WebVoyager against OpenAI's proprietary systems suggests diminishing returns to scale in this domain. However, questions about responsible deployment, especially around financial transactions, authentication, and unauthorized access, remain critical unresolved issues that openness alone does not solve.