AI2 releases MolmoWeb, open web agent matching proprietary systems with 8B parameters

TL;DR

The Allen Institute for AI has released MolmoWeb, a fully open web agent that operates websites using only screenshots, without access to page source code. The 8B-parameter model achieves a 78.2% success rate on WebVoyager, close to the 79.3% of OpenAI's o3, and was trained on one of the largest public web task datasets ever released.

AI2 Releases MolmoWeb: Fully Open Web Agent Matching Proprietary Performance

The Allen Institute for AI (AI2) has released MolmoWeb, a fully open web agent that navigates websites using only screenshots, with all training data, model weights, and evaluation tools freely available under the Apache 2.0 license.

Model Specifications and Performance

MolmoWeb comes in two sizes: 4B and 8B parameters, built on the Molmo2 architecture with Qwen3 as the language model and SigLIP2 as the vision encoder. Despite their compact size, both models significantly outperform the previous best open-source web agent (Fara-7B) across all tested benchmarks.

On WebVoyager, which tests navigation across 15 popular sites including GitHub and Google Flights, the 8B model achieves a 78.2% success rate, only 1.1 percentage points behind OpenAI's o3 at 79.3%. On DeepShop, MolmoWeb-8B trails GPT-5 by only 6 points. Both models beat several larger proprietary agents built on GPT-4o that had access to annotated screenshots and structured page data.

Performance improves substantially with additional inference-time compute: running each task four times and counting it as solved if any attempt succeeds (pass@4) raises WebVoyager success from 78.2% to 94.7%.
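
As a rough illustration of the pass@4 figure above, the sketch below computes pass@k from per-task attempt outcomes, assuming the common definition in which a task counts as solved when at least one of its first k attempts succeeds. The function name and the toy outcomes are made up for illustration and are not taken from AI2's evaluation code.

```python
from typing import Sequence


def pass_at_k(outcomes: Sequence[Sequence[bool]], k: int) -> float:
    """Fraction of tasks solved by at least one of the first k attempts."""
    if k < 1:
        raise ValueError("k must be at least 1")
    if not outcomes:
        raise ValueError("need at least one task")
    # A task is solved if any of its first k attempts succeeded.
    solved = sum(1 for attempts in outcomes if any(attempts[:k]))
    return solved / len(outcomes)


# Toy data (invented): 4 tasks, 4 attempts each.
toy = [
    [True, True, False, True],
    [False, True, False, False],
    [False, False, False, False],
    [True, False, True, True],
]

print(pass_at_k(toy, 1))  # 0.5
print(pass_at_k(toy, 4))  # 0.75
```

For context, if every attempt succeeded independently with probability 0.782, pass@4 would be 1 - (1 - 0.782)^4, or about 0.998. The reported 94.7% is lower, which is consistent with some tasks failing on repeated attempts rather than at random.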

Training Approach and Dataset

The model was trained using supervised fine-tuning on 64 H100 GPUs with no reinforcement learning or distillation from proprietary systems. Training relied on a novel hybrid approach combining human demonstrations with automatically generated runs.

MolmoWeb's primary contribution may be MolmoWebMix, described as the largest public dataset of human web task execution available. The dataset includes:

  • 36,000 complete task runs across 1,100+ websites from crowdworkers
  • Over 2.2 million screenshot-question-answer pairs for web content understanding
  • Over 7 million UI element localization examples
  • Automatically generated synthetic runs using a three-role system: Gemini 2.5 Flash as planner, an operator for browser actions, and GPT-4o as verifier
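
The three-role setup in the last item can be pictured with a minimal sketch. It assumes the planner breaks a task into subgoals, the operator carries each one out in the browser, and the verifier decides whether the finished run is kept. The article does not describe how the roles actually interact, so every name and signature here is hypothetical, with plain functions standing in for the models.

```python
from typing import Callable, List, Optional

# Hypothetical role signatures; the real pipeline may differ.
Planner = Callable[[str], List[str]]         # task -> ordered subgoals
Operator = Callable[[str], str]              # subgoal -> log of the browser action taken
Verifier = Callable[[str, List[str]], bool]  # (task, action logs) -> keep this run?


def generate_run(
    task: str, planner: Planner, operator: Operator, verifier: Verifier
) -> Optional[List[str]]:
    """Return the action logs for one synthetic run, or None if the verifier rejects it."""
    logs = [operator(subgoal) for subgoal in planner(task)]
    return logs if verifier(task, logs) else None


# Stub roles standing in for the planner model, browser operator, and verifier model.
def toy_planner(task: str) -> List[str]:
    return [f"open site for: {task}", "read the answer"]


def toy_operator(subgoal: str) -> str:
    return f"did: {subgoal}"


def accept_all(task: str, logs: List[str]) -> bool:
    return len(logs) > 0


def reject_all(task: str, logs: List[str]) -> bool:
    return False


kept = generate_run("check the weather", toy_planner, toy_operator, accept_all)
dropped = generate_run("check the weather", toy_planner, toy_operator, reject_all)
print(kept)
print(dropped)
```

Only runs the verifier accepts would enter the training set under this reading, so failed attempts are filtered out without a human reviewing them.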

A counterintuitive finding: synthetic browsing runs outperformed human demonstrations on identical tasks. Researchers attribute this to humans taking detours on unfamiliar sites while automated agents find more direct paths. Data ablations show that just 10% of the dataset delivers 85-90% of final performance.

How MolmoWeb Operates

Unlike proprietary web agents that access page source code or DOM structure, MolmoWeb works exclusively with what humans see on screen—receiving only screenshots, task descriptions, and action history. The agent formulates a thought, then performs the next action: clicking, scrolling, typing, switching tabs, or entering URLs. It then captures a new screenshot and repeats.
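
The loop described above can be sketched as follows. This is a minimal illustration, not MolmoWeb's code: the `Action` and `Step` types, the `run_agent` function, the `"done"` stop action, and the step limit are all assumptions, and a scripted policy stands in for the model.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Action:
    kind: str            # e.g. "click", "scroll", "type", "switch_tab", "goto"
    argument: str = ""


@dataclass
class Step:
    thought: str
    action: Action


# The policy sees only the screenshot, the task text, and the history so far.
Policy = Callable[[bytes, str, List[Step]], Step]


def run_agent(
    task: str,
    policy: Policy,
    take_screenshot: Callable[[], bytes],
    execute: Callable[[Action], None],
    max_steps: int = 20,
) -> List[Step]:
    """Screenshot -> thought + action -> execute, repeated until done or out of steps."""
    history: List[Step] = []
    for _ in range(max_steps):
        screenshot = take_screenshot()
        step = policy(screenshot, task, history)
        history.append(step)
        if step.action.kind == "done":  # assumed stop signal, not from the article
            break
        execute(step.action)
    return history


# Stand-ins for a real browser and a real model.
def fake_screenshot() -> bytes:
    return b"fake-png"


executed: List[Action] = []


def scripted_policy(screenshot: bytes, task: str, history: List[Step]) -> Step:
    script = [
        Step("Open the site", Action("goto", "https://example.com")),
        Step("The search box is visible", Action("type", "flights")),
        Step("Results are shown", Action("done")),
    ]
    return script[len(history)]


trace = run_agent("find flights", scripted_policy, fake_screenshot, executed.append)
for step in trace:
    print(step.thought, "->", step.action.kind)
```

Because the policy's inputs are limited to pixels, the task, and prior steps, each recorded thought can be read alongside the screenshot that prompted it, which is the auditability argument AI2 makes.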

AI2 argues that this screenshot-only approach is more robust, since a page's visual appearance changes less often than its underlying code, and that it makes the agent's reasoning easier to audit and follow.

Limitations and Safety Measures

MolmoWeb cannot reliably read all text in screenshots, and its performance degrades when instructions are vague or involve multiple constraints. The team deliberately excluded tasks requiring logins or financial transactions from the training data.

The hosted demo enforces guardrails: blocking password and credit card fields, limiting access to certain websites, and using Google's interface for content screening. These restrictions apply to the demo only, not the model itself.

AI2 acknowledges that open questions remain, including how web agents should handle terms of service, prevent access to illegal content, and avoid taking irreversible actions. The team argues that full openness enables broader community collaboration on these safety challenges.

What This Means

MolmoWeb demonstrates that open-source web agents can nearly match proprietary performance using substantially smaller models and public training data. By releasing the dataset, model weights, and code, AI2 removes what has been a primary bottleneck for open-source progress in web automation. An 8B model scoring 78.2% on WebVoyager, against 79.3% for OpenAI's o3, suggests diminishing returns to scale in this domain. However, questions about responsible deployment, especially around financial transactions, authentication, and unauthorized access, remain critical unresolved issues that openness alone doesn't solve.
