model release

AI2 releases MolmoWeb, open web agent matching proprietary systems with 8B parameters

TL;DR

The Allen Institute for AI has released MolmoWeb, a fully open web agent that operates websites using only screenshots without access to source code. The 8B-parameter model achieves 78.2% success on WebVoyager—nearly matching OpenAI's o3 at 79.3%—while being trained on one of the largest public web task datasets ever released.


AI2 Releases MolmoWeb: Fully Open Web Agent Matching Proprietary Performance

The Allen Institute for AI (AI2) has released MolmoWeb, a fully open web agent that navigates websites using only screenshots, with all training data, model weights, and evaluation tools freely available under Apache 2.0 license.

Model Specifications and Performance

MolmoWeb comes in two sizes: 4B and 8B parameters, built on the Molmo2 architecture with Qwen3 as the language model and SigLIP2 as the vision encoder. Despite their compact size, both models significantly outperform the previous best open-source web agent (Fara-7B) across all tested benchmarks.

On WebVoyager, which tests navigation across 15 popular sites including GitHub and Google Flights, the 8B model achieves 78.2% success rate—only 1.1 percentage points behind OpenAI's o3 at 79.3%. On DeepShop, MolmoWeb-8B trails GPT-5 by only 6 points. Both models beat several larger proprietary agents built on GPT-4o that had access to annotated screenshots and structured page data.

Performance improves substantially with inference-time compute: running tasks multiple times and selecting the best result (pass@4) increases WebVoyager success from 78.2% to 94.7%.
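
In concrete terms, pass@k retries the whole task and counts it solved if any attempt passes a final check. A minimal sketch of that selection logic, assuming hypothetical run_task and verify callables that are not part of the MolmoWeb release:

    from typing import Callable

    def pass_at_k(
        run_task: Callable[[str], list],      # one full browsing attempt -> trajectory
        verify: Callable[[str, list], bool],  # judges whether the trajectory solved the task
        task: str,
        k: int = 4,
    ) -> bool:
        # The task counts as solved if any of the k attempts verifies.
        return any(verify(task, run_task(task)) for _ in range(k))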

Training Approach and Dataset

The model was trained using supervised fine-tuning on 64 H100 GPUs with no reinforcement learning or distillation from proprietary systems. Training relied on a novel hybrid approach combining human demonstrations with automatically generated runs.

MolmoWeb's primary contribution may be MolmoWebMix, described as the largest public dataset of human web task execution available. The dataset includes:

  • 36,000 complete task runs across 1,100+ websites from crowdworkers
  • Over 2.2 million screenshot-question-answer pairs for web content understanding
  • Over 7 million UI element localization examples
  • Automatically generated synthetic runs using a three-role system: Gemini 2.5 Flash as planner, an operator for browser actions, and GPT-4o as verifier (see the sketch after this list)
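
A minimal sketch of how that planner/operator/verifier pipeline might be wired together; the role assignments come from the release, but the function names, data shapes, and control flow here are assumptions:

    from dataclasses import dataclass, field

    @dataclass
    class SyntheticRun:
        task: str
        actions: list = field(default_factory=list)
        verified: bool = False

    def generate_run(task, planner, operator, verifier, max_steps=30):
        # planner (Gemini 2.5 Flash) decomposes the task into steps,
        # operator executes each step as a browser action, and
        # verifier (GPT-4o) judges whether the finished run succeeded.
        run = SyntheticRun(task)
        for step in planner(task)[:max_steps]:
            run.actions.append(operator(step))
        run.verified = verifier(task, run.actions)
        return run

Presumably only runs the verifier accepts are retained as training data, though the release does not spell out the filtering rule.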

A counterintuitive finding: synthetic browsing runs outperformed human demonstrations on identical tasks. Researchers attribute this to humans taking detours on unfamiliar sites while automated agents find more direct paths. Data ablations show that just 10% of the dataset delivers 85-90% of final performance.

How MolmoWeb Operates

Unlike proprietary web agents that access page source code or DOM structure, MolmoWeb works exclusively with what humans see on screen—receiving only screenshots, task descriptions, and action history. The agent formulates a thought, then performs the next action: clicking, scrolling, typing, switching tabs, or entering URLs. It then captures a new screenshot and repeats.
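
A minimal sketch of that observe-think-act loop, assuming placeholder model and browser interfaces rather than the actual MolmoWeb API:

    def agent_loop(task, model, browser, max_steps=50):
        # Screenshot-only operation: the model sees pixels, the task,
        # and its own action history; never the page source or DOM.
        history = []
        for _ in range(max_steps):
            screenshot = browser.screenshot()                   # assumed browser helper
            thought, action = model(screenshot, task, history)  # thought + next action
            history.append({"thought": thought, "action": action})
            if action.get("type") == "stop":                    # agent decides it is done
                break
            browser.execute(action)  # click, scroll, type, switch tab, or enter URL
        return history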

AI2 argues this screenshot-only approach is more robust since visual appearance changes less frequently than underlying code, and makes agent reasoning easier to audit and follow.

Limitations and Safety Measures

MolmoWeb cannot reliably read all text in screenshots, and its performance degrades on vague instructions or tasks that combine multiple constraints. The team deliberately excluded tasks requiring logins or financial transactions from the training data.

The hosted demo enforces guardrails: blocking password and credit card fields, limiting access to certain websites, and using Google's interface for content screening. These restrictions apply to the demo only, not the model itself.
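
As one illustration, the field-blocking guardrail could be a simple pre-dispatch check on each action; the field names below are illustrative, since the demo's actual checks are not published:

    BLOCKED_FIELDS = {"password", "cc-number", "cc-csc"}  # illustrative sensitive field kinds

    def is_action_allowed(action: dict) -> bool:
        # Refuse to type into password or payment fields; allow everything else.
        return not (action.get("type") == "type"
                    and action.get("field_kind") in BLOCKED_FIELDS)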

AI2 acknowledges unanswered questions remain: how web agents should handle terms of service, prevent access to illegal content, and avoid taking irreversible actions. The team argues full openness enables broader community collaboration on these safety challenges.

What This Means

MolmoWeb demonstrates that open-source web agents can nearly match proprietary performance using substantially smaller models and public training data. By releasing the dataset, model weights, and code, AI2 removes the primary bottleneck holding back open-source progress in web automation. The 78.2% WebVoyager score from an 8B model, set against OpenAI's proprietary systems, suggests diminishing returns to scale in this domain. However, questions about responsible deployment, especially around financial transactions, authentication, and unauthorized access, remain critical unresolved issues that openness alone does not solve.
