
WAFFLE fine-tuning improves multimodal models for web development by up to 9 percentage points

Researchers introduce WAFFLE, a fine-tuning methodology that enhances multimodal models' ability to convert UI designs into HTML code. The approach uses structure-aware attention mechanisms and contrastive learning to bridge the gap between visual UI designs and text-based HTML, achieving up to 9 percentage point improvements on benchmark tasks.



Researchers have published a new fine-tuning methodology called WAFFLE that addresses two fundamental challenges in automated front-end development: representing HTML's hierarchical structure effectively and aligning visual UI designs with text-based code.

The Problem

Converting UI designs into functional HTML remains difficult despite advances in large language models. The core issues are structural: HTML uses nested hierarchies that LLMs struggle to represent accurately, and there's a semantic gap between how designers think visually and how developers write code textually.

WAFFLE's Approach

The methodology combines two key innovations:

  1. Structure-aware attention mechanism: Improves how LLMs understand and generate HTML's hierarchical relationships, enabling more accurate parent-child element nesting and proper DOM structure.

  2. Contrastive fine-tuning: Aligns the model's understanding of visual UI images with corresponding HTML code by training on paired examples that emphasize the relationship between design intent and code implementation.
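
The paper does not spell out its exact loss here, but the contrastive objective it describes is typically implemented as a symmetric InfoNCE loss over a batch of paired (UI image, HTML) embeddings. The following is a minimal sketch of that standard formulation; the function name, batch layout, and temperature value are illustrative, not taken from WAFFLE:

```python
import numpy as np

def info_nce_loss(img_emb, html_emb, temperature=0.07):
    """Symmetric InfoNCE loss over paired (UI image, HTML) embeddings.

    Matched pairs sit on the diagonal of the similarity matrix; the loss
    pulls each image embedding toward its own HTML embedding and pushes
    it away from the other HTML snippets in the batch.
    """
    # L2-normalize so the dot product becomes cosine similarity
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    html = html_emb / np.linalg.norm(html_emb, axis=1, keepdims=True)
    logits = img @ html.T / temperature  # shape: (batch, batch)

    def xent(l):
        # Cross-entropy with the diagonal (matched pair) as the target
        l = l - l.max(axis=1, keepdims=True)  # numerical stability
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_probs))

    # Average the image-to-HTML and HTML-to-image directions
    return (xent(logits) + xent(logits.T)) / 2
```

When the two embedding batches are perfectly aligned, the diagonal similarities dominate and the loss approaches zero; mismatched pairs drive it up, which is exactly the training signal that aligns design intent with code.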

Benchmark Results

Models fine-tuned with WAFFLE demonstrated significant improvements on both new and existing benchmarks:

  • HTML Match: +9.00 percentage points (direct code correctness)
  • CW-SSIM: +0.0982 (structural similarity)
  • CLIP score: +32.99 (visual-semantic alignment)
  • LLEM: +27.12 percentage points (layout correctness)

These improvements were measured on two evaluation frameworks for UI-to-HTML generation: the researchers' newly introduced WebSight-Test benchmark and the existing Design2Code benchmark.
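
The paper defines HTML Match precisely; as a deliberately simplified illustration of the underlying idea, a structural comparison can ignore text and attributes and check only whether two snippets produce the same tag hierarchy. The class and function names below are hypothetical, not the paper's metric implementation:

```python
from html.parser import HTMLParser

class TagTree(HTMLParser):
    """Records each opening tag with its nesting depth, ignoring text
    and attributes, so two snippets can be compared structurally."""
    def __init__(self):
        super().__init__()
        self.depth = 0
        self.shape = []

    def handle_starttag(self, tag, attrs):
        self.shape.append((self.depth, tag))
        self.depth += 1

    def handle_endtag(self, tag):
        self.depth -= 1

def html_match(a, b):
    """True when two HTML snippets share the same tag hierarchy."""
    ta, tb = TagTree(), TagTree()
    ta.feed(a)
    tb.feed(b)
    return ta.shape == tb.shape
```

Under this toy metric, `<div><p>hi</p></div>` matches `<div><p>bye</p></div>` (same structure, different text) but not `<div><span>hi</span></div>` (different tag). The real HTML Match metric is stricter, but the intuition, rewarding correct structure rather than surface text, is the same.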

Implications for Development Workflow

The research targets a specific but valuable problem in front-end development. As design-to-code automation becomes more prevalent in tools like Figma plugins and browser-based IDEs, more accurate generated HTML directly reduces the manual editing developers must do and the likelihood of bugs in layout code.

The structure-aware attention mechanism is particularly significant because HTML's nested structure has been a persistent weak point for sequence models. By explicitly modeling this hierarchy during fine-tuning, WAFFLE addresses a fundamental limitation rather than applying general improvements.
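
The paper's exact attention formulation is not reproduced in this summary, but one common way to encode DOM hierarchy into attention is a mask derived from parent pointers, letting each token attend to itself and its ancestors. The toy DOM and function below are a hypothetical sketch of that idea, not WAFFLE's implementation:

```python
import numpy as np

# Hypothetical toy DOM, index -> parent index (-1 for the root):
# <html><body><div><p/></div><span/></body></html>
parents = [-1, 0, 1, 2, 1]  # html, body, div, p, span

def ancestor_mask(parents):
    """Boolean mask where mask[i, j] is True iff token j is token i
    itself or one of i's ancestors in the DOM tree. A mask like this
    can bias self-attention toward hierarchical relationships."""
    n = len(parents)
    mask = np.zeros((n, n), dtype=bool)
    for i in range(n):
        j = i
        while j != -1:  # walk up the parent chain to the root
            mask[i, j] = True
            j = parents[j]
    return mask
```

For the toy DOM above, the `<p>` token can attend to `<div>`, `<body>`, and `<html>`, but the sibling `<span>` cannot attend to `<p>`; that asymmetry is what makes parent-child nesting explicit to the model rather than something it must infer from token order.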

What This Means

WAFFLE represents incremental but meaningful progress in converting visual designs to production-quality code. A 9-point improvement in HTML match rates suggests that fine-tuned models could handle a larger percentage of straightforward UI conversions without human intervention. However, the methodology is evaluated on academic benchmarks; real-world performance depends on the complexity and variety of designs encountered in production systems. The work is most immediately applicable to companies building design-to-code tools and may inform future versions of code generation models that handle multimodal input.
