Zhipu AI releases GLM-5V-Turbo: multimodal model generates front-end code from design mockups
Zhipu AI released GLM-5V-Turbo, a multimodal coding model that converts design mockups directly into executable front-end code. The model processes images, video, and text with a 200,000-token context window and 128,000-token max output, priced at $1.20 per million input tokens and $4 per million output tokens.
Zhipu AI has released GLM-5V-Turbo, a natively multimodal foundation model that generates executable front-end code directly from design mockups, images, and video inputs. The model is purpose-built for agent workflows and is available via API at $1.20 per million input tokens and $4 per million output tokens, the same pricing as the text-only GLM-5-Turbo.
Core Specifications
GLM-5V-Turbo processes multimodal inputs through a proprietary vision encoder called CogViT, integrated directly into the architecture rather than bolted on after training. The model features:
- Context window: 200,000 tokens
- Maximum output: 128,000 tokens
- Key features: Thinking mode, streaming output, function calling, and context caching
- Availability: API-only via the Z.AI platform; no open weights announced
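Since access is API-only, a request is the natural starting point. The sketch below sends a mockup screenshot to an OpenAI-compatible chat completions endpoint and streams back generated front-end code. The base URL, the model id glm-5v-turbo, and the message format are assumptions modeled on Zhipu's existing OpenAI-compatible APIs; verify them against the Z.AI platform documentation.

```python
import base64
from openai import OpenAI  # generic OpenAI-compatible client

# Hedged sketch: base_url and model id are assumptions, not confirmed values.
client = OpenAI(api_key="YOUR_ZAI_KEY", base_url="https://api.z.ai/api/paas/v4/")

# Encode the design mockup as a base64 data URL.
with open("mockup.png", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode()

stream = client.chat.completions.create(
    model="glm-5v-turbo",  # assumed model id
    messages=[{
        "role": "user",
        "content": [
            {"type": "image_url",
             "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
            {"type": "text",
             "text": "Generate a complete, runnable HTML/CSS page that "
                     "matches this mockup."},
        ],
    }],
    stream=True,  # the model supports streaming output
)

# Print the generated front-end code as it streams in.
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
```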
Architecture and Training
Zhipu AI attributes the performance gains to four improvements: an integrated architecture that processes images and text together from the start of training; a new vision encoder (CogViT); multi-token prediction for faster inference; and reinforcement learning across more than 30 task types, including STEM, grounding, video, GUI agents, and coding agents.
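Multi-token prediction generally means drafting several future tokens per forward pass and then verifying them, so accepted drafts cost one pass instead of several. The toy sketch below illustrates only that draft-and-verify loop; the random drafter and verifier are stand-ins, not Zhipu's implementation.

```python
import random

# Toy draft-and-verify loop in the spirit of multi-token prediction (MTP).
# In a real model, draft() would come from auxiliary prediction heads and
# verify() from the main head's logits; here both are random stand-ins.

def draft(ids, k):
    """Propose k speculative next tokens in one 'forward pass'."""
    return [random.randrange(100) for _ in range(k)]

def verify(ids, proposed):
    """Accept a prefix of the drafted tokens, stopping at the first rejection."""
    accepted = []
    for tok in proposed:
        if random.random() < 0.7:  # stand-in for an agreement check
            accepted.append(tok)
        else:
            break
    return accepted

def generate(prompt, k=3, max_new=16):
    ids = list(prompt)
    while len(ids) - len(prompt) < max_new:
        kept = verify(ids, draft(ids, k))
        ids += kept if kept else [random.randrange(100)]  # fallback: one token
    return ids[: len(prompt) + max_new]

print(generate([1, 2, 3]))
```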
The company built a multi-level, controllable data system to address the shortage of agent training data, embedding agentic meta-skills during pre-training. A multimodal toolchain extends the model's capabilities from text to visual interaction, including box drawing, screenshots, website reading, and image understanding.
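For the grounding part of that toolchain, a request might look like the sketch below, which asks for a bounding box around a UI element in a screenshot. The endpoint, model id, and the JSON format the prompt requests are illustrative assumptions, not documented behavior.

```python
import base64
import json
from openai import OpenAI

# Hedged grounding sketch: endpoint, model id, and the requested JSON
# schema are illustrative assumptions, not documented behavior.
client = OpenAI(api_key="YOUR_ZAI_KEY", base_url="https://api.z.ai/api/paas/v4/")

with open("screenshot.png", "rb") as f:
    img = base64.b64encode(f.read()).decode()

reply = client.chat.completions.create(
    model="glm-5v-turbo",  # assumed model id
    messages=[{
        "role": "user",
        "content": [
            {"type": "image_url",
             "image_url": {"url": f"data:image/png;base64,{img}"}},
            {"type": "text",
             "text": 'Return the bounding box of the "Submit" button as JSON: '
                     '{"x1": int, "y1": int, "x2": int, "y2": int}'},
        ],
    }],
)
box = json.loads(reply.choices[0].message.content)  # assumes clean JSON back
print(box)
```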
Claimed Benchmark Performance
According to Zhipu AI, GLM-5V-Turbo leads in most multimodal coding and tool usage benchmarks. The model reportedly scores well on:
- Design-to-code generation and visual code generation
- Multimodal search and visual exploration
- AndroidWorld and WebVoyager (real GUI navigation benchmarks)
- PinchBench, ClawEval, and ZClawBench (task execution quality)
Claude Opus 4.6 reportedly outperforms GLM-5V-Turbo on some benchmarks, including Flame-VLM-Code and OSWorld. In text-only coding tasks, the company claims no performance drop despite the added visual capabilities: the model maintains its strength across CC-Bench-V2 (backend, frontend, repo exploration) while outperforming its text-only predecessor GLM-5-Turbo and competitor Kimi K2.5 in several categories.
Important note: Independent evaluations are still pending. All performance claims come directly from Zhipu AI.
Use Cases
GLM-5V-Turbo targets specific workflows:
- Design-to-code: Converts design mockups into complete, runnable front-end projects with pixel-perfect visual consistency
- Autonomous GUI exploration: Paired with Claude Code or OpenClaw, the model can search websites independently, map page transitions, collect visual assets, and write code
- Debugging: Screenshots broken pages, identifies rendering issues (layout shifts, overlaps, color mismatches), and generates fixes
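A minimal sketch of that debugging loop might pair a headless-browser screenshot with a diagnosis request, as below. The local URL, endpoint, and model id are assumptions, and Playwright is just one way to capture the page.

```python
import base64
from openai import OpenAI
from playwright.sync_api import sync_playwright

# Hedged sketch of the screenshot-and-diagnose loop. The local URL, endpoint,
# and model id are assumptions; Playwright is one of several capture options.

def capture(url: str, path: str = "page.png") -> str:
    """Screenshot a rendered page with a headless browser."""
    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page(viewport={"width": 1280, "height": 800})
        page.goto(url)
        page.screenshot(path=path, full_page=True)
        browser.close()
    return path

client = OpenAI(api_key="YOUR_ZAI_KEY", base_url="https://api.z.ai/api/paas/v4/")

with open(capture("http://localhost:3000"), "rb") as f:
    img = base64.b64encode(f.read()).decode()

reply = client.chat.completions.create(
    model="glm-5v-turbo",  # assumed model id
    messages=[{
        "role": "user",
        "content": [
            {"type": "image_url",
             "image_url": {"url": f"data:image/png;base64,{img}"}},
            {"type": "text",
             "text": "List any layout shifts, overlapping elements, or color "
                     "mismatches on this page and propose CSS fixes."},
        ],
    }],
)
print(reply.choices[0].message.content)
```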
The model integrates with the OpenClaw agent framework and offers official skills via ClawHub, including image captioning, visual grounding, document writing, resume screening, and prompt generation.
Context: GLM-5 Lineage
GLM-5V-Turbo builds on Zhipu AI's recent releases. GLM-5-Turbo (text-only) launched for the OpenClaw ecosystem, improving tool calls and long task chain execution. Before that, GLM-5—an open-source 744-billion-parameter model under MIT license—launched in February. According to Zhipu, GLM-5 achieved 77.8% on SWE-bench Verified (compared to Claude Opus 4.5's 80.9%) and runs on Huawei chips alongside Nvidia GPUs, an advantage given US export restrictions on semiconductors to China.
What This Means
GLM-5V-Turbo marks a direct technical pivot toward vision-integrated code generation, eliminating the intermediate step of converting design visuals into text descriptions before coding. The model's integration into agent frameworks (Claude Code, OpenClaw) and API pricing that matches the text-only models signal Zhipu AI's confidence that the added visual capabilities do not degrade pure text performance. However, the performance claims remain unvalidated by independent benchmarking. The design-to-code capability targets a concrete workflow gap in front-end development, but real-world execution quality (pixel accuracy, responsive design handling) requires verification beyond company claims.