Zhipu AI releases GLM-5V-Turbo: multimodal model generates front-end code from design mockups
Zhipu AI released GLM-5V-Turbo, a multimodal coding model that converts design mockups directly into executable front-end code. The model processes images, video, and text with a 200,000-token context window and 128,000-token max output, priced at $1.20 per million input tokens and $4 per million output tokens.
Zhipu AI has released GLM-5V-Turbo, a natively multimodal foundation model that generates executable front-end code directly from design mockups, images, and video inputs. The model is purpose-built for agent workflows and is available via API at $1.20 per million input tokens and $4 per million output tokens, the same pricing as the text-only GLM-5-Turbo.
Core Specifications
GLM-5V-Turbo processes multimodal inputs through a proprietary vision encoder called CogViT, integrated directly into the architecture rather than bolted on after training. The model features:
- Context window: 200,000 tokens
- Maximum output: 128,000 tokens
- Key features: Thinking mode, streaming output, function calling, and context caching
- Availability: API-only via the Z.AI platform; no open weights announced
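Since access is API-only, a request is the natural starting point. The sketch below sends a mockup screenshot to an OpenAI-compatible chat completions endpoint and streams back generated front-end code. The base URL, the model id glm-5v-turbo, and the message format are assumptions modeled on Zhipu's existing OpenAI-compatible APIs; verify them against the Z.AI platform documentation.

```python
import base64
from openai import OpenAI  # generic OpenAI-compatible client

# Hedged sketch: base_url and model id are assumptions, not confirmed values.
client = OpenAI(api_key="YOUR_ZAI_KEY", base_url="https://api.z.ai/api/paas/v4/")

# Encode the design mockup as a base64 data URL.
with open("mockup.png", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode()

stream = client.chat.completions.create(
    model="glm-5v-turbo",  # assumed model id
    messages=[{
        "role": "user",
        "content": [
            {"type": "image_url",
             "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
            {"type": "text",
             "text": "Generate a complete, runnable HTML/CSS page that "
                     "matches this mockup."},
        ],
    }],
    stream=True,  # the model supports streaming output
)

# Print the generated front-end code as it streams in.
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
```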
Architecture and Training
Zhipu AI attributes the performance gains to four improvements: an integrated architecture that processes images and text together from the start of training; a new vision encoder (CogViT); multi-token prediction for faster inference; and reinforcement learning across more than 30 task types, including STEM, grounding, video, GUI agents, and coding agents.
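Multi-token prediction generally means drafting several future tokens per forward pass and then verifying them, so accepted drafts cost one pass instead of several. The toy sketch below illustrates only that draft-and-verify loop; the random drafter and verifier are stand-ins, not Zhipu's implementation.

```python
import random

# Toy draft-and-verify loop in the spirit of multi-token prediction (MTP).
# In a real model, draft() would come from auxiliary prediction heads and
# verify() from the main head's logits; here both are random stand-ins.

def draft(ids, k):
    """Propose k speculative next tokens in one 'forward pass'."""
    return [random.randrange(100) for _ in range(k)]

def verify(ids, proposed):
    """Accept a prefix of the drafted tokens, stopping at the first rejection."""
    accepted = []
    for tok in proposed:
        if random.random() < 0.7:  # stand-in for an agreement check
            accepted.append(tok)
        else:
            break
    return accepted

def generate(prompt, k=3, max_new=16):
    ids = list(prompt)
    while len(ids) - len(prompt) < max_new:
        kept = verify(ids, draft(ids, k))
        ids += kept if kept else [random.randrange(100)]  # fallback: one token
    return ids[: len(prompt) + max_new]

print(generate([1, 2, 3]))
```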
The company built a multi-level, controllable data system to address the shortage of agent training data, embedding agentic meta-skills during pre-training. A multimodal toolchain extends the model's capabilities from text to visual interaction, including box drawing, screenshots, website reading, and image understanding.
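For the grounding part of that toolchain, a request might look like the sketch below, which asks for a bounding box around a UI element in a screenshot. The endpoint, model id, and the JSON format the prompt requests are illustrative assumptions, not documented behavior.

```python
import base64
import json
from openai import OpenAI

# Hedged grounding sketch: endpoint, model id, and the requested JSON
# schema are illustrative assumptions, not documented behavior.
client = OpenAI(api_key="YOUR_ZAI_KEY", base_url="https://api.z.ai/api/paas/v4/")

with open("screenshot.png", "rb") as f:
    img = base64.b64encode(f.read()).decode()

reply = client.chat.completions.create(
    model="glm-5v-turbo",  # assumed model id
    messages=[{
        "role": "user",
        "content": [
            {"type": "image_url",
             "image_url": {"url": f"data:image/png;base64,{img}"}},
            {"type": "text",
             "text": 'Return the bounding box of the "Submit" button as JSON: '
                     '{"x1": int, "y1": int, "x2": int, "y2": int}'},
        ],
    }],
)
box = json.loads(reply.choices[0].message.content)  # assumes clean JSON back
print(box)
```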
Claimed Benchmark Performance
According to Zhipu AI, GLM-5V-Turbo leads in most multimodal coding and tool usage benchmarks. The model reportedly scores well on:
- Design-to-code generation and visual code generation
- Multimodal search and visual exploration
- AndroidWorld and WebVoyager (real GUI navigation benchmarks)
- PinchBench, ClawEval, and ZClawBench (task execution quality)
Claude Opus 4.6 reportedly outperforms GLM-5V-Turbo on some benchmarks, including Flame-VLM-Code and OSWorld. In text-only coding tasks, the company claims no performance drop despite the added visual capabilities: the model maintains its strength across CC-Bench-V2 (backend, frontend, repo exploration) while outperforming its text-only predecessor GLM-5-Turbo and competitor Kimi K2.5 in several categories.
Important note: Independent evaluations are still pending. All performance claims come directly from Zhipu AI.
Use Cases
GLM-5V-Turbo targets specific workflows:
- Design-to-code: Converts design mockups into complete, runnable front-end projects with pixel-perfect visual consistency
- Autonomous GUI exploration: Paired with Claude Code or OpenClaw, the model can search websites independently, map page transitions, collect visual assets, and write code
- Debugging: Screenshots broken pages, identifies rendering issues (layout shifts, overlaps, color mismatches), and generates fixes
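A minimal sketch of that debugging loop might pair a headless-browser screenshot with a diagnosis request, as below. The local URL, endpoint, and model id are assumptions, and Playwright is just one way to capture the page.

```python
import base64
from openai import OpenAI
from playwright.sync_api import sync_playwright

# Hedged sketch of the screenshot-and-diagnose loop. The local URL, endpoint,
# and model id are assumptions; Playwright is one of several capture options.

def capture(url: str, path: str = "page.png") -> str:
    """Screenshot a rendered page with a headless browser."""
    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page(viewport={"width": 1280, "height": 800})
        page.goto(url)
        page.screenshot(path=path, full_page=True)
        browser.close()
    return path

client = OpenAI(api_key="YOUR_ZAI_KEY", base_url="https://api.z.ai/api/paas/v4/")

with open(capture("http://localhost:3000"), "rb") as f:
    img = base64.b64encode(f.read()).decode()

reply = client.chat.completions.create(
    model="glm-5v-turbo",  # assumed model id
    messages=[{
        "role": "user",
        "content": [
            {"type": "image_url",
             "image_url": {"url": f"data:image/png;base64,{img}"}},
            {"type": "text",
             "text": "List any layout shifts, overlapping elements, or color "
                     "mismatches on this page and propose CSS fixes."},
        ],
    }],
)
print(reply.choices[0].message.content)
```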
The model integrates with the OpenClaw agent framework and offers official skills via ClawHub, including image captioning, visual grounding, document writing, resume screening, and prompt generation.
Context: GLM-5 Lineage
GLM-5V-Turbo builds on Zhipu AI's recent releases. GLM-5-Turbo (text-only) launched for the OpenClaw ecosystem, improving tool calls and long task chain execution. Before that, GLM-5—an open-source 744-billion-parameter model under MIT license—launched in February. According to Zhipu, GLM-5 achieved 77.8% on SWE-bench Verified (compared to Claude Opus 4.5's 80.9%) and runs on Huawei chips alongside Nvidia GPUs, an advantage given US export restrictions on semiconductors to China.
What This Means
GLM-5V-Turbo marks a direct technical pivot toward vision-integrated code generation, eliminating the intermediate step of converting design visuals into text descriptions before coding. The model's integration into agent frameworks (Claude Code, OpenClaw) and API pricing that matches the text-only models signal Zhipu AI's confidence that the added visual capabilities do not degrade pure text performance. However, the performance claims remain unvalidated by independent benchmarking. The design-to-code capability targets a concrete workflow gap in front-end development, but real-world execution quality (pixel accuracy, responsive design handling) requires verification beyond company claims.