model release

Z.ai releases GLM-5V Turbo, native multimodal model for vision-based coding

TL;DR

Z.ai has released GLM-5V Turbo, a native multimodal foundation model designed for vision-based coding and agent-driven tasks. The model accepts image, video, and text inputs, offers a 202,752-token context window, and is priced at $1.20 per million input tokens and $4 per million output tokens.


GLM-5V-Turbo — Quick Specs

Context window: 200K tokens
Input: $1.20 / 1M tokens
Output: $4.00 / 1M tokens

Z.ai Launches GLM-5V Turbo, Native Multimodal Foundation Model

Z.ai has released GLM-5V Turbo, a multimodal foundation model positioned as the company's first native multimodal agent, capable of handling image, video, and text inputs simultaneously.

Model Specifications

GLM-5V Turbo offers a 202,752-token context window and is positioned for vision-heavy applications, including coding tasks and autonomous agent workflows. Pricing is split by token direction: $1.20 per million input tokens and $4 per million output tokens.
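
As a quick illustration of that pricing, the sketch below estimates the cost of a single request from token counts, using only the per-million-token rates listed above.

    # Rough per-request cost estimate at GLM-5V Turbo's listed rates.
    INPUT_PER_M = 1.20   # USD per 1M input tokens
    OUTPUT_PER_M = 4.00  # USD per 1M output tokens

    def estimate_cost(input_tokens: int, output_tokens: int) -> float:
        return (input_tokens / 1_000_000) * INPUT_PER_M + (output_tokens / 1_000_000) * OUTPUT_PER_M

    # Example: a 50K-token visual context that yields a 2K-token answer.
    print(f"${estimate_cost(50_000, 2_000):.4f}")  # prints $0.0680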

The model is available through OpenRouter and other provider infrastructure, with normalized request/response handling across multiple backend providers.
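
A minimal request sketch follows. It assumes OpenRouter's standard OpenAI-compatible chat completions endpoint and an image_url content part; the model slug z-ai/glm-5v-turbo is an assumption, not confirmed by the announcement.

    import os
    import requests

    # Hedged sketch of a multimodal request through OpenRouter's OpenAI-compatible
    # endpoint. The model slug "z-ai/glm-5v-turbo" is assumed, not confirmed.
    resp = requests.post(
        "https://openrouter.ai/api/v1/chat/completions",
        headers={"Authorization": f"Bearer {os.environ['OPENROUTER_API_KEY']}"},
        json={
            "model": "z-ai/glm-5v-turbo",  # assumed slug
            "messages": [{
                "role": "user",
                "content": [
                    {"type": "text", "text": "Describe the layout bug in this screenshot and suggest a CSS fix."},
                    {"type": "image_url", "image_url": {"url": "https://example.com/screenshot.png"}},
                ],
            }],
        },
        timeout=120,
    )
    print(resp.json()["choices"][0]["message"]["content"])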

Capabilities and Use Cases

According to Z.ai, the model excels at:

  • Long-horizon planning and sequential task execution
  • Complex coding tasks with visual context
  • Vision-based agent workflows operating in a "perceive → plan → execute" loop (sketched below)
  • Integration with autonomous agent systems for end-to-end task completion
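
The loop referenced in the third item can be sketched as a plain control structure. Everything below is an illustrative placeholder, not Z.ai's implementation: capture_screen, plan_step, and run_action stand in for real perception, model, and tool-execution calls.

    from dataclasses import dataclass

    @dataclass
    class Plan:
        next_action: str
        done: bool

    def capture_screen() -> bytes:
        return b""  # placeholder: grab a screenshot or video frame here

    def plan_step(goal: str, observation: bytes, history: list) -> Plan:
        # placeholder: send the goal, latest frame, and history to the multimodal model
        return Plan(next_action="noop", done=True)

    def run_action(action: str) -> str:
        return f"executed {action}"  # placeholder: click, type, edit a file, etc.

    def run_agent(goal: str, max_steps: int = 20) -> list:
        history: list = []
        for _ in range(max_steps):
            observation = capture_screen()                # perceive
            plan = plan_step(goal, observation, history)  # plan
            result = run_action(plan.next_action)         # execute
            history.append((plan.next_action, result))
            if plan.done:
                break
        return history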

The native multimodal architecture enables the model to process images and video frames directly without requiring separate preprocessing steps or external vision encoders.

Technical Details

GLM-5V Turbo is designed specifically for agent-driven applications. Through OpenRouter's infrastructure, developers can access the model's step-by-step reasoning via the reasoning parameter on requests and the reasoning_details array in API responses.
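
A hedged sketch of that flow is shown below, again assuming the OpenAI-compatible endpoint and an assumed model slug; the exact contents of reasoning_details entries vary by provider and are not specified in the announcement.

    import os
    import requests

    # Request step-by-step reasoning via OpenRouter's "reasoning" parameter and read
    # back the "reasoning_details" array. Model slug and detail shape are assumptions.
    resp = requests.post(
        "https://openrouter.ai/api/v1/chat/completions",
        headers={"Authorization": f"Bearer {os.environ['OPENROUTER_API_KEY']}"},
        json={
            "model": "z-ai/glm-5v-turbo",     # assumed slug
            "reasoning": {"effort": "high"},  # ask for an explicit reasoning trace
            "messages": [{"role": "user", "content": "Plan a refactor of this React component into hooks."}],
        },
        timeout=120,
    )
    message = resp.json()["choices"][0]["message"]
    for detail in message.get("reasoning_details", []):  # reasoning trace, when returned
        print(detail)
    print(message["content"])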

The release date is listed as April 1, 2026, though adoption and usage data remain limited at launch.

Positioning and Competition

The model enters a competitive multimodal landscape alongside Claude 3.5 Sonnet's vision capabilities, GPT-4o's multimodal integration, and other vision-capable models from major labs. Z.ai emphasizes agent-oriented design and native video handling as differentiation points.

What This Means

GLM-5V Turbo represents Z.ai's entry into the multimodal foundation model space with explicit focus on agent-driven workflows. The $1.20/$4 pricing sits in the mid-range for multimodal models, and the 202K context window supports longer visual sequences and planning horizons. Adoption will likely depend on developer experience with agent integration and real-world performance on complex vision-coding tasks relative to established competitors.

Related Articles

model release

Microsoft Releases Fara-7B: 7B Parameter Computer Use Agent Trained in 2.5 Days on 64 H100s

Microsoft Research has released Fara-7B, a 7-billion parameter small language model designed for computer automation tasks. The model, which took 2.5 days to train on 64 H100 GPUs, can navigate websites to complete tasks like booking restaurants and shopping, using screenshots as input with a 128K token context window.

model release

Baidu Releases Qianfan-OCR-Fast Model with 66K Context at $0.68 Per 1M Input Tokens

Baidu has released Qianfan-OCR-Fast, a multimodal model specialized for optical character recognition tasks. The model offers a 66,000 token context window and is priced at $0.68 per 1M input tokens and $2.81 per 1M output tokens.

model release

Perceptron Launches Mk1 Vision-Language Model with Video Reasoning at $0.15/$1.50 per 1M Tokens

Perceptron has released Perceptron Mk1, a vision-language model designed for video understanding and embodied reasoning tasks. The model accepts image and video inputs with a 33K context window, is priced at $0.15 per 1M input tokens and $1.50 per 1M output tokens, and supports structured spatial annotations on demand.

model release

Google DeepMind Releases Gemma 4 E4B with Multi-Token Prediction for 2x Faster Inference

Google DeepMind released the Gemma 4 E4B assistant model using Multi-Token Prediction (MTP) architecture that accelerates inference by up to 2x through speculative decoding. The 4.5B effective parameter model supports 128K context windows and handles text, image, and audio input with pricing not yet disclosed.
