multimodal-models
4 articles tagged with multimodal-models
AI models guess instead of asking for help, ProactiveBench study shows
Researchers introduced ProactiveBench, a benchmark that tests whether multimodal language models ask for help when visual information is missing. Of the 22 models tested—including GPT-4.1, GPT-5.2, and o4-mini—almost none proactively requested clarification; instead they hallucinated answers or refused to respond. A reinforcement learning approach showed that models can be trained to ask for help, improving performance from 17.5% to 37-38%, though significant gaps remain.
Zhipu AI releases GLM-5V-Turbo: multimodal model generates front-end code from design mockups
Zhipu AI released GLM-5V-Turbo, a multimodal coding model that converts design mockups directly into executable front-end code. The model processes images, video, and text with a 200,000-token context window and a 128,000-token maximum output, priced at $1.20 per million input tokens and $4 per million output tokens.
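Those per-million-token prices make request costs straightforward to estimate. A minimal sketch, using only the rates quoted above (the function name and the sample token counts are illustrative, not from Zhipu AI):

```python
def request_cost_usd(input_tokens: int, output_tokens: int) -> float:
    """Estimate a GLM-5V-Turbo request cost from the listed prices.

    Prices are taken from the article: $1.20 per million input tokens,
    $4 per million output tokens. Illustrative helper, not an official API.
    """
    INPUT_PRICE_PER_M = 1.20
    OUTPUT_PRICE_PER_M = 4.00
    return (input_tokens / 1_000_000) * INPUT_PRICE_PER_M + \
           (output_tokens / 1_000_000) * OUTPUT_PRICE_PER_M

# Hypothetical full-context request: 200K tokens of mockup images and
# spec text in, 100K tokens of generated front-end code out.
print(round(request_cost_usd(200_000, 100_000), 2))  # 0.24 + 0.40 = 0.64
```

At these rates, even a request that fills the entire context window stays well under a dollar; output tokens dominate the bill once generations grow large.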
Google DeepMind releases Gemma 4 family with 256K context window and multimodal capabilities
Google DeepMind released the Gemma 4 family of open-weights models in four sizes (2.3B to 31B parameters) with multimodal support for text, images, video, and audio. The flagship 31B model achieves 85.2% on MMLU Pro and 89.2% on AIME 2024, with context windows up to 256K tokens. All models feature configurable reasoning modes and are optimized for deployment from mobile devices to servers under the Apache 2.0 license.
Amazon Bedrock adds three video analysis workflows for multimodal understanding at scale
Amazon Bedrock has introduced three distinct video analysis workflows that leverage multimodal foundation models to extract insights from video content at scale. The approaches—frame-based, shot-based, and multimodal embedding—are designed for different use cases and cost-performance trade-offs, with open-source reference implementations available on GitHub.
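In the frame-based workflow, the cost-performance trade-off comes down to how densely you sample frames before sending them to a foundation model: fewer frames mean fewer (and cheaper) model calls, but coarser temporal coverage. A minimal sketch of that sampling step (the function name and parameters are hypothetical, not part of the Bedrock reference implementations):

```python
def sample_frame_indices(total_frames: int, video_fps: float,
                         target_fps: float) -> list[int]:
    """Pick evenly spaced frame indices to sample a video at target_fps.

    Illustrative helper for a frame-based analysis workflow: each
    returned index would be extracted as an image and passed to a
    multimodal model. Lowering target_fps cuts model-call cost at the
    expense of temporal resolution.
    """
    if video_fps <= 0 or target_fps <= 0:
        raise ValueError("frame rates must be positive")
    step = video_fps / target_fps  # source frames between samples
    indices, position = [], 0.0
    while position < total_frames:
        indices.append(int(position))
        position += step
    return indices

# Example: a 10-second clip at 30 fps, sampled at 1 frame per second
frames = sample_frame_indices(total_frames=300, video_fps=30.0, target_fps=1.0)
print(len(frames), frames[:3])  # 10 [0, 30, 60]
```

A shot-based workflow would replace the fixed-rate step with shot-boundary detection, sampling one representative frame per shot instead of per interval.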