vision

8 articles tagged with vision

June 9, 2026

Google DeepMind releases Gemma 4 12B: encoder-free multimodal model runs on 16GB RAM

Google DeepMind has released Gemma 4 12B, a 12-billion parameter multimodal model that runs locally on laptops with 16GB of RAM. The model eliminates separate vision and audio encoders and processes raw inputs directly through its language model backbone. It is released under an Apache 2.0 license.

June 9, 2026 · 2:21 PM

June 3, 2026

model release · Google DeepMind

Google DeepMind releases Gemma 4 12B Unified: encoder-free multimodal model with 256K context window

Google DeepMind has released Gemma 4 12B Unified, an encoder-free multimodal model that processes text, images, and audio through a single decoder-only transformer. The model has 11.95 billion parameters and a 256K-token context window, and scores 77.2% on MMLU Pro and 72.0% on LiveCodeBench v6.

June 3, 2026 · 5:51 PM

April 28, 2026

model release · Xiaomi

Xiaomi releases MiMo-V2.5: 310B parameter omnimodal model with 1M token context window

Xiaomi released MiMo-V2.5, a 310B total parameter sparse mixture-of-experts model that activates 15B parameters per token. The omnimodal model supports text, image, video, and audio understanding with a 1M token context window and was trained on 48T tokens using FP8 mixed precision.

April 28, 2026 · 1:06 AM

April 16, 2026

model release · Anthropic

Anthropic releases Claude Opus 4.7 with improved coding and vision, confirms it trails unreleased Mythos model

Anthropic released Claude Opus 4.7 with improved coding capabilities, higher-resolution vision, and a new reasoning level. The company publicly acknowledged the model underperforms its unreleased Mythos system, which remains restricted due to safety concerns.

April 16, 2026 · 4:36 PM

April 6, 2026

model release · Google DeepMind

Google DeepMind releases Gemma 4 family: multimodal models from 2.3B to 31B parameters with 256K context

Google DeepMind released the Gemma 4 family of open-weights multimodal models in four sizes: E2B (2.3B effective parameters), E4B (4.5B effective), 26B A4B (3.8B active parameters), and 31B dense. All models support text and image input with 128K-256K context windows; E2B and E4B add native audio capabilities. Models feature reasoning modes, function calling, and multilingual support across 140+ languages.

April 6, 2026 · 9:05 AM

April 2, 2026

model release · Google DeepMind

Google DeepMind releases Gemma 4 open models with multimodal capabilities and 256K context window

Google DeepMind released the Gemma 4 family of open-source models with multimodal capabilities (text, image, audio, video) and context windows up to 256K tokens. Four model sizes are available under the Apache 2.0 license, each in instruction-tuned and pre-trained variants: E2B (2.3B effective parameters), E4B (4.5B effective), 26B A4B (3.8B active), and 31B.

April 2, 2026 · 7:05 PM

model release

Google releases Gemma 4 family with 31B model, 256K context, multimodal capabilities

Google DeepMind released the Gemma 4 family of open-weights models ranging from 2.3B to 31B parameters, featuring up to 256K token context windows and native support for text, image, video, and audio inputs. The flagship 31B model scores 85.2% on MMLU Pro and 89.2% on AIME 2026. A smaller 26B MoE variant uses only 3.8B active parameters for faster inference.

April 2, 2026 · 5:05 PM

April 1, 2026

model release

Z.ai releases GLM-5V Turbo, native multimodal model for vision-based coding

Z.ai has released GLM-5V Turbo, a native multimodal foundation model designed for vision-based coding and agent-driven tasks. The model supports image, video, and text inputs with a 202,752-token context window and is priced at $1.20 per million input tokens and $4 per million output tokens.

April 1, 2026 · 5:20 PM
