vision-language

8 articles tagged with vision-language

April 28, 2026
model release · NVIDIA

NVIDIA Releases Nemotron 3 Nano Omni: 30B-A3B Multimodal Model With 100+ Page Document Support

NVIDIA released Nemotron 3 Nano Omni, a 30B-A3B Mixture-of-Experts model (roughly 3B of its 30B parameters active per token) that processes text, images, video, and audio. The model uses a hybrid Mamba-Transformer architecture with 128 experts and achieves 65.8 on OCRBenchV2-En and 72.2 on Video-MME, while delivering up to 9x higher throughput on multimodal tasks than comparable models.
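The sparse activation implied by "30B-A3B" comes from top-k expert routing: a small router scores all 128 experts for each token and only the best few actually run. A minimal PyTorch sketch of that routing pattern (the layer sizes and k are illustrative, not Nemotron's configuration, and the per-token loop is written for clarity rather than speed):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoE(nn.Module):
    """Toy Mixture-of-Experts layer: route each token to k of n experts."""
    def __init__(self, dim=64, n_experts=128, k=8):
        super().__init__()
        self.router = nn.Linear(dim, n_experts)
        self.experts = nn.ModuleList([nn.Linear(dim, dim) for _ in range(n_experts)])
        self.k = k

    def forward(self, x):  # x: (tokens, dim)
        scores = self.router(x)                     # (tokens, n_experts)
        weights, idx = scores.topk(self.k, dim=-1)  # keep the k best experts per token
        weights = F.softmax(weights, dim=-1)
        out = torch.zeros_like(x)
        for t in range(x.shape[0]):
            for j in range(self.k):
                out[t] += weights[t, j] * self.experts[idx[t, j]](x[t])
        return out  # only k/n_experts of the expert weights are touched per token

moe = TopKMoE()
print(moe(torch.randn(4, 64)).shape)  # torch.Size([4, 64])
```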

April 22, 2026
model release

Gemma 4 VLA runs locally on NVIDIA Jetson Orin Nano Super (8GB RAM) with autonomous webcam tool-calling

NVIDIA engineer Asier Arranz demonstrated Gemma 4 running as a vision-language agent (VLA) on a Jetson Orin Nano Super with 8GB RAM. The model autonomously decides when to access a webcam based on user queries, with no hardcoded triggers—performing speech-to-text, vision analysis, and text-to-speech entirely locally.
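The behavior described, with the model rather than hardcoded logic deciding when to look, is the standard function-calling pattern: the webcam is exposed as a tool, and the model emits a call only when the query needs vision. A hedged sketch against a hypothetical local OpenAI-compatible server (the base URL, model id, and tool name are placeholders, not the demo's actual stack):

```python
from openai import OpenAI

# Placeholder local OpenAI-compatible endpoint (e.g. llama.cpp or Ollama serving
# the model on the Jetson); not the setup used in the demo.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")

tools = [{
    "type": "function",
    "function": {
        "name": "capture_webcam",  # hypothetical tool name
        "description": "Capture a webcam frame when the user asks about "
                       "the physical surroundings.",
        "parameters": {"type": "object", "properties": {}},
    },
}]

resp = client.chat.completions.create(
    model="gemma-4",  # placeholder model id
    messages=[{"role": "user", "content": "What am I holding right now?"}],
    tools=tools,
)

# The model itself, not a hardcoded trigger, decides whether the camera is needed.
if resp.choices[0].message.tool_calls:
    print("Model requested a webcam frame")
```

A text-only question ("what's the capital of France?") should come back with no tool call at all, which is exactly the no-hardcoded-triggers behavior the demo highlights.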

April 11, 2026
benchmark

AI models guess instead of asking for help, ProactiveBench study shows

Researchers introduced ProactiveBench, a benchmark testing whether multimodal language models ask for help when visual information is missing. Of the 22 models tested, including GPT-4.1, GPT-5.2, and o4-mini, almost none proactively request clarification, instead hallucinating answers or refusing to respond. A reinforcement learning approach showed models can be trained to ask for help, improving performance from 17.5% to 37-38%, though significant gaps remain.
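In the spirit of the benchmark (this is not the paper's actual metric), a scorer for this setting credits a model only when it asks for the missing information instead of answering anyway. A toy heuristic version:

```python
# Toy proactivity scorer: given queries whose images lack the needed detail,
# count responses that request clarification rather than guess. Markers and
# examples are made up for illustration, not taken from ProactiveBench.
CLARIFY_MARKERS = ("could you", "can you share", "please provide",
                   "i can't see", "more information")

def is_clarification(response: str) -> bool:
    r = response.lower()
    return r.endswith("?") or any(m in r for m in CLARIFY_MARKERS)

responses = [
    "The price tag reads $4.99.",                              # guesses: tag not visible
    "I can't see the price tag clearly; could you zoom in?",   # proactive
]
score = sum(is_clarification(r) for r in responses) / len(responses)
print(f"proactive rate: {score:.0%}")  # 50%
```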

April 3, 2026
model release · Zhipu AI

Zhipu AI releases GLM-5V-Turbo: multimodal model generates front-end code from design mockups

Zhipu AI released GLM-5V-Turbo, a multimodal coding model that converts design mockups directly into executable front-end code. The model processes images, video, and text with a 200,000-token context window and 128,000-token max output, priced at $1.20 per million input tokens and $4 per million output tokens.
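Driving a mockup-to-code model like this is a single vision chat call. A hedged sketch assuming an OpenAI-compatible endpoint (the base URL and model id are guesses; check Zhipu's documentation for the real values):

```python
import base64
from openai import OpenAI

# Assumed OpenAI-compatible endpoint and model id; verify against Zhipu's docs.
client = OpenAI(base_url="https://open.bigmodel.cn/api/paas/v4/",
                api_key="YOUR_KEY")

with open("mockup.png", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode()

resp = client.chat.completions.create(
    model="glm-5v-turbo",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text",
             "text": "Generate the HTML/CSS for this design mockup."},
            {"type": "image_url",
             "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
        ],
    }],
    max_tokens=8192,  # front-end code runs long; well under the 128K output cap
)
print(resp.choices[0].message.content)
```

At the listed prices, a request with 4,000 input tokens and 8,000 output tokens would cost about $0.005 + $0.032 ≈ $0.037.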

April 1, 2026
model release

UAE's Technology Innovation Institute releases Falcon Perception: 0.6B early-fusion model for open-vocabulary grounding

The Technology Innovation Institute (TII) has released Falcon Perception, a 0.6B-parameter early-fusion Transformer that combines image patches and text in a single sequence for open-vocabulary object grounding and segmentation. The model achieves 68.0 Macro-F1 on SA-Co (vs. 62.3 for SAM 3) and introduces PBench, a diagnostic benchmark that isolates performance across five capability levels. TII also released Falcon OCR, a 0.3B model reaching 80.3 on olmOCR and 88.6 on OmniDocBench.
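Early fusion means there is no separate vision encoder feeding a language model: image patches are embedded and concatenated with text tokens into one sequence that a single Transformer attends over. An illustrative PyTorch sketch (all dimensions are made up, not Falcon Perception's):

```python
import torch
import torch.nn as nn

dim, patch, vocab = 256, 16, 32000

patchify  = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)  # image -> patch embeddings
embed_txt = nn.Embedding(vocab, dim)
encoder   = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True), num_layers=2)

image  = torch.randn(1, 3, 224, 224)
prompt = torch.randint(0, vocab, (1, 12))             # e.g. "segment every red cup"

img_seq = patchify(image).flatten(2).transpose(1, 2)  # (1, 196, dim): 14x14 patches
txt_seq = embed_txt(prompt)                           # (1, 12, dim)
fused   = torch.cat([img_seq, txt_seq], dim=1)        # one (1, 208, dim) sequence

# Every layer attends jointly over pixels and words; grounding heads would
# read per-patch outputs from this shared sequence.
print(encoder(fused).shape)  # torch.Size([1, 208, 256])
```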

March 25, 2026
model release

AI2 releases MolmoWeb, open web agent matching proprietary systems with 8B parameters

The Allen Institute for AI has released MolmoWeb, a fully open web agent that operates websites using only screenshots without access to source code. The 8B-parameter model achieves 78.2% success on WebVoyager—nearly matching OpenAI's o3 at 79.3%—while being trained on one of the largest public web task datasets ever released.
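A screenshot-only agent reduces to a simple loop: capture pixels, ask the model for the next action, execute it with mouse and keyboard primitives. A skeleton using Playwright, with the model call stubbed out (this is not MolmoWeb's actual interface):

```python
from playwright.sync_api import sync_playwright

def ask_model(screenshot_png: bytes, goal: str) -> dict:
    """Placeholder for the agent model: it would look at the raw screenshot
    and return the next action. Stubbed with a fixed click for illustration."""
    return {"action": "click", "x": 640, "y": 320}

with sync_playwright() as p:
    page = p.chromium.launch().new_page()
    page.goto("https://example.com")
    for _ in range(10):  # cap the episode length
        # The agent never sees the DOM or source code, only pixels.
        action = ask_model(page.screenshot(), goal="find the pricing page")
        if action["action"] == "click":
            page.mouse.click(action["x"], action["y"])
        elif action["action"] == "done":
            break
```

Working from screenshots alone is what makes the WebVoyager comparison meaningful: the agent cannot lean on HTML structure the way DOM-based agents do.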

model release

Reka releases Reka Edge, a 7B multimodal model for efficient image and video understanding

Reka has released Reka Edge, a 7-billion-parameter multimodal model designed for efficient image and video understanding. The model features a 16,384-token context window and is priced at $0.20 per million tokens for both input and output.

March 2, 2026
model release

Alibaba releases Qwen3.5-4B, a 4B multimodal model for vision and text tasks

Alibaba's Qwen team has released Qwen3.5-4B, a 4 billion parameter multimodal model capable of processing both images and text. The model is available on Hugging Face under an Apache 2.0 license, making it freely available for commercial and research use.
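Since the weights are on Hugging Face under Apache 2.0, trying the model locally should be a few lines with the transformers image-text-to-text pipeline. A hedged sketch (the repo id is a guess at the naming pattern; check the model card for the exact id and any extra dependencies):

```python
from transformers import pipeline

# Repo id assumed from Qwen's naming convention; verify on Hugging Face.
pipe = pipeline("image-text-to-text", model="Qwen/Qwen3.5-4B")

messages = [{
    "role": "user",
    "content": [
        {"type": "image", "url": "https://example.com/chart.png"},
        {"type": "text", "text": "Summarize this chart."},
    ],
}]
out = pipe(text=messages, max_new_tokens=128)
print(out[0]["generated_text"])
```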