multimodal

18 articles tagged with multimodal

March 24, 2026

Google DeepMind's Gemini 3.1 Flash-Lite generates websites in real time, 2.5x faster than predecessor

Google DeepMind released Gemini 3.1 Flash-Lite, a model that generates functional websites in real time through a new pseudo-browser demo. The model delivers its first response token 2.5 times faster than Gemini 2.5 Flash and sustains over 360 output tokens per second, though output pricing has more than tripled, from $0.40 to $1.50 per million tokens.
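The quoted figures lend themselves to a quick back-of-envelope check. This sketch (plain Python; the 5,000-token website size is a made-up workload, not from the announcement) compares per-generation cost under the old and new output prices and estimates streaming time at the stated throughput.

```python
# Figures from the announcement: output pricing rose from $0.40 to $1.50
# per million tokens, and throughput exceeds 360 tokens per second.
OLD_OUTPUT_PRICE = 0.40   # $ per 1M output tokens (Gemini 2.5 Flash)
NEW_OUTPUT_PRICE = 1.50   # $ per 1M output tokens (Gemini 3.1 Flash-Lite)
TOKENS_PER_SECOND = 360

def output_cost(tokens: int, price_per_million: float) -> float:
    """Dollar cost of generating `tokens` output tokens."""
    return tokens / 1_000_000 * price_per_million

def generation_seconds(tokens: int, tps: int = TOKENS_PER_SECOND) -> float:
    """Approximate wall-clock time to stream `tokens` at a given throughput."""
    return tokens / tps

# Hypothetical 5,000-token generated website:
tokens = 5_000
print(f"old cost: ${output_cost(tokens, OLD_OUTPUT_PRICE):.4f}")  # $0.0020
print(f"new cost: ${output_cost(tokens, NEW_OUTPUT_PRICE):.4f}")  # $0.0075
print(f"time:     {generation_seconds(tokens):.1f}s")             # 13.9s
```

Even at the higher rate, a single website generation stays under a cent; the price change matters mainly at high volume.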

model release · Stability AI

Stable Video 4D 2.0 generates 4D assets from single videos with improved quality

Stability AI has released Stable Video 4D 2.0 (SV4D 2.0), an upgraded version of its multi-view video diffusion model designed to generate 4D assets from single object-centric videos. The update claims to deliver higher-quality outputs on real-world video footage.

March 23, 2026
model release · Microsoft

Microsoft's superintelligence team releases MAI-Image-2, ranks third in text-to-image generation

Microsoft's superintelligence team, led by Mustafa Suleyman, has released MAI-Image-2, a text-to-image generator that currently ranks third on the Arena.ai leaderboard for text-to-image models, behind OpenAI's GPT-Image-1.5 and Google's Nano Banana 2. The model is now available for testing in the MAI Playground and will roll out to Copilot and Bing Image Creator, with API access opening to all developers through Microsoft Foundry.

model release · NVIDIA

NVIDIA releases Nemotron 3 Content Safety 4B for multimodal, multilingual moderation

NVIDIA released Nemotron 3 Content Safety 4B, an open-source multimodal safety model designed to moderate content across text, images, and multiple languages. Built on Gemma-3 4B-IT with a 128K context window, the model achieved 84% average accuracy on multimodal safety benchmarks and supports over 140 languages through culturally aware training data.

model release · Xiaomi

Xiaomi launches MiMo-V2-Pro with 1T parameters, matches Claude Opus on coding at 80% lower cost

Xiaomi shipped three AI models simultaneously, designed to form a complete agent platform. MiMo-V2-Pro, a 1-trillion-parameter Mixture-of-Experts model with 42 billion active parameters per request, scores 78% on SWE-bench Verified and 81 points on ClawEval, nearly matching Claude Opus 4.6 while costing $1 per million input tokens versus $5 for Opus.
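The "80% lower cost" in the headline follows directly from the quoted prices. A quick check, using only figures from the article:

```python
# Sanity-checking the headline cost claim from the article's figures:
# $1 per million input tokens for MiMo-V2-Pro versus $5 for Claude Opus 4.6.
MIMO_INPUT = 1.0   # $ per 1M input tokens
OPUS_INPUT = 5.0   # $ per 1M input tokens, as quoted

savings = (OPUS_INPUT - MIMO_INPUT) / OPUS_INPUT
print(f"input-cost savings: {savings:.0%}")  # 80%

# Active-parameter fraction of the Mixture-of-Experts design:
TOTAL_PARAMS = 1_000e9   # 1 trillion total parameters
ACTIVE_PARAMS = 42e9     # 42 billion active per request
print(f"active fraction: {ACTIVE_PARAMS / TOTAL_PARAMS:.1%}")  # 4.2%
```

The MoE design is what makes the pricing plausible: only about 4% of the parameters are exercised per request.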

March 18, 2026
model release · OpenAI

OpenAI releases GPT-4o mini with 128K context at $0.15/$0.60 per 1M tokens

OpenAI released GPT-4o mini on July 18, 2024, a compact multimodal model with a 128,000-token context window, priced at $0.15 per million input tokens and $0.60 per million output tokens. The model achieves 82% on MMLU and claims to rank higher than GPT-4 on chat preference leaderboards while costing 60% less than GPT-3.5 Turbo.
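The savings claim can be checked against GPT-3.5 Turbo's list prices. The sketch below assumes the gpt-3.5-turbo-0125 rates of $0.50 input / $1.50 output per million tokens; that figure is an assumption on our part, not stated in the article, so verify current pricing before relying on it.

```python
# GPT-4o mini prices are taken from the article; GPT-3.5 Turbo prices are
# an assumption (gpt-3.5-turbo-0125 list rates at the time of release).
MINI = {"input": 0.15, "output": 0.60}    # $ per 1M tokens
TURBO = {"input": 0.50, "output": 1.50}   # $ per 1M tokens (assumed)

for kind in ("input", "output"):
    saving = 1 - MINI[kind] / TURBO[kind]
    print(f"{kind}: {saving:.0%} cheaper")  # input: 70%, output: 60%
```

Under these assumed rates, the "60% less" figure holds on output tokens and is exceeded on input tokens.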

March 11, 2026
model release

Google's Gemini Embedding 2 unifies text, image, video, and audio in single vector space

Google has released Gemini Embedding 2, its first native multimodal embedding model that represents text, images, video, audio, and documents in a unified vector space. The model eliminates the need for separate embedding models across different modalities in AI pipelines.

product update · OpenAI

OpenAI plans to integrate Sora video generator directly into ChatGPT

OpenAI plans to integrate its Sora video generator as a built-in feature within ChatGPT, according to The Information. Currently available only on a standalone website and app, the integration would let users generate videos directly in the chatbot, similar to how image generation was added last year.

March 8, 2026
research

Meta research challenges multimodal training assumptions as text data scarcity looms

A Meta FAIR and New York University research team trained a multimodal AI model from scratch and identified that several widely-held assumptions about multimodal model architecture and training don't align with their empirical findings. The work addresses growing concerns about text data exhaustion in LLM training.

March 2, 2026
model release

Alibaba releases Qwen3.5-2B, a 2B-parameter multimodal model for image and text tasks

Alibaba has released Qwen3.5-2B, a 2-billion-parameter multimodal model capable of processing both images and text. The model is available on Hugging Face under the Apache 2.0 license and supports image-text-to-text tasks.

model release

Alibaba releases Qwen3.5-0.8B, a compact multimodal model for edge deployment

Alibaba's Qwen team has released Qwen3.5-0.8B, an 800-million-parameter multimodal model designed for resource-constrained environments. The model handles image-text-to-text tasks and is distributed under Apache 2.0 licensing, making it freely usable for commercial applications.

model release

Alibaba releases Qwen3.5-4B, a 4B multimodal model for vision and text tasks

Alibaba's Qwen team has released Qwen3.5-4B, a 4 billion parameter multimodal model capable of processing both images and text. The model is available on Hugging Face under an Apache 2.0 license, making it freely available for commercial and research use.

model release

Alibaba releases Qwen3.5-9B, a multimodal 9B parameter model

Alibaba has released Qwen3.5-9B, a 9-billion parameter multimodal language model capable of processing both images and text. The model is available under Apache 2.0 license on Hugging Face with transformer-compatible architecture.

March 1, 2026
model release

Alibaba releases Qwen3.5-35B-A3B-FP8, a quantized multimodal model for efficient deployment

Alibaba's Qwen team released Qwen3.5-35B-A3B-FP8 on Hugging Face, a quantized version of their 35-billion parameter multimodal model. The FP8 quantization reduces model size and memory requirements while maintaining the base model's image-text-to-text capabilities. The model is compatible with standard Transformers endpoints and Azure deployment.
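For a rough sense of what FP8 quantization buys, here is a minimal sketch assuming 1 byte per parameter for FP8 versus 2 bytes for BF16, counting weights only (activations, KV cache, and runtime overhead are excluded, so real memory needs will be higher).

```python
# Weight-memory estimate for a 35B-parameter model under two precisions.
# Assumption: 1 byte/param (FP8) vs 2 bytes/param (BF16); weights only.
PARAMS = 35e9  # 35 billion parameters

def weight_gb(params: float, bytes_per_param: int) -> float:
    """Approximate weight storage in gigabytes."""
    return params * bytes_per_param / 1e9

print(f"BF16: {weight_gb(PARAMS, 2):.0f} GB")  # 70 GB
print(f"FP8:  {weight_gb(PARAMS, 1):.0f} GB")  # 35 GB
```

Halving the weight footprint is what moves a model of this size from multi-GPU territory toward a single high-memory accelerator.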

February 26, 2026
model release

Alibaba releases Qwen3.5-35B-A3B-Base, a 35B multimodal model with Apache 2.0 license

Alibaba's Qwen team has released Qwen3.5-35B-A3B-Base, a 35-billion parameter multimodal model supporting image-text-to-text tasks. The model is available under the Apache 2.0 license and compatible with major inference endpoints including Azure deployment.

February 24, 2026
model release

Alibaba releases Qwen3.5-27B, a 27B multimodal model with Apache 2.0 license

Alibaba Qwen has released Qwen3.5-27B, a 27-billion parameter model capable of processing both images and text. The model is available under an Apache 2.0 open license and is compatible with standard transformer endpoints.

model release

Alibaba releases Qwen3.5-35B-A3B, a 35B multimodal model with Apache 2.0 license

Alibaba has released Qwen3.5-35B-A3B, a 35-billion parameter multimodal model capable of processing images and text. The model is published under an Apache 2.0 license and available on Hugging Face with Transformers and SafeTensors format support.

February 20, 2026
product update

Google's Gemini adds Lyria 3 music generation from text and images

Google has integrated Lyria 3, its music generation model, directly into the Gemini app. Users can now create custom 30-second music tracks from text descriptions and images without additional tools or subscriptions.