multimodal
18 articles tagged with multimodal
Google DeepMind's Gemini 3.1 Flash-Lite generates websites in real time, 2.5x faster than predecessor
Google DeepMind released Gemini 3.1 Flash-Lite, a model that generates functional websites in real time, showcased through a new pseudo-browser demo. The model delivers its first response token 2.5 times faster than Gemini 2.5 Flash and sustains over 360 tokens per second, though output pricing has risen from $0.40 to $1.50 per million tokens, a 3.75x increase.
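A quick check of the quoted price change, using only the figures above:

```python
# Ratio implied by the announced output-token prices.
old_price, new_price = 0.40, 1.50  # USD per 1M output tokens
print(f"{new_price / old_price:.2f}x")  # 3.75x
```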
Stable Video 4D 2.0 generates 4D assets from single videos with improved quality
Stability AI has released Stable Video 4D 2.0 (SV4D 2.0), an upgraded version of its multi-view video diffusion model designed to generate 4D assets from single object-centric videos. The update claims to deliver higher-quality outputs on real-world video footage.
Microsoft's superintelligence team releases MAI-Image-2, ranks third in text-to-image generation
Microsoft's superintelligence team, led by Mustafa Suleyman, has released MAI-Image-2, a text-to-image generator that currently ranks third on the Arena.ai leaderboard for text-to-image models, behind OpenAI's GPT-Image-1.5 and Google's Nano Banana 2. The model is now available for testing in the MAI Playground and will roll out to Copilot and Bing Image Creator, with API access opening to all developers through Microsoft Foundry.
NVIDIA releases Nemotron 3 Content Safety 4B for multimodal, multilingual moderation
NVIDIA released Nemotron 3 Content Safety 4B, an open-source multimodal safety model that moderates text and image content across multiple languages. Built on Gemma-3 4B-IT with a 128K context window, the model achieved 84% average accuracy on multimodal safety benchmarks and supports over 140 languages through culturally aware training data.
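As an illustration only, here is a minimal sketch of how an open, Hugging Face-hosted safety model of this kind might be invoked for text moderation; the repository id and prompt format below are assumptions, not NVIDIA's documented interface:

```python
# Hedged sketch: calling a Gemma-3-based safety classifier via Transformers.
# The model id and prompt format are hypothetical placeholders.
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "nvidia/nemotron-3-content-safety-4b"  # hypothetical repo name
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

prompt = "Classify this message as safe or unsafe: 'How do I reset my password?'"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=32)

# Decode only the newly generated tokens, skipping the echoed prompt.
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:],
                       skip_special_tokens=True))
```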
Xiaomi launches MiMo-V2-Pro with 1T parameters, matches Claude Opus on coding at 80% lower cost
Xiaomi simultaneously shipped three AI models designed to form a complete agent platform. MiMo-V2-Pro, a 1-trillion-parameter Mixture-of-Experts model that activates 42 billion parameters per request, scores 78% on SWE-bench Verified and 81 points on ClawEval, nearly matching Claude Opus 4.6, while costing $1 per million input tokens versus $5 for Opus.
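The headline cost figure follows directly from the quoted prices:

```python
# Input-token prices quoted in the announcement.
mimo_price, opus_price = 1.0, 5.0  # USD per 1M input tokens
savings = (opus_price - mimo_price) / opus_price
print(f"{savings:.0%} lower input cost than Opus")  # 80%
```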
OpenAI releases GPT-4o mini with 128K context at $0.15/$0.60 per 1M tokens
OpenAI released GPT-4o mini on July 18, 2024, a compact multimodal model with a 128,000-token context window, priced at $0.15 per million input tokens and $0.60 per million output tokens. The model achieves 82% on MMLU and reportedly ranks higher than GPT-4 on chat preference leaderboards while costing 60% less than GPT-3.5 Turbo.
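gpt-4o-mini is callable through the standard OpenAI Chat Completions API; a minimal sketch, with a rough cost estimate at the announced rates:

```python
# Minimal call to GPT-4o mini via the official OpenAI Python SDK.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user",
               "content": "Summarize multimodal AI in one sentence."}],
)
print(response.choices[0].message.content)

# Rough cost at the announced $0.15 / $0.60 per 1M tokens.
usage = response.usage
cost = usage.prompt_tokens * 0.15e-6 + usage.completion_tokens * 0.60e-6
print(f"approximate cost: ${cost:.6f}")
```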
Google's Gemini Embedding 2 unifies text, image, video, and audio in single vector space
Google has released Gemini Embedding 2, its first native multimodal embedding model that represents text, images, video, audio, and documents in a unified vector space. The model eliminates the need for separate embedding models across different modalities in AI pipelines.
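A unified vector space means embeddings from different modalities can be compared directly. The sketch below illustrates the idea with plain numpy and placeholder vectors rather than any real Gemini Embedding 2 API call, which this piece does not document:

```python
# Conceptual sketch: in a shared embedding space, one similarity metric
# ranks text, images, video, and audio against the same query vector.
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Similarity of two vectors in the shared embedding space."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Placeholder vectors standing in for a text and an image embedding
# produced by the same (hypothetical) multimodal model.
text_vec = np.random.default_rng(0).normal(size=768)
image_vec = np.random.default_rng(1).normal(size=768)

print(cosine_similarity(text_vec, image_vec))
```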
OpenAI plans to integrate Sora video generator directly into ChatGPT
OpenAI plans to integrate its Sora video generator as a built-in feature within ChatGPT, according to The Information. Currently available only on a standalone website and app, the integration would let users generate videos directly in the chatbot, similar to how image generation was added last year.
Meta research challenges multimodal training assumptions as text data scarcity looms
A Meta FAIR and New York University research team trained a multimodal AI model from scratch and found that several widely held assumptions about multimodal model architecture and training don't align with its empirical results. The work addresses growing concerns about the exhaustion of text data for LLM training.
Alibaba releases Qwen3.5-2B, a 2B-parameter multimodal model for image and text tasks
Alibaba has released Qwen3.5-2B, a 2-billion-parameter multimodal model capable of processing both images and text. The model is available on Hugging Face under the Apache 2.0 license and supports image-text-to-text tasks.
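A minimal sketch of how an image-text-to-text checkpoint like this is typically run with Hugging Face Transformers; the repository id and chat-template usage follow common Qwen-VL conventions but are assumptions about this release, and the same pattern would apply to the other Qwen3.5 sizes listed below:

```python
# Hedged sketch: generic vision-to-text inference with Transformers.
# The repo id and message format are assumed, not confirmed for Qwen3.5.
from transformers import AutoProcessor, AutoModelForVision2Seq
from PIL import Image

model_id = "Qwen/Qwen3.5-2B"  # as named in the announcement
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForVision2Seq.from_pretrained(model_id, device_map="auto")

image = Image.open("example.jpg")
messages = [{"role": "user",
             "content": [{"type": "image"},
                         {"type": "text", "text": "Describe this image."}]}]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)

inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=64)
print(processor.batch_decode(out, skip_special_tokens=True)[0])
```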
Alibaba releases Qwen3.5-0.8B, a compact multimodal model for edge deployment
Alibaba's Qwen team has released Qwen3.5-0.8B, an 800-million-parameter multimodal model designed for resource-constrained environments. The model handles image-text-to-text tasks and is distributed under Apache 2.0 licensing, making it freely usable for commercial applications.
Alibaba releases Qwen3.5-4B, a 4B multimodal model for vision and text tasks
Alibaba's Qwen team has released Qwen3.5-4B, a 4-billion-parameter multimodal model capable of processing both images and text. The model is available on Hugging Face under an Apache 2.0 license, making it free for commercial and research use.
Alibaba releases Qwen3.5-9B, a multimodal 9B parameter model
Alibaba has released Qwen3.5-9B, a 9-billion-parameter multimodal language model capable of processing both images and text. The model is available under the Apache 2.0 license on Hugging Face with a Transformers-compatible architecture.
Alibaba releases Qwen3.5-35B-A3B-FP8, a quantized multimodal model for efficient deployment
Alibaba's Qwen team released Qwen3.5-35B-A3B-FP8 on Hugging Face, a quantized version of their 35-billion parameter multimodal model. The FP8 quantization reduces model size and memory requirements while maintaining the base model's image-text-to-text capabilities. The model is compatible with standard Transformers endpoints and Azure deployment.
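The memory saving from FP8 is easy to approximate for the weights alone (activations, KV cache, and quantization scales excluded):

```python
# Parameter storage only: FP8 stores one byte per weight vs two for BF16.
params = 35e9
print(f"BF16 weights: ~{params * 2 / 1e9:.0f} GB")  # ~70 GB
print(f"FP8 weights:  ~{params * 1 / 1e9:.0f} GB")  # ~35 GB
```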
Alibaba releases Qwen3.5-35B-A3B-Base, a 35B multimodal base model with Apache 2.0 license
Alibaba's Qwen team has released Qwen3.5-35B-A3B-Base, a 35-billion parameter multimodal model supporting image-text-to-text tasks. The model is available under the Apache 2.0 license and compatible with major inference endpoints including Azure deployment.
Alibaba releases Qwen3.5-27B, a 27B multimodal model with Apache 2.0 license
Alibaba Qwen has released Qwen3.5-27B, a 27-billion-parameter model capable of processing both images and text. The model is available under an Apache 2.0 open license and is compatible with standard Transformers endpoints.
Alibaba releases Qwen3.5-35B-A3B, a 35B multimodal model with Apache 2.0 license
Alibaba has released Qwen3.5-35B-A3B, a 35-billion parameter multimodal model capable of processing images and text. The model is published under an Apache 2.0 license and available on Hugging Face with Transformers and SafeTensors format support.
Google's Gemini adds Lyria 3 music generation from text and images
Google has integrated Lyria 3, its music generation model, directly into the Gemini app. Users can now create custom 30-second music tracks from text descriptions and images without additional tools or subscriptions.