multimodal
50 articles tagged with multimodal
Google releases Gemini 3.1 Flash Lite with 1M context at $0.25 per million input tokens
Google has released Gemini 3.1 Flash Lite, a high-efficiency multimodal model with a 1,048,576 token context window priced at $0.25 per million input tokens and $1.50 per million output tokens. The model supports text, image, video, audio, and PDF inputs with four thinking levels for cost-performance optimization.
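At flat per-million rates like these, per-request cost is simple arithmetic. A minimal sketch, using the rates quoted above and made-up token counts for illustration:

```python
# Cost sketch for Gemini 3.1 Flash Lite at the listed rates.
# Rates come from the summary above; the example token counts are hypothetical.
INPUT_RATE_PER_M = 0.25   # USD per 1M input tokens
OUTPUT_RATE_PER_M = 1.50  # USD per 1M output tokens

def request_cost(input_tokens: int, output_tokens: int) -> float:
    """Return the USD cost of one request at flat per-million-token rates."""
    return (input_tokens / 1_000_000) * INPUT_RATE_PER_M + \
           (output_tokens / 1_000_000) * OUTPUT_RATE_PER_M

# Example: an 800K-token prompt (e.g. a long PDF plus video frames) with a 4K-token answer.
print(f"${request_cost(800_000, 4_000):.4f}")  # -> $0.2060
```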
Google DeepMind Releases Gemma 4 26B A4B Assistant Model for 2x Faster Inference via Multi-Token Prediction
Google DeepMind has released a Multi-Token Prediction assistant model for Gemma 4 26B A4B that achieves up to 2x decoding speedup through speculative decoding. The model uses 3.8B active parameters from a 25.2B total parameter MoE architecture with 128 experts and a 256K token context window.
Google DeepMind releases Gemma 4 with 31B dense model, 256K context window, and speculative decoding drafters
Google DeepMind has released Gemma 4, a family of open-weight multimodal models including a 31B dense model with 256K context window and four size variants ranging from 2.3B to 30.7B effective parameters. The release includes Multi-Token Prediction (MTP) draft models that achieve up to 2x decoding speedup through speculative decoding while maintaining identical output quality.
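Speculative decoding is the mechanism behind the claimed speedup: a cheap drafter proposes several tokens per step, the full model verifies them in one pass, and only the prefix the target agrees with is kept, so greedy outputs stay identical to standard decoding. The toy sketch below illustrates the verification loop under stated assumptions; `draft_next` and `target_next` are hypothetical stand-ins for the MTP drafter and Gemma 4, not Google's implementation.

```python
from typing import Callable, List

def speculative_decode(
    prompt: List[int],
    draft_next: Callable[[List[int]], int],   # hypothetical: drafter's greedy next token
    target_next: Callable[[List[int]], int],  # hypothetical: target model's greedy next token
    k: int = 4,
    max_new_tokens: int = 64,
    eos: int = 0,
) -> List[int]:
    """Greedy speculative decoding: the drafter proposes k tokens, the target
    verifies them; the accepted prefix plus one target token is appended each
    round, so the result matches plain greedy decoding with the target alone."""
    out = list(prompt)
    while len(out) - len(prompt) < max_new_tokens:
        # 1. Drafter proposes k tokens autoregressively (cheap to run).
        draft, ctx = [], list(out)
        for _ in range(k):
            t = draft_next(ctx)
            draft.append(t)
            ctx.append(t)
        # 2. Target checks each proposal; in a real system the k+1 target
        #    predictions come from a single batched forward pass.
        accepted = []
        for t in draft:
            expected = target_next(out + accepted)
            if expected != t:
                accepted.append(expected)  # replace the first mismatch with the target's token
                break
            accepted.append(t)
        else:
            accepted.append(target_next(out + accepted))  # bonus token when all drafts match
        out.extend(accepted)
        if eos in accepted:
            break
    return out
```

The speedup comes from the verification step: when the drafter is usually right, each target pass commits several tokens instead of one.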
NVIDIA releases Nemotron-3-Nano-Omni-30B, a 31B-parameter multimodal model with 256K context and reasoning mode
NVIDIA released Nemotron-3-Nano-Omni-30B-A3B, a multimodal large language model with 31 billion parameters that processes video, audio, images, and text with up to 256K token context. The model uses a Mamba2-Transformer hybrid Mixture of Experts architecture and supports chain-of-thought reasoning mode.
xAI releases Grok 4.3 reasoning model with 1M token context at $1.25/M input tokens
xAI has released Grok 4.3, a reasoning model with a 1 million token context window and no output token limit. The model accepts text and image inputs, has always-on reasoning that cannot be disabled, and uses tiered pricing starting at $1.25 per million input tokens and $2.50 per million output tokens.
Mistral Releases Medium 3.5: 128B Dense Model With 256k Context and Configurable Reasoning
Mistral AI released Mistral Medium 3.5, a 128B parameter dense model with a 256k context window that unifies instruction-following, reasoning, and coding capabilities. The model features configurable reasoning effort per request and a vision encoder trained from scratch for variable image sizes.
NVIDIA Releases Nemotron 3 Nano Omni: 31B Multimodal Model With 256K Context and Reasoning Mode
NVIDIA released Nemotron 3 Nano Omni, a 31B-parameter multimodal model (30B-A3B, roughly 3B parameters active per token) supporting video, audio, image, and text inputs. The model features a 256K token context window, reasoning mode with chain-of-thought, and tool calling capabilities.
NVIDIA Releases Nemotron 3 Nano Omni: 31B-Parameter Multimodal Model with 256K Context and Reasoning Mode
NVIDIA has released Nemotron 3 Nano Omni 30B-A3B, a multimodal large language model with 31 billion parameters using a Mamba2-Transformer hybrid Mixture of Experts architecture. The model supports video, audio, image, and text inputs with a 256K token context window and includes a dedicated reasoning mode with chain-of-thought capabilities.
NVIDIA Nemotron 3 Nano Omni: 30B-parameter multimodal model launches on AWS SageMaker with 131K token context
NVIDIA has launched Nemotron 3 Nano Omni on Amazon SageMaker JumpStart, a multimodal model with 30 billion total parameters (3 billion active) that processes video, audio, images, and text in a single inference pass. The model features a 131K token context window and uses a Mamba2 Transformer Hybrid MoE architecture combining three specialized encoders.
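Deploying a JumpStart model is typically a two-call affair: construct the model by ID, then deploy it to an endpoint. The sketch below uses the real `sagemaker.jumpstart` API, but the model ID, instance type, and payload shape are illustrative assumptions, not values from NVIDIA's or AWS's documentation.

```python
# Hedged sketch: deploy a JumpStart model and send it a prompt.
# The model_id, instance type, and payload schema are guesses for illustration.
from sagemaker.jumpstart.model import JumpStartModel

model = JumpStartModel(model_id="nvidia-nemotron-3-nano-omni")  # hypothetical ID

predictor = model.deploy(
    initial_instance_count=1,
    instance_type="ml.g6e.12xlarge",  # assumed GPU instance; check the JumpStart default
    accept_eula=True,
)

response = predictor.predict({
    "inputs": "Summarize the attached meeting recording.",  # payload shape is assumed
    "parameters": {"max_new_tokens": 512},
})
print(response)

predictor.delete_endpoint()  # tear down the endpoint to stop billing
```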
Nvidia releases Nemotron 3 Nano Omni: 30B-parameter multimodal model with 256K context, free on OpenRouter
Nvidia has released Nemotron 3 Nano Omni, a 30-billion-parameter multimodal model available free on OpenRouter. The model features a 256,000-token context window, accepts text, image, video, and audio inputs, and claims 2× higher throughput for video reasoning compared to separate vision and speech pipelines.
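Because OpenRouter exposes models through an OpenAI-compatible endpoint, a free listing can be tried with the standard `openai` client pointed at `https://openrouter.ai/api/v1`. The model slug below is a guess at how the listing might be named (the `:free` suffix follows OpenRouter's usual convention); check the catalog for the real identifier.

```python
# Hedged sketch: call a free OpenRouter listing with the OpenAI SDK.
# The model slug is an assumption, not confirmed by the article.
import os
from openai import OpenAI

client = OpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key=os.environ["OPENROUTER_API_KEY"],
)

completion = client.chat.completions.create(
    model="nvidia/nemotron-3-nano-omni-30b:free",  # hypothetical slug
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "What is happening in this image?"},
            {"type": "image_url", "image_url": {"url": "https://example.com/frame.jpg"}},
        ],
    }],
)
print(completion.choices[0].message.content)
```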
NVIDIA Releases Nemotron 3 Nano Omni: 30B-A3B Multimodal Model With 100+ Page Document Support
NVIDIA released Nemotron 3 Nano Omni, a 30B-A3B Mixture-of-Experts model that processes text, images, video, and audio. The model uses a hybrid Mamba-Transformer architecture with 128 experts and achieves 65.8 on OCRBenchV2-En and 72.2 on Video-MME, while delivering up to 9x higher throughput on multimodal tasks compared to alternatives.
Xiaomi releases MiMo-V2.5: 310B parameter omnimodal model with 1M token context window
Xiaomi released MiMo-V2.5, a 310B total parameter sparse mixture-of-experts model that activates 15B parameters per token. The omnimodal model supports text, image, video, and audio understanding with a 1M token context window and was trained on 48T tokens using FP8 mixed precision.
ChatGPT Images 2.0 scores 97% in head-to-head image generation benchmark against Google's Gemini Nano Banana at 85%
OpenAI's ChatGPT Images 2.0 scored 97% versus Google's Gemini Nano Banana at 85% in a nine-test image generation benchmark conducted by ZDNET. The tests measured capabilities including image restoration, text rendering, and prompt adherence, with Nano Banana losing points primarily for fabricated details and text-rendering errors.
Alibaba releases Qwen3.5 Plus with 1M token context window at $0.40 per million input tokens
Alibaba released an updated version of Qwen3.5 Plus on April 27, 2026, with a 1 million token context window. The multimodal model accepts text, image, and video input and is priced at $0.40 per million input tokens and $2.40 per million output tokens, with tiered pricing above 256K tokens.
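Tiered pricing means the per-token rate depends on request size; one common scheme bills the whole request at a higher rate once the input crosses the threshold. The helper below is a sketch under stated assumptions: the sub-256K rates come from the summary, the over-256K rates are placeholders, and whether Alibaba bills the whole request or only the overflow at the higher rate would need to be checked against its price sheet.

```python
# Hedged tiered-pricing sketch. Sub-256K rates are from the article; the >256K
# rates are PLACEHOLDERS, and billing the whole request at the tier set by
# input length is an assumption about how the tiers work.
TIER_THRESHOLD = 256_000

RATES = {  # (input_rate, output_rate) in USD per 1M tokens
    "base": (0.40, 2.40),          # listed rates for inputs up to 256K tokens
    "long_context": (0.80, 4.80),  # hypothetical higher-tier rates
}

def request_cost(input_tokens: int, output_tokens: int) -> float:
    tier = "base" if input_tokens <= TIER_THRESHOLD else "long_context"
    in_rate, out_rate = RATES[tier]
    return input_tokens / 1e6 * in_rate + output_tokens / 1e6 * out_rate

print(f"{request_cost(200_000, 2_000):.4f}")  # 0.0848 at the base tier
print(f"{request_cost(600_000, 2_000):.4f}")  # 0.4896 at the hypothetical long-context tier
```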
Alibaba Qwen Releases Qwen3.6 Flash with 1M Context Window at $0.25 per 1M Input Tokens
Alibaba's Qwen team has released Qwen3.6 Flash, a multimodal language model supporting text, image, and video input with a 1 million token context window. The model is priced at $0.25 per 1M input tokens and $1.50 per 1M output tokens, with tiered pricing above 256K tokens.
Alibaba's Qwen Team Releases Qwen3.6 27B With 262K Context Window and Video Processing
Alibaba's Qwen Team has released Qwen3.6 27B, a 27-billion parameter multimodal language model with a 262,144-token context window. The model accepts text, image, and video inputs and includes a built-in thinking mode for extended reasoning, with pricing at $0.195 per million input tokens and $1.56 per million output tokens.
OpenAI Releases GPT-5.5 Pro with 1M+ Token Context Window, $30 Per Million Input Tokens
OpenAI has released GPT-5.5 Pro, a high-capability model with a 1,050,000 token context window (922K input, 128K output) priced at $30 per million input tokens and $180 per million output tokens. The model supports text and image inputs and is optimized for deep reasoning, agentic coding, and multi-step workflows.
Baidu Releases Free Qianfan-OCR-Fast Model with 65K Context Window
Baidu has released Qianfan-OCR-Fast, a specialized OCR model with a 65,536 token context window, available at zero cost through OpenRouter. The model launched on April 20, 2026, and is positioned as a performance upgrade over the original Qianfan-OCR.
Alibaba releases Qwen3.6-27B with 262K context window, scores 53.5% on SWE-bench Pro
Alibaba has released Qwen3.6-27B, a 27-billion parameter language model with a native 262,144 token context window (extensible to 1,010,000 tokens). The model achieves 53.5% on SWE-bench Pro and 77.2% on SWE-bench Verified, with FP8 quantization providing near-identical performance to the full-precision version.
Xiaomi Launches MiMo-V2.5 With 1M Context Window at $0.40 per Million Input Tokens
Xiaomi released MiMo-V2.5 on April 22, 2026, a native omnimodal model with a 1,048,576 token context window. The model is priced at $0.40 per million input tokens and $2 per million output tokens, positioning it as a cost-efficient alternative for agentic applications requiring multimodal perception across image and video understanding.
Alibaba Qwen Releases 27B Parameter Model with 262K Context Window, Claims 77.2% on SWE-bench Verified
Alibaba Qwen released Qwen3.6-27B, a 27-billion parameter model with a 262,144 token context window extensible to 1,010,000 tokens. Alibaba claims 77.2% on SWE-bench Verified and 53.5% on SWE-bench Pro, with open weights available on Hugging Face.
OpenAI Releases GPT-5.4 Image 2 with 272K Context Window and Image Generation
OpenAI has released GPT-5.4 Image 2, combining the GPT-5.4 reasoning model with image generation capabilities. The multimodal model features a 272K token context window and is priced at $8 per million input tokens and $15 per million output tokens.
OpenAI releases ChatGPT Images 2.0 with 3840x2160 resolution at $30 per 1M output tokens
OpenAI released ChatGPT Images 2.0, priced at $30 per million output tokens with a maximum resolution of 3840x2160 pixels. CEO Sam Altman claims the improvement from gpt-image-1 to gpt-image-2 equals the jump from GPT-3 to GPT-5.
OpenAI announces gpt-image-2 model with improved text rendering and UI generation
OpenAI is set to announce gpt-image-2, its next-generation image generation model, on April 21, 2026 at 12pm PT. The company's teaser demonstrates improved capabilities in rendering text and generating realistic user interfaces from text prompts.
Moonshot AI Releases Kimi K2.6: 1T-Parameter MoE Model with 256K Context and Agent Swarm Capabilities
Moonshot AI has released Kimi K2.6, an open-source multimodal model with 1 trillion total parameters (32B activated) and 256K context window. The model achieves 80.2% on SWE-Bench Verified, 58.6% on SWE-Bench Pro, and supports horizontal scaling to 300 sub-agents executing 4,000 coordinated steps.
Anthropic ships Claude Opus 4.7 with improved coding reliability and multimodal capabilities
Anthropic has released Claude Opus 4.7, its latest generally available AI model focused on advanced software engineering. The model shows improvements in handling complex coding tasks with less supervision, enhanced vision capabilities, and better instruction following, while introducing a new tokenizer that increases token usage by 1.0-1.35× depending on content type.
Tencent Releases HY-World 2.0: Open-Source Multi-Modal Model Generates 3D Worlds from Text and Images
Tencent has released HY-World 2.0, an open-source multi-modal world model that generates navigable 3D environments from text prompts, single images, multi-view images, or video. The model produces editable 3D assets including meshes and 3D Gaussian Splattings that can be directly imported into game engines like Unity and Unreal Engine.
Meta releases Llama Guard 4, a 12B parameter multimodal safety classifier with 164K context window
Meta has released Llama Guard 4, a 12-billion parameter content safety classifier derived from Llama 4 Scout. The model features a 163,840 token context window and can classify both text and image content, available free through OpenRouter with an August 31, 2024 knowledge cutoff.
Google releases native Gemini app for Mac with Option+Space quick access
Google has launched a native Gemini app for macOS 15 and later, accessible via Option+Space keyboard shortcut. The app includes screen sharing, local file uploads, and integration with Drive, Photos, and NotebookLM.
Google DeepMind releases Gemini 3.1 Flash TTS with audio tags for precise speech control across 70+ languages
Google DeepMind launched Gemini 3.1 Flash TTS, a text-to-speech model that achieved an Elo score of 1,211 on the Artificial Analysis TTS leaderboard. The model introduces audio tags that allow developers to control vocal style, pace, and delivery through natural language commands embedded in text input, with support for 70+ languages.
LPM 1.0 generates 45-minute real-time lip-synced video from single photo, no public release planned
Researchers have introduced LPM 1.0, an AI model that generates real-time video of a speaking, listening, or singing character from a single image, with lip-synced speech and facial expressions stable for up to 45 minutes. The system integrates directly with voice AI models like ChatGPT but remains a research project with no planned public release.
Google releases Gemma 4, open-source on-device AI with agentic tool use for phones
Google released Gemma 4, an open-source multimodal model that runs entirely on smartphones without sending data to the cloud. The E2B and E4B variants require just 6GB and 8GB of RAM respectively and can autonomously use tools like Wikipedia, maps, and QR code generators through built-in agent skills. The model is available free via the Google AI Edge Gallery app for Android and iOS.
Liquid AI releases LFM2.5-VL-450M, improved 450M-parameter vision-language model with multilingual support
Liquid AI has released LFM2.5-VL-450M, a refreshed 450M-parameter vision-language model built on an updated LFM2.5-350M backbone. The model features a 32,768-token context window, supports 9 languages, handles native 512×512 pixel images, and adds bounding box prediction and function calling capabilities. Performance improvements span both vision and language benchmarks compared to its predecessor.
Tencent releases HY-Embodied-0.5, a 2B-parameter vision-language model for robot control
Tencent has released HY-Embodied-0.5, a family of foundation models designed specifically for embodied AI and robotic control. The suite includes a 2B-class MoT (Mixture-of-Transformers) variant that activates roughly 2.2B parameters during inference, and a 32B model that claims frontier-level performance comparable to Gemini 3.0 Pro, trained on over 200 billion tokens of embodied-specific data.
Meta AI app jumps to No. 5 on App Store following Muse Spark launch
Meta's AI app surged from No. 57 to No. 5 on the U.S. App Store within 24 hours of launching Muse Spark, Meta's new multimodal AI model. The model accepts voice, text, and image inputs and features reasoning capabilities for science and math tasks, visual coding, and multi-agent functionality.
Meta launches Muse Spark model with private API preview and 16 integrated tools
Meta announced Muse Spark today, its first model release since Llama 4 a year ago. The hosted model is available in private API preview and on meta.ai with Instant and Thinking modes, benchmarking competitively against Anthropic's Opus 4.6 and Google's Gemini 3.1 Pro, though behind on Terminal-Bench 2.0.
Meta releases Muse Spark, first closed-source model from Meta Superintelligence Labs
Meta has released Muse Spark, the first model from Meta Superintelligence Labs, a unit assembled under chief AI officer Alexandr Wang following Meta's $14.3 billion investment in Scale AI. The natively multimodal model features a "Contemplating" reasoning mode with parallel sub-agents and marks Meta's break from its open-source Llama heritage by operating as closed source.
Amazon Nova Multimodal Embeddings adds audio search capabilities to Bedrock
Amazon Nova Multimodal Embeddings, announced October 28, 2025, now supports audio content for semantic search alongside text, images, and video. The model offers four embedding dimension options (3,072, 1,024, 384, 256) and uses Matryoshka Representation Learning to balance accuracy with storage efficiency.
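Matryoshka Representation Learning trains the embedding so that the leading dimensions carry most of the signal, which is what lets one model serve 3,072-, 1,024-, 384-, and 256-dimensional outputs: the smaller vectors are effectively truncations of the full one. The numpy sketch below shows the truncate-and-renormalize pattern on a stand-in vector; it illustrates the storage/accuracy trade-off, not the Bedrock API itself.

```python
# Matryoshka-style truncation sketch (illustrative, not the Bedrock API).
# An MRL-trained embedding keeps most semantic signal in its leading dims,
# so a 3,072-d vector can be cut to 1,024/384/256 dims and re-normalized.
import numpy as np

def truncate_embedding(full: np.ndarray, dim: int) -> np.ndarray:
    """Keep the first `dim` components and renormalize for cosine similarity."""
    v = full[:dim]
    return v / np.linalg.norm(v)

rng = np.random.default_rng(0)
full = rng.standard_normal(3072)   # stand-in for a full 3,072-d Nova embedding
full /= np.linalg.norm(full)

for dim in (1024, 384, 256):
    small = truncate_embedding(full, dim)
    print(dim, small.nbytes, "bytes")  # storage shrinks linearly with dimension
```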
Meta launches Muse Spark, its first frontier model and first closed-weight AI system
Meta Superintelligence Labs has launched Muse Spark, a native multimodal reasoning model that scores 52 on the Artificial Analysis Intelligence Index, placing it in the top 5 frontier models. This marks Meta's first frontier-class model and its first AI system without open weights, a strategic shift away from its open-source Llama approach. The model achieves comparable efficiency to Gemini 3.1 Pro while matching Llama 4 Maverick capabilities with over an order of magnitude less compute.
Meta launches Muse Spark, its first model from revamped AI labs
Meta Superintelligence Labs has launched Muse Spark, its first model since Mark Zuckerberg restructured the company's AI division. The multimodal model now powers Meta AI's app and website in the US, with rollout planned for WhatsApp, Instagram, Facebook, Messenger, and Meta's smart glasses in coming weeks.
Google DeepMind releases Gemma 4 with four model sizes, up to 256K context, multimodal support
Google DeepMind released Gemma 4, an open-weights multimodal model family in four sizes (2.3B to 31B parameters) with context windows up to 256K tokens. All models support text and image input, with audio native to E2B and E4B variants. The Gemma 4 31B dense model scores 85.2% on MMLU Pro, 89.2% on AIME 2026, and 80.0% on LiveCodeBench—significant improvements over Gemma 3.
Google releases Gemma 4 26B with 256K context and multimodal support, free to use
Google DeepMind has released Gemma 4 26B A4B, a free instruction-tuned Mixture-of-Experts model with a 262,144-token context window and multimodal support for text, image, and video input. Of its 25.2B total parameters, only 3.8B activate per token, delivering performance comparable to larger 31B models at reduced compute cost.
Google releases Gemma 4 31B free model with 256K context and multimodal support
Google DeepMind has released Gemma 4 31B Instruct, a free 30.7-billion parameter model with a 256K token context window, multimodal text and image input capabilities, and native function calling. The model supports configurable reasoning mode and 140+ languages, with strong performance on coding and document understanding tasks under Apache 2.0 license.
Amazon Nova 2 Sonic enables real-time AI podcast generation with 1M token context
Amazon has published a technical guide for building real-time conversational podcasts using Amazon Nova 2 Sonic, its speech understanding and generation model. The solution demonstrates streaming audio generation, multi-turn dialogue between AI hosts, and stage-aware content filtering through a web interface.
Google DeepMind releases Gemma 4 family: multimodal models from 2.3B to 31B parameters with 256K context
Google DeepMind released the Gemma 4 family of open-weights multimodal models in four sizes: E2B (2.3B effective parameters), E4B (4.5B effective), 26B A4B (3.8B active parameters), and 31B dense. All models support text and image input with 128K-256K context windows; E2B and E4B add native audio capabilities. Models feature reasoning modes, function calling, and multilingual support across 140+ languages.
Tencent releases HY-OmniWeaving multimodal model as Gemma-4 variants emerge
Tencent has released HY-OmniWeaving, a new multimodal model available on Hugging Face. Concurrently, NVIDIA and Unsloth have published optimized variants of Gemma-4, including a 31B instruction-tuned version and quantized GGUF builds.
Tencent releases OmniWeaving, open-source video generation model with reasoning and multi-modal composition
Tencent's Hunyuan team released OmniWeaving on April 3, 2026, an open-source video generation model designed to compete with proprietary systems like Seedance-2.0. The model combines multimodal composition with reasoning-informed generation and supports eight video generation tasks, including text-to-video, image-to-video, video editing, and compositional generation.
NVIDIA releases Gemma 4 31B quantized model with 256K context, multimodal capabilities
NVIDIA has released a quantized version of Google DeepMind's Gemma 4 31B IT model, compressed to NVFP4 format for efficient inference on consumer GPUs. The 30.7B-parameter multimodal model supports 256K token context windows, handles text and image inputs with video frame processing, and maintains near-baseline performance across reasoning and coding benchmarks.
Google DeepMind releases Gemma 4 with multimodal reasoning and up to 256K context window
Google DeepMind released Gemma 4, a multimodal model family supporting text, images, video, and audio with context windows up to 256K tokens. The release includes four sizes (E2B, E4B, 26B A4B, and 31B) designed for deployment from mobile devices to servers. The 31B dense model achieves 85.2% on MMLU Pro and 89.2% on AIME 2026.
Google DeepMind releases Gemma 4 with four models up to 31B parameters, 256K context window
Google DeepMind released Gemma 4, an open-weights multimodal model family in four sizes (E2B, E4B, 26B A4B, 31B) with context windows up to 256K tokens and native reasoning capabilities. The 26B A4B variant uses a Mixture-of-Experts architecture with 3.8B active parameters for efficient inference. All models support text and image input, handle 140+ languages, and are released under the Apache 2.0 license.