model release

Alibaba releases Qwen3.5-4B, a 4B multimodal model for vision and text tasks

TL;DR

Alibaba's Qwen team has released Qwen3.5-4B, a 4 billion parameter multimodal model capable of processing both images and text. The model is available on Hugging Face under an Apache 2.0 license, making it freely available for commercial and research use.

2 min read
0

Alibaba Releases Qwen3.5-4B Multimodal Model

Alibaba's Qwen team has released Qwen3.5-4B, a 4 billion parameter multimodal model designed to handle both image and text inputs. The model was published on Hugging Face on February 27, 2026.

Model Specifications

Qwen3.5-4B is positioned as a lightweight multimodal model with 4 billion parameters. It supports image-text-to-text tasks, enabling users to provide images and text prompts and receive text responses. The model is available in base form (Qwen3.5-4B-Base) with instruction-tuned variants also released.

The model uses the safetensors format for model weights and is compatible with standard transformers pipelines and Hugging Face Endpoints.

Licensing and Availability

Qwen3.5-4B is released under the Apache 2.0 license, permitting free use for both commercial and non-commercial applications. This represents a fully open release with no usage restrictions. The model is available directly from Hugging Face's model hub.

Architecture and Capabilities

The model is tagged for conversational use cases and image-text-to-text applications. At 4 billion parameters, it targets the efficiency segment of the market—suitable for deployment on resource-constrained hardware while maintaining multimodal capabilities.

As of publication, the model has received 60 likes and 41 downloads on Hugging Face, indicating early interest from the open-source community.

Community Reception

The release includes evaluation results published alongside the model weights, following Alibaba's standard practice of providing benchmark data for model transparency. The model is marked as compatible with Hugging Face Endpoints for easy deployment.

What This Means

Qwen3.5-4B extends Alibaba's Qwen family into the efficient multimodal space at a smaller scale than previous releases. The 4B parameter count makes it suitable for edge deployment and fine-tuning on limited hardware, while Apache 2.0 licensing removes legal barriers to adoption. This positions the model as a competitive option for developers needing lightweight vision-language capabilities without commercial restrictions. The release reflects continued competition in the open-source multimodal space, where parameter efficiency and licensing terms are becoming primary differentiators.

Related Articles

model release

Google DeepMind releases Gemma 4 12B Unified: encoder-free multimodal model with 256K context window

Google DeepMind has released Gemma 4 12B Unified, an encoder-free multimodal model that processes text, images, and audio through a single decoder-only transformer. The model features 11.95 billion parameters, a 256K token context window, and achieves 77.2% on MMLU Pro and 72.0% on LiveCodeBench v6.

model release

Alibaba's Qwen Releases Qwen3.7 Plus: 1M Context Window at $0.40 Per Million Input Tokens

Alibaba's Qwen has released Qwen3.7 Plus, a multimodal model with a 1 million token context window. The model accepts text and image input with text output, priced at $0.40 per million input tokens and $1.60 per million output tokens through OpenRouter's API.

model release

Google DeepMind Releases Gemma 4: Encoder-Free Multimodal Models from 2.3B to 30.7B Parameters

Google DeepMind released Gemma 4, a family of open-weight multimodal models ranging from 2.3B to 30.7B parameters. The flagship 12B Unified model eliminates separate encoders, processing text, images, audio, and video directly through a single decoder-only transformer with up to 256K token context window.

model release

ByteDance Open-Sources Bernini-R Video Diffusion Model With Semantic Planning Architecture

ByteDance released Bernini-R, an open-source video generation and editing model that combines an MLLM-based semantic planner with a DiT-based renderer. The model requires Hopper-class GPUs (H100/H800/H200) for optimal performance and supports multiple tasks including text-to-video, video editing, and reference-guided generation.

Comments

Loading...