Liquid AI releases LFM2.5-VL-450M, improved 450M-parameter vision-language model with multilingual support
Liquid AI has released LFM2.5-VL-450M, a refreshed 450M-parameter vision-language model built on an updated LFM2.5-350M backbone. The model features a 32,768-token context window, supports 9 languages, processes images natively at up to 512×512 pixels, and adds bounding box prediction and function calling capabilities. Performance improvements span both vision and language benchmarks compared to its predecessor.
Liquid AI has released LFM2.5-VL-450M, a 450M-parameter vision-language model that serves as a refreshed version of LFM2-VL-450M. The model is built on an updated LFM2.5-350M language backbone and optimized for improved real-world performance.
Technical Specifications
LFM2.5-VL-450M features a 32,768-token context window and uses a SigLIP2 NaFlex vision encoder with 86M parameters. The model has a vocabulary size of 65,536 and supports 9 languages: English, Arabic, Chinese, French, German, Japanese, Korean, Portuguese, and Spanish.
The model processes images at their native resolution up to 512×512 pixels without upscaling and preserves non-standard aspect ratios without distortion. Larger images are split by a tiling strategy into non-overlapping 512×512 patches, with an encoded thumbnail providing global context. The maximum number of image tokens is tunable between 32 and 256 at inference time, without retraining.
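For intuition, here is a minimal sketch of such a tiling scheme in plain Python/Pillow. The model's own processor handles this automatically; the patching and thumbnail logic below are illustrative assumptions, not Liquid AI's implementation.

```python
from PIL import Image

TILE = 512  # native resolution handled by the vision encoder


def tile_image(img: Image.Image) -> tuple[list[Image.Image], Image.Image]:
    """Split an oversized image into non-overlapping 512x512 patches and
    keep a downscaled thumbnail for global context. Illustrative only:
    the bundled processor implements the real preprocessing."""
    width, height = img.size
    tiles = []
    for top in range(0, height, TILE):
        for left in range(0, width, TILE):
            box = (left, top, min(left + TILE, width), min(top + TILE, height))
            tiles.append(img.crop(box))
    thumbnail = img.copy()
    thumbnail.thumbnail((TILE, TILE))  # aspect-preserving global view
    return tiles, thumbnail
```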
Liquid AI provides the model in four formats: native Transformers/vLLM checkpoints, GGUF quantizations for llama.cpp and CPU inference, ONNX Runtime exports for cross-platform deployment, and MLX variants (from 4-bit quantization up to bf16) optimized for Apple Silicon.
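As a minimal sketch of the Transformers path, the snippet below loads the checkpoint through the generic image-text-to-text auto classes. The repo id, image URL, and generation settings are placeholders and assumptions; consult the model card for the exact usage.

```python
from transformers import AutoModelForImageTextToText, AutoProcessor

# Assumed repo id; confirm the exact name on Hugging Face.
model_id = "LiquidAI/LFM2.5-VL-450M"

processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForImageTextToText.from_pretrained(model_id, device_map="auto")

conversation = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://example.com/street.jpg"},  # placeholder image
            {"type": "text", "text": "Describe this image in one sentence."},
        ],
    }
]

inputs = processor.apply_chat_template(
    conversation,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

output = model.generate(**inputs, max_new_tokens=64)
print(processor.batch_decode(output, skip_special_tokens=True)[0])
```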
Capabilities and Performance
New capabilities in LFM2.5-VL-450M include:
- Enhanced instruction following on vision and language tasks
- Multilingual vision understanding across all 9 supported languages
- Bounding box prediction and object detection for grounded visual understanding (measured on RefCOCO-M)
- Function calling support for text-only input (measured on BFCLv4)
On vision benchmarks, LFM2.5-VL-450M shows consistent improvements over LFM2-VL-450M:
- MMStar: 43.00 vs 40.87
- RealWorldQA: 58.43 vs 52.03
- MMBench (dev en): 60.91 vs 56.27
- MMVet: 41.10 vs 33.85
- RefCOCO-M (bounding box prediction, new capability): 81.28
Language benchmark improvements include:
- GPQA: 25.66 vs 23.13
- MMLU Pro: 19.32 vs 17.22
- IFEval: 61.16 vs 51.75
- BFCLv4 (function calling, new capability): 21.08
Liquid AI recommends the model for general vision-language workloads, captioning, and object detection. The company explicitly states it is not well-suited for knowledge-intensive tasks or fine-grained OCR.
Deployment Options
The model supports inference through Hugging Face Transformers (v5.1+), vLLM, SGLang, and llama.cpp. Liquid AI provides fine-tuning notebooks using both Unsloth and TRL frameworks with LoRA adapters.
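As a hedged sketch of what a LoRA setup for TRL might look like, the configuration below uses the peft library; the rank, scaling, and target modules are illustrative assumptions, not the settings from Liquid AI's notebooks.

```python
from peft import LoraConfig

# Illustrative LoRA adapter settings; not Liquid AI's published recipe.
peft_config = LoraConfig(
    r=16,                         # adapter rank
    lora_alpha=32,                # scaling factor
    lora_dropout=0.05,
    target_modules="all-linear",  # attach adapters to every linear projection
    task_type="CAUSAL_LM",
)
```

In TRL, a config like this would typically be passed to `SFTTrainer` through its `peft_config` argument so that only the adapter weights are trained.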
A real-time video stream captioning WebGPU demo is available for browser-based testing. The model also offers function calling support for text-only inputs through a ChatML-like template format.
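The sketch below shows how text-only function calling is commonly wired up through the `tools` argument of `apply_chat_template`, assuming the model's ChatML-like template exposes a tools field as most such templates in Transformers do. The repo id and example function are assumptions; the authoritative tool-call syntax is defined by the model card.

```python
from transformers import AutoTokenizer

# Assumed repo id; confirm the exact name on Hugging Face.
tokenizer = AutoTokenizer.from_pretrained("LiquidAI/LFM2.5-VL-450M")


def get_weather(city: str) -> str:
    """Get the current weather for a city.

    Args:
        city: Name of the city to look up.
    """
    ...


messages = [{"role": "user", "content": "What's the weather in Zurich right now?"}]

# The `tools` argument lets the chat template serialize the function schema
# into the prompt; print the rendered text to inspect the exact format.
prompt = tokenizer.apply_chat_template(
    messages,
    tools=[get_weather],
    add_generation_prompt=True,
    tokenize=False,
)
print(prompt)
```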
Monthly downloads reached 3,522 as of the release announcement. All vision benchmark scores were obtained using VLMEvalKit, with multilingual scores based on benchmarks translated by GPT-4o-mini.
What This Means
LFM2.5-VL-450M targets the efficient vision-language segment where inference cost and latency matter more than maximum capability. The 450M-parameter size and multiformat deployment options make it viable for edge, mobile, and resource-constrained environments. The addition of object detection and function calling expands use cases beyond pure captioning. Performance gains over the prior version suggest meaningful tuning improvements, though the model remains positioned below larger competitors for knowledge-intensive applications.
Related Articles
Tencent releases HY-Embodied-0.5, a 2B-parameter vision-language model for robot control
Tencent has released HY-Embodied-0.5, a family of foundation models designed specifically for embodied AI and robotic control. The suite includes a 2B-parameter MoT (Mixture-of-Transformers) variant with only 2.2B activated parameters during inference, and a 32B model that claims frontier-level performance comparable to Gemini 3.0 Pro, trained on over 200 billion tokens of embodied-specific data.
Meta launches proprietary Muse Spark, abandoning open-source strategy after $14.3B rebuild
Meta launched Muse Spark on April 8, 2026, a natively multimodal reasoning model with tool-use and visual chain-of-thought capabilities. Unlike Llama, it is entirely proprietary with no open weights. The model scores 52 on AI Index v4.0 and excels on health benchmarks but represents Meta's departure from its open-source identity.
Meta AI app jumps to No. 5 on App Store following Muse Spark launch
Meta's AI app surged from No. 57 to No. 5 on the U.S. App Store within 24 hours of launching Muse Spark, Meta's new multimodal AI model. The model accepts voice, text, and image inputs and features reasoning capabilities for science and math tasks, visual coding, and multi-agent functionality.
Meta launches Muse Spark model with private API preview and 16 integrated tools
Meta announced Muse Spark today, its first model release since Llama 4 a year ago. The hosted model is available in private API preview and on meta.ai with Instant and Thinking modes, benchmarking competitively against Anthropic's Opus 4.6 and Google's Gemini 3.1 Pro, though behind on Terminal-Bench 2.0.