Liquid AI releases LFM2.5-VL-450M, improved 450M-parameter vision-language model with multilingual support
Liquid AI has released LFM2.5-VL-450M, a refreshed 450M-parameter vision-language model built on an updated LFM2.5-350M backbone. The model features a 32,768-token context window, supports 9 languages, handles native 512×512 pixel images, and adds bounding box prediction and function calling capabilities. Performance improvements span both vision and language benchmarks compared to its predecessor.
Liquid AI Releases LFM2.5-VL-450M Vision-Language Model
Liquid AI has released LFM2.5-VL-450M, a 450M-parameter vision-language model that serves as a refreshed version of LFM2-VL-450M. The model is built on an updated LFM2.5-350M language backbone and optimized for improved real-world performance.
Technical Specifications
LFM2.5-VL-450M features a 32,768-token context window and uses a SigLIP2 NaFlex vision encoder with 86M parameters. The model has a vocabulary size of 65,536 and supports 9 languages: English, Arabic, Chinese, French, German, Japanese, Korean, Portuguese, and Spanish.
The model processes native 512×512 pixel images without upscaling and preserves non-standard aspect ratios without distortion. For larger images, a tiling strategy splits them into non-overlapping 512×512 patches with thumbnail encoding for global context. Maximum image tokens range from 32 to 256, tunable at inference time without retraining.
Liquid AI provides the model in four formats: native Transformers/vLLM checkpoints, GGUF quantization for llama.cpp and CPU inference, ONNX Runtime format for cross-platform deployment, and MLX 4-bit through bf16 variants optimized for Apple Silicon.
Capabilities and Performance
New capabilities in LFM2.5-VL-450M include:
- Enhanced instruction following on vision and language tasks
- Multilingual vision understanding across all 9 supported languages
- Bounding box prediction and object detection for grounded visual understanding (measured on RefCOCO-M)
- Function calling support for text-only input (measured by BFCLv4)
On vision benchmarks, LFM2.5-VL-450M shows consistent improvements over LFM2-VL-450M:
- MMStar: 43.00 vs 40.87
- RealWorldQA: 58.43 vs 52.03
- MMBench (dev en): 60.91 vs 56.27
- MMVet: 41.10 vs 33.85
- RefCOCO-M (bounding box prediction, new capability): 81.28
Language benchmark improvements include:
- GPQA: 25.66 vs 23.13
- MMLU Pro: 19.32 vs 17.22
- IFEval: 61.16 vs 51.75
- BFCLv4 (function calling, new capability): 21.08
Liquid AI recommends the model for general vision-language workloads, captioning, and object detection. The company explicitly states it is not well-suited for knowledge-intensive tasks or fine-grained OCR.
Deployment Options
The model supports inference through Hugging Face Transformers (v5.1+), vLLM, SGLang, and llama.cpp. Liquid AI provides fine-tuning notebooks using both Unsloth and TRL frameworks with LoRA adapters.
A real-time video stream captioning WebGPU demo is available for browser-based testing. The model also offers function calling support for text-only inputs through a ChatML-like template format.
Monthly downloads reached 3,522 as of the release announcement. All vision benchmark scores were obtained using VLMEvalKit, with multilingual scores based on benchmarks translated by GPT-4o-mini.
What This Means
LFM2.5-VL-450M targets the efficient vision-language segment where inference cost and latency matter more than maximum capability. The 450M-parameter size and multiformat deployment options make it viable for edge, mobile, and resource-constrained environments. The addition of object detection and function calling expands use cases beyond pure captioning. Performance gains over the prior version suggest meaningful tuning improvements, though the model remains positioned below larger competitors for knowledge-intensive applications.
Related Articles
Tencent Releases Hy-MT2 Translation Models: 1.8B, 7B, and 30B-A3B Support 33 Languages
Tencent released Hy-MT2, a family of multilingual translation models available in 1.8B, 7B, and 30B-A3B (MoE) sizes. All models support translation among 33 languages and follow translation instructions in multiple languages. The 1.8B model can be compressed to 440MB using 1.25-bit AngelSlim quantization.
Tencent Releases Hy-MT2: 1.8B Translation Model Compressed to 440MB With 1.25-Bit Quantization
Tencent has open-sourced Hy-MT2, a family of multilingual translation models available in 1.8B, 7B, and 30B-A3B parameter sizes. The models support translation across 33 languages and include extreme quantization down to 1.25-bit, reducing the 1.8B model to 440MB storage while increasing inference speed by 1.5x.
Cohere Releases Command A+ Open Source Model with 25B Active Parameters, 128K Context
Cohere has released Command A+ as an open source model under Apache 2.0 license. The sparse mixture-of-experts architecture features 25 billion active parameters out of 218B total parameters, supports 128K input context length, and includes vision capabilities alongside tool use and reasoning features.
Cohere Releases Command A+: 218B-Parameter MoE Model With 4-Bit Quantization Runs on Single B200 GPU
Cohere has released Command A+, an open-source sparse mixture-of-experts model with 218 billion total parameters and 25 billion active parameters. The model features W4A4 quantization allowing deployment on a single Nvidia B200 GPU, supports 128K input context, and includes built-in chain-of-thought reasoning with vision capabilities.
Comments
Loading...