LLM News

Every LLM release, update, and milestone.

benchmark

OmniVideoBench: New 1,000-question benchmark exposes gaps in audio-visual AI reasoning

Researchers have introduced OmniVideoBench, a large-scale evaluation framework of 1,000 manually verified question-answer pairs drawn from 628 videos ranging from a few seconds to 30 minutes in length, designed to measure synergistic audio-visual reasoning in multimodal large language models (MLLMs). Testing reveals a significant performance gap between open-source and closed-source MLLMs on genuine cross-modal reasoning tasks.

research

Researchers develop data synthesis method to improve multimodal AI reasoning on charts and documents

A new research paper proposes COGS (COmposition-Grounded data Synthesis), a framework that decomposes questions into primitive perception and reasoning factors to generate synthetic training data. The method substantially improves multimodal model performance on chart reasoning and document understanding tasks with minimal human annotation.
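The paper supplies the actual pipeline; purely to illustrate the idea, the sketch below crosses perception primitives with reasoning primitives to emit grounded question-answer pairs. Every name, factor, and schema here is invented for the example and is not taken from COGS.

```python
from itertools import product

# Hypothetical factor inventories; in the paper, primitives are obtained
# by decomposing real seed questions, not written by hand like this.
PERCEPTION_FACTORS = ["read a bar's value", "read an axis label"]
REASONING_FACTORS = ["compute a difference", "identify the maximum"]

def synthesize_questions(chart_facts):
    """Cross perception and reasoning primitives into synthetic QA pairs.

    `chart_facts` is assumed to map a (perception, reasoning) pair to a
    grounded (question, answer) tuple extracted from a rendered chart;
    the schema is illustrative only.
    """
    dataset = []
    for factors in product(PERCEPTION_FACTORS, REASONING_FACTORS):
        grounding = chart_facts.get(factors)
        if grounding is None:
            continue  # skip factor combinations the chart cannot ground
        question, answer = grounding
        dataset.append({
            "question": question,
            "answer": answer,
            "factors": list(factors),  # provenance for filtering/curricula
        })
    return dataset
```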

benchmark

New benchmark evaluates music reward models trained on text, lyrics, and audio

Researchers have released CMI-RewardBench, a comprehensive evaluation framework for music reward models that handle mixed text, lyrics, and audio inputs. The benchmark includes 110,000 pseudo-labeled samples and human-annotated data, along with publicly available reward models designed for fine-grained music generation alignment.
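Reward-model benchmarks of this kind typically score pairwise preferences: the reward model should rate the preferred generation above the rejected one. A generic sketch of that protocol follows; the triple format and `score` interface are assumptions, not CMI-RewardBench's published setup.

```python
def preference_accuracy(pairs, reward_model):
    """Fraction of pairs where the reward model ranks the preferred
    music clip above the rejected one.

    `pairs` is assumed to yield (prompt, chosen, rejected) triples,
    where the prompt bundles the text/lyrics conditioning and the two
    clips are audio; `reward_model.score` is a hypothetical interface.
    """
    correct, total = 0, 0
    for prompt, chosen, rejected in pairs:
        if reward_model.score(prompt, chosen) > reward_model.score(prompt, rejected):
            correct += 1
        total += 1
    return correct / total
```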

benchmark

UniG2U-Bench reveals unified multimodal models underperform VLMs on most tasks

A new comprehensive benchmark called UniG2U-Bench evaluates whether generation capabilities improve multimodal understanding, covering 30+ models. The findings show that unified multimodal models generally underperform specialized Vision-Language Models, and that generation-then-answer inference degrades performance in most cases, though spatial reasoning and multi-round tasks show consistent improvements.
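Here, "generation-then-answer" means the model first produces an intermediate image of its own and then answers conditioned on it, instead of answering directly. A schematic of the two inference modes, with every method name hypothetical:

```python
def direct_answer(model, image, question):
    # Standard VLM-style inference: answer straight from the input image.
    return model.answer(images=[image], prompt=question)

def generate_then_answer(model, image, question):
    """Two-stage inference of the kind the benchmark probes (schematic).

    The unified model first renders an intermediate image, then answers
    with that image in context; `generate_image` and `answer` are
    hypothetical method names, not a real API.
    """
    sketch = model.generate_image(
        prompt=f"Draw what is needed to answer: {question}",
        reference=image,
    )
    return model.answer(images=[image, sketch], prompt=question)
```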

benchmark

CFE-Bench: New STEM reasoning benchmark reveals frontier models struggle with multi-step logic

Researchers introduced CFE-Bench (Classroom Final Exam), a multimodal benchmark built from authentic university homework and exam problems across 20+ STEM domains to evaluate LLM reasoning capabilities. Gemini 3.1 Pro Preview achieved the highest score at 59.69% accuracy, while analysis revealed that frontier models frequently fail to maintain correct intermediate states in multi-step solutions.

2 min read · via arxiv.org
research

MedXIAOHE: New medical vision-language model claims state-of-the-art performance on clinical benchmarks

Researchers have published MedXIAOHE, a medical multimodal foundation model designed for clinical applications. According to the authors, the model achieves state-of-the-art performance across diverse medical benchmarks and surpasses several closed-source multimodal systems on multiple capability dimensions.

model release

Alibaba releases Qwen3.5-35B-A3B-FP8, a quantized multimodal model for efficient deployment

Alibaba's Qwen team released Qwen3.5-35B-A3B-FP8 on Hugging Face, a quantized version of their 35-billion-parameter multimodal model. The FP8 quantization cuts model size and memory requirements while preserving the base model's image-text-to-text capabilities. The model is compatible with standard Transformers endpoints and Azure deployment.

1 min read · via huggingface.co
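For orientation, here is a minimal loading sketch using Hugging Face Transformers. The repo id comes from the announcement; the Auto classes and message schema are assumptions modeled on how other Qwen image-text-to-text checkpoints are served, so check the model card before relying on them.

```python
from transformers import AutoModelForImageTextToText, AutoProcessor

# Repo id from the announcement; class choice and chat-message schema
# are assumptions based on other Qwen vision-language checkpoints.
model_id = "Qwen/Qwen3.5-35B-A3B-FP8"

processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForImageTextToText.from_pretrained(
    model_id,
    device_map="auto",   # shard across available GPUs
    torch_dtype="auto",  # let the FP8 checkpoint choose its own dtypes
)

messages = [{
    "role": "user",
    "content": [
        {"type": "image", "url": "https://example.com/figure.png"},
        {"type": "text", "text": "Describe this figure."},
    ],
}]
inputs = processor.apply_chat_template(
    messages, add_generation_prompt=True, tokenize=True,
    return_dict=True, return_tensors="pt",
).to(model.device)

generated = model.generate(**inputs, max_new_tokens=128)
print(processor.decode(generated[0], skip_special_tokens=True))
```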