LLM News

Every LLM release, update, and milestone.


UniG2U-Bench reveals unified multimodal models underperform VLMs in most tasks

UniG2U-Bench, a new benchmark spanning more than 30 models, evaluates whether generation capabilities improve multimodal understanding. The findings show that unified multimodal models generally underperform specialized Vision-Language Models, and that generation-then-answer inference degrades performance in most cases. Spatial reasoning and multi-round tasks are the exceptions, showing consistent improvements.

Google DeepMind argues chatbot ethics require same rigor as coding benchmarks

Google DeepMind is pushing for the moral behavior of large language models to be evaluated with the same technical rigor applied to coding and math benchmarks. As LLMs take on roles such as companion, therapist, and medical advisor, the research group argues that current evaluation standards are insufficient.