model-evaluation

5 articles tagged with model-evaluation

June 2, 2026

Claude Opus 4.8 fails legal reasoning test despite improved honesty scores

Anthropic's Claude Opus 4.8 demonstrated better uncertainty handling than its predecessor in independent testing across coding, medical, and financial scenarios. However, the model exhibited a significant judgment error in a legal reasoning test involving travel insurance claims, according to results published by ZDNET.

June 2, 2026 · 12:51 PM

April 27, 2026

product update

Popsa generates 5.5M personalized photo book titles using Amazon Nova, cuts costs with 73% user satisfaction

Popsa, a photo book service operating in 50+ countries, generated over 5.5 million AI-powered titles in 2025 using Amazon Nova models. The company achieved 73% positive user feedback with Nova Pro while reducing costs and latency compared to Claude 3 Haiku.

April 27, 2026 · 5:05 PM

March 9, 2026

product update · OpenAI

OpenAI acquires Promptfoo to strengthen AI agent security capabilities

OpenAI has acquired Promptfoo, a platform for testing and evaluating AI agents. The acquisition signals frontier labs' intensifying focus on proving their technology can operate safely in critical business environments.

March 9, 2026 · 6:05 PM

March 6, 2026

benchmark

Google benchmarks AI models for Android development; names top performers

Google has completed benchmarking tests to evaluate which AI models perform best for Android app development. The company released results identifying top-performing models across coding tasks specific to the Android platform.

March 6, 2026 · 12:05 PM

February 20, 2026

research · Google DeepMind

Google DeepMind argues chatbot ethics require same rigor as coding benchmarks

Google DeepMind is pushing for moral behavior in large language models to be evaluated with the same technical rigor applied to coding and math benchmarks. As LLMs take on roles like companions, therapists, and medical advisors, the research group argues current evaluation standards are insufficient.

February 20, 2026 · 4:39 AM
