model-evaluation
3 articles tagged with model-evaluation
OpenAI acquires Promptfoo to strengthen AI agent security capabilities
OpenAI has acquired Promptfoo, a platform for testing and evaluating AI agents. The acquisition signals frontier labs' intensifying focus on proving their technology can operate safely in critical business environments.
Google benchmarks AI models for Android development; names top performers
Google has completed benchmarking tests to evaluate which AI models perform best for Android app development. The company released results identifying top-performing models across coding tasks specific to the Android platform.
Google DeepMind argues chatbot ethics require same rigor as coding benchmarks
Google DeepMind is pushing for moral behavior in large language models to be evaluated with the same technical rigor applied to coding and math benchmarks. As LLMs take on roles like companions, therapists, and medical advisors, the research group argues current evaluation standards are insufficient.