gpt-5

8 articles tagged with gpt-5

June 12, 2026
benchmark

Gemini 3.5 Flash ranks 6th in Android coding benchmark at 3x cost of Gemini 3.1 Pro

Google's latest Android Bench results show Gemini 3.5 Flash ranking 6th with a 63.7% success rate, despite averaging $147.10 per benchmark run compared to Gemini 3.1 Pro Preview's $47.90. The newer model used 355.9 tokens per run versus 73.3 for its predecessor, while GPT 5.5 leads the benchmark at 74% success rate.

May 27, 2026
benchmark

Frontier AI Models Score Below 50% on First Enterprise IT Benchmark for Kubernetes Incident Response

Artificial Analysis and IBM Research have released ITBench-AA, the first benchmark evaluating AI models on enterprise Site Reliability Engineering tasks. Claude Opus 4.7 leads at 47%, followed by GPT-5.5 at 46% and Qwen3.7 Max at 42%—all frontier models score below 50% on Kubernetes incident response tasks requiring root-cause diagnosis across complex infrastructure.

May 17, 2026
changelogOpenAI

Cline CLI 3.0.6 Adds Support for GPT-5.2, GPT-5.4, and GPT-5.4-mini Models

Cline released CLI version 3.0.6 with updated ChatGPT provider model list. The patch adds support for codex variants and three new GPT-5 series models: gpt-5.2, gpt-5.4, and gpt-5.4-mini.

April 30, 2026
changelogOpenAI

OpenAI Fixed GPT-5.5's Goblin Obsession by Explicitly Banning Mythical Creature References

OpenAI discovered its GPT-5.1 through GPT-5.4 models developed an increasing fixation on goblins, gremlins, and other mythical creatures. The issue traced back to reinforcement learning rewards used to develop a discontinued 'Nerdy personality' feature, which persisted across model generations.

April 2, 2026
product update

Gemini 3.1 Pro launches in Augment Code at 2.6x cheaper than Claude Opus 4.6

Augment Code now offers Gemini 3.1 Pro alongside Claude Opus 4.6 and GPT-5.4. In head-to-head testing on structural refactoring tasks, Gemini matched or outperformed Opus while consuming 268 credits per task—46% cheaper than Opus's 488 credits—making it 2.6x more cost-effective per message in real-world usage.

March 26, 2026
benchmarkOpenAI

ARC-AGI-3 benchmark: frontier AI models score below 1%, humans solve all 135 tasks

The ARC Prize Foundation released ARC-AGI-3, an interactive benchmark requiring AI agents to explore environments, form hypotheses, and execute plans without instructions. All 135 environments were solved by untrained humans, yet frontier models—including Gemini 3.1 Pro Preview (0.37%), GPT 5.4 (0.26%), Opus 4.6 (0.25%), and Grok-4.20 (0.00%)—scored below 1%.

March 17, 2026
product update

DuckDuckGo adds GPT-5 mini and GPT-5.2 reasoning models to Duck.ai privacy chatbot

DuckDuckGo's Duck.ai chatbot platform now includes OpenAI's GPT-5 mini for free users and GPT-5.2 for subscribers, both with reasoning capabilities. The platform continues to anonymize all conversations by default, stripping metadata before routing chats to model providers including Anthropic, Meta, Mistral, and OpenAI.

February 28, 2026
benchmarkOpenAI

Frontier LLMs lose up to 33% accuracy in long conversations, study finds

Frontier language models including GPT-5.2 and Claude 4.6 experience accuracy degradation of up to 33% as conversations lengthen, according to new research. The finding suggests that extended context use within a single conversation introduces performance challenges even in state-of-the-art models.