
OpenAI GPT-5.5 scores 93/100 in benchmark test, loses points for ignoring instructions

TL;DR

OpenAI's GPT-5.5 scored 93 out of 100 points in a 10-round benchmark test covering summarization, reasoning, coding, and creative tasks. The model lost points primarily for ignoring specific instructions, such as using unauthorized sources when asked to summarize from a single news outlet.


According to testing by ZDNET, GPT-5.5 shows improvements in agentic coding, conceptual clarity, scientific research ability, and accuracy during knowledge work compared to GPT-5.4. The model is currently available only to ChatGPT Plus subscribers and above, accessible through the "Thinking" effort level in both Standard and Extended modes.

Test performance breakdown

The model achieved perfect 10/10 scores on seven of the ten tests, including:

  • Academic concept explanation (explaining educational constructivism to a five-year-old)
  • Math and pattern recognition (correctly identifying and extending the Fibonacci sequence)
  • Cultural discussion (analyzing social media's impact on communication)
  • Literary analysis (identifying themes in Game of Thrones)
  • Coding tasks
  • Creative writing
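
For context on the math task, the Fibonacci sequence is one in which each term is the sum of the two preceding terms. A minimal sketch of the kind of extension the model was asked to perform (the specific numbers given to the model are not stated in the article, so the input below is illustrative):

```python
def extend_fibonacci(seq, n):
    """Append n further Fibonacci terms, each the sum of the previous two."""
    out = list(seq)
    for _ in range(n):
        out.append(out[-1] + out[-2])
    return out

# Extend a standard Fibonacci prefix by three terms.
print(extend_fibonacci([1, 1, 2, 3, 5, 8], 3))  # [1, 1, 2, 3, 5, 8, 13, 21, 34]
```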

The model scored 9/10 on travel itinerary planning and 5/10 on news summarization. According to the tester, GPT-5.5 "did correctly summarize the meat of the story, but it didn't follow my instructions to use Yahoo News as the source." Instead of using the specified single source, the model pulled information from AP, The Sun, Wall Street Journal, The Guardian, and Wikipedia.
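
Taken together, the per-test scores reported in the article (seven perfect tests, the 9/10 itinerary, and the 5/10 summary) account for 84 of the 93 points, which implies the one remaining, unreported test scored 9/10. A quick sanity check of that arithmetic:

```python
# Sanity check of the reported per-test scores against the 93/100 headline.
# The score of the one test not itemized in the article is inferred, not reported.
perfect_tests = 7 * 10            # seven tests scored 10/10
travel = 9                        # travel itinerary planning
news = 5                          # news summarization
reported = perfect_tests + travel + news
remaining = 93 - reported         # implied score of the unreported tenth test
print(reported, remaining)        # 84 9
```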

Development velocity increase

OpenAI's release cadence has accelerated significantly. GPT-5.5 follows closely after GPT-5.4 and the launch of ChatGPT Images 2.0 earlier in the same week. According to the report, this increased pace is "most likely because AI coding has significantly reduced OpenAI's development time."

The tester used ChatGPT 5.5 Thinking with Images 2.0 to generate a release cadence visualization chart in under 10 minutes, a task that previously would have required at least two hours of manual work.

Instruction-following concerns

The testing revealed a pattern of "overeagerness" where GPT-5.5 performs additional work beyond what was requested. The tester noted: "If I had wanted a comprehensive news answer, that would have been fine. But the prompt specifically said to look at Yahoo News, and GPT-5.5 pretty much ignored that instruction."

This behavior raises concerns about autonomous agent capabilities. The tester stated: "If even a simple summary prompt can't be followed correctly, it does not give me confidence that it's safe to let agents run wild on long-horizon projects."

What this means

GPT-5.5 represents incremental improvements in reasoning and output quality, but OpenAI has not solved the fundamental instruction-following problem that has plagued large language models. The tension between capability and controllability becomes more critical as the industry pushes toward autonomous agents. For practical applications requiring strict adherence to guidelines—legal work, medical documentation, financial analysis—this "overeagerness" represents a reliability gap that limits production deployment. The rapid release cycle suggests OpenAI is iterating quickly, but the persistence of instruction-following issues indicates these may be architectural limitations rather than easily patchable bugs.
