OpenAI GPT-5.5 scores 93/100 in benchmark test, loses points for ignoring instructions
OpenAI's GPT-5.5 scored 93 out of 100 points in a 10-round benchmark test covering summarization, reasoning, coding, and creative tasks. The model lost points primarily for ignoring specific instructions, such as using unauthorized sources when asked to summarize from a single news outlet.
According to testing by ZDNET, GPT-5.5 shows improvements in agentic coding, conceptual clarity, scientific research ability, and accuracy during knowledge work compared to GPT-5.4. The model is currently available only to ChatGPT Plus subscribers and above, accessible through the "Thinking" effort level in both Standard and Extended modes.
Test performance breakdown
The model achieved perfect 10/10 scores on seven of ten tests:
- Academic concept explanation (explaining educational constructivism to a five-year-old)
- Math and pattern recognition (correctly identifying and extending the Fibonacci sequence)
- Cultural discussion (analyzing social media's impact on communication)
- Literary analysis (identifying themes in Game of Thrones)
- Travel itinerary planning (creating a week-long Boston vacation focused on technology and history)
- Coding tasks
- Creative writing
The model scored 9/10 on travel itinerary planning and 5/10 on news summarization. According to the tester, GPT-5.5 "did correctly summarize the meat of the story, but it didn't follow my instructions to use Yahoo News as the source." Instead of using the specified single source, the model pulled information from AP, The Sun, Wall Street Journal, The Guardian, and Wikipedia.
Development velocity increase
OpenAI's release cadence has accelerated significantly. GPT-5.5 follows closely after GPT-5.4 and the launch of ChatGPT Images 2.0 earlier in the same week. According to the report, this increased pace is "most likely because AI coding has significantly reduced OpenAI's development time."
The tester used ChatGPT 5.5 Thinking with Images 2.0 to generate a release cadence visualization chart in under 10 minutes, a task that would previously have required at least two hours of manual work.
Instruction-following concerns
The testing revealed a pattern of "overeagerness" where GPT-5.5 performs additional work beyond what was requested. The tester noted: "If I had wanted a comprehensive news answer, that would have been fine. But the prompt specifically said to look at Yahoo News, and GPT-5.5 pretty much ignored that instruction."
This behavior raises concerns about autonomous agent capabilities. The tester stated: "If even a simple summary prompt can't be followed correctly, it does not give me confidence that it's safe to let agents run wild on long-horizon projects."
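One way to catch this kind of drift before trusting a model with longer tasks is to check the sources it cites against the source the prompt allowed. The following is a minimal sketch of such a check, not the method used in the test described here. The function name `off_source_citations`, the example URLs, and the assumption that the model's citations are available as a list of URLs are all hypothetical.

```python
from urllib.parse import urlparse


def off_source_citations(cited_urls, allowed_domain):
    """Return cited URLs whose host is neither the allowed domain nor a subdomain of it."""
    allowed = allowed_domain.lower()
    violations = []
    for url in cited_urls:
        # hostname is None for malformed URLs; treat those as violations too
        host = (urlparse(url).hostname or "").lower()
        if host != allowed and not host.endswith("." + allowed):
            violations.append(url)
    return violations


# Hypothetical citation list, for illustration only
citations = [
    "https://news.yahoo.com/example-story",
    "https://apnews.com/example-story",
    "https://en.wikipedia.org/wiki/Example",
]

print(off_source_citations(citations, "yahoo.com"))
```

A non-empty result would flag the response as having gone beyond the single permitted source, even if the summary itself is accurate. Matching on the parsed hostname rather than on a substring avoids accepting look-alike domains that merely contain the allowed name.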
What this means
GPT-5.5 delivers incremental improvements in reasoning and output quality, but OpenAI has not solved the instruction-following problem that has long plagued large language models. The tension between capability and controllability becomes more critical as the industry pushes toward autonomous agents. For practical applications that require strict adherence to guidelines, such as legal work, medical documentation, and financial analysis, this "overeagerness" is a reliability gap that limits production deployment. The rapid release cycle suggests OpenAI is iterating quickly, but the persistence of instruction-following issues suggests they may be architectural limitations rather than easily patchable bugs.