On Wednesday evening, while most of Silicon Valley was still digesting the week's drama around AI military contracts and user boycotts, OpenAI quietly pushed a button. GPT-5.4 went live across ChatGPT, the API, and Codex simultaneously. No countdown. No keynote. Just a blog post and a new model in the dropdown.
The timing was not accidental.
OpenAI has spent the past two weeks watching 1.5 million users walk out the door. The #CancelChatGPT movement sent ChatGPT uninstalls up 295%. Claude briefly hit number one on the App Store. CEO Sam Altman told staff the Pentagon deal backlash was "really painful," admitting publicly that the whole thing "looked opportunistic and sloppy." GPT-5.4, billed as "our most capable and efficient frontier model for professional work," is OpenAI's answer to all of it.
And on paper, it is a serious answer.
The First AI That Operates a Computer Better Than a Human
The headline number is striking and worth sitting with: 75% on OSWorld-Verified, a benchmark that tests whether an AI agent can actually operate a desktop computer. Open applications, click buttons, fill forms, switch between windows, complete multi-step workflows. The kind of work that every knowledge worker does for eight hours a day.
Human performance on that same benchmark sits at 72.4%.
GPT-5.4 is the first general-purpose large language model to cross that line. Its predecessor, GPT-5.2, scored 47.3%. That is not an incremental improvement. It is a 27.7-point gain, a 58% relative jump, in a single generation, and it signals something the industry has been racing toward since Anthropic introduced computer use with Claude Opus 4.6 last year: AI agents that do not just talk about work, but do the work.
The computer-use capability is native, meaning it is baked into the model's architecture rather than bolted on through external tooling. GPT-5.4 processes screenshots, issues mouse commands, types keyboard inputs, and handles complex multi-application workflows autonomously. No special agent framework required. No custom infrastructure.
On WebArena-Verified, which tests browser-based navigation, GPT-5.4 scored 67.3% using both DOM and screenshot-driven interaction, up from GPT-5.2's 65.4%.
Three Models in a Trench Coat
OpenAI shipped GPT-5.4 in three variants, each targeting a different use case.
The standard GPT-5.4 is the everyday model. It handles general queries, coding, analysis, and conversation. It is priced at $2.50 per million input tokens and $15 per million output tokens, a modest bump from GPT-5.2's $1.75 and $14.
GPT-5.4 Thinking is the reasoning variant. Think of it as the chain-of-thought model that works through problems step by step before answering. It scored 92.8% on GPQA Diamond (scientific reasoning) and 73.3% on ARC-AGI-2 (abstract reasoning), compared to GPT-5.2 Pro's 54.2% on the latter.
GPT-5.4 Pro is the heavy hitter, priced accordingly at $30 per million input tokens and $180 per million output tokens. It is built for sustained, high-stakes professional work: investment banking models, legal analysis, multi-hour research tasks. Pro pushes the reasoning scores even higher: 94.4% on GPQA Diamond, 83.3% on ARC-AGI-2, and 38.0% on FrontierMath Tier 4, a benchmark designed to stump professional mathematicians.
All three share a million-token context window, the largest OpenAI has ever offered. Prompts under 272,000 tokens get standard pricing. Go beyond that, and input costs double while output costs jump 1.5x for the full session.
| Model | Input Cost (per 1M tokens) | Output Cost (per 1M tokens) | Best For |
|---|---|---|---|
| GPT-5.4 | $2.50 | $15.00 | General use, coding, conversation |
| GPT-5.4 Thinking | $2.50 | $15.00 | Complex reasoning, math, science |
| GPT-5.4 Pro | $30.00 | $180.00 | Enterprise workflows, finance, law |
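The long-context surcharge makes per-session cost less obvious than the table suggests. A minimal sketch, assuming the announced multipliers (2x input, 1.5x output past the 272K-token threshold) apply to the whole session as described; `session_cost` is an illustrative helper, not an OpenAI billing API:

```python
def session_cost(input_tokens: int, output_tokens: int,
                 input_rate: float = 2.50, output_rate: float = 15.00,
                 threshold: int = 272_000) -> float:
    """Estimate a GPT-5.4 session's cost in dollars.

    Per the announcement, prompts past the 272K-token threshold
    double the input rate and raise the output rate 1.5x for the
    full session (our reading; actual billing may differ).
    """
    if input_tokens > threshold:
        input_rate *= 2.0
        output_rate *= 1.5
    return (input_tokens / 1_000_000) * input_rate + \
           (output_tokens / 1_000_000) * output_rate

# A 100K-token prompt with a 4K-token answer stays at standard rates.
print(round(session_cost(100_000, 4_000), 2))
# The same answer on a 500K-token prompt pays the surcharge on everything.
print(round(session_cost(500_000, 4_000), 2))
```

The practical takeaway: crossing the threshold by even one token reprices the entire request, so chunking a workload just under 272K tokens can be markedly cheaper than one giant prompt.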
The Benchmark Blitz
Beyond computer use, GPT-5.4's numbers paint a picture of broad improvement.
On GDPval, OpenAI's internal benchmark for knowledge work across 44 professional occupations, GPT-5.4 hit 83.0%. GPT-5.2 scored 70.9%. That 12-point gap matters because GDPval tests the kind of work that fills most white-collar days: drafting reports, analyzing data, summarizing documents, building presentations.
When human evaluators compared GPT-5.4's presentation output against earlier models, they preferred the new version 68% of the time.
Error rates dropped meaningfully. Individual claims in GPT-5.4 responses are 33% less likely to be wrong compared to GPT-5.2. Complete answers are 18% less likely to contain any errors at all.
On coding, GPT-5.4 consolidates the capabilities of GPT-5.3 Codex, the dedicated coding model OpenAI shipped in December. SWE-Bench Pro scores hit 57.7%, a modest tick above GPT-5.3 Codex's 56.8%, but the real story is that coding ability now ships inside the general model rather than as a separate product. Developers no longer need to route between different models for different tasks.
The release also introduces what OpenAI calls a "/fast" mode, which boosts token generation speed by up to 1.5x, addressing the persistent complaint that reasoning models feel slow.
Tool Search Cuts Token Bills by 47%
Buried in the announcement was a feature that will matter more to developers than any benchmark: tool search.
Modern AI applications often define dozens or hundreds of tools (API endpoints, functions, database queries) that a model can call. Until now, every tool's full schema had to be loaded into the prompt on every single request, eating thousands of tokens before the model even read the user's question. For applications with large tool ecosystems, this overhead was brutal.
GPT-5.4's tool search flips the approach. The API receives a lightweight list of available tools, and the model looks up full definitions only when it decides to use one. OpenAI reports a 47% reduction in token consumption for tool-heavy applications.
For teams running thousands of API calls per day, that translates directly to lower bills, even at GPT-5.4's slightly higher per-token pricing.
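The mechanics are easy to see with a back-of-the-envelope sketch. This is generic Python, not the actual OpenAI API: `TOOLS` and the four-characters-per-token estimate are illustrative assumptions, and real savings depend on schema size and how often tools are actually invoked.

```python
import json

# Hypothetical registry of 100 tools, each with a full JSON schema.
TOOLS = {
    f"tool_{i}": {
        "name": f"tool_{i}",
        "description": f"Performs operation number {i}",
        "parameters": {"type": "object",
                       "properties": {"x": {"type": "number"}}},
    }
    for i in range(100)
}

def rough_tokens(text: str) -> int:
    # Crude estimate: roughly 4 characters per token.
    return len(text) // 4

# Old approach: every full schema rides along with every request.
eager = rough_tokens(json.dumps(list(TOOLS.values())))

# Tool-search approach: send only the names up front; a schema is
# fetched only when the model decides to call that tool.
lazy = rough_tokens(json.dumps(list(TOOLS))) \
     + rough_tokens(json.dumps(TOOLS["tool_7"]))  # one looked-up definition

print(eager, lazy, f"{1 - lazy / eager:.0%} fewer prompt tokens")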
Wall Street Gets Its Own AI
Alongside GPT-5.4, OpenAI launched ChatGPT for Excel in beta, an add-in that brings GPT-5.4 directly into Microsoft Excel workbooks. Users can build financial models, run scenario analysis, and generate outputs using natural language commands inside their spreadsheets.
The timing is pointed. OpenAI simultaneously announced financial data integrations with Moody's, Dow Jones Factiva, MSCI, Third Bridge, and MT Newswire, with FactSet coming soon. The message to Wall Street is clear: stop copy-pasting between ChatGPT and your spreadsheets.
On OpenAI's internal investment banking benchmark, which tests tasks like building three-statement financial models with proper formatting and citations, GPT-5.4 Thinking scored 87.3%. The original GPT-5 scored 43.7% on the same test. That is a doubling of capability in seven months.
The Excel add-in is initially available to Plus, Team, Enterprise, and Edu subscribers in the United States, Canada, and Australia.
The Competitive Scoreboard Gets Messier
GPT-5.4 enters a market where no single model dominates across all dimensions.
On abstract reasoning, Google's Gemini 3.1 Pro still leads with 94.3% on GPQA Diamond, edging GPT-5.4's 92.8% and Anthropic's Claude Opus 4.6 at 91.3%, though GPT-5.4 Pro's 94.4% nominally tops it at many times the price. On ARC-AGI-2, Gemini again tops the chart at 77.1%, ahead of Opus 4.6's 75.2% and GPT-5.4's 73.3%.
But GPT-5.4 owns computer use and knowledge work. No other model touches 75% on OSWorld. And the 83% GDPval score puts it firmly ahead of both competitors on professional task completion.
On coding, Claude Opus 4.6 still holds the crown with 80.8% on SWE-Bench Verified and strong marks on production-grade bug fixing. GPT-5.4's 57.7% on SWE-Bench Pro is respectable, but the gap remains real.
| Benchmark | GPT-5.4 | Claude Opus 4.6 | Gemini 3.1 Pro |
|---|---|---|---|
| OSWorld-Verified | 75.0% | N/A | N/A |
| GPQA Diamond | 92.8% | 91.3% | 94.3% |
| ARC-AGI-2 | 73.3% | 75.2% | 77.1% |
| SWE-Bench Verified | N/A | 80.8% | N/A |
| GDPval | 83.0% | N/A | N/A |
The picture that emerges is one of specialization. GPT-5.4 is the best model for autonomous computer operation and professional knowledge work. Claude Opus 4.6 remains the coding leader. Gemini 3.1 Pro offers the strongest pure reasoning at the best price point ($2 per million input tokens, cheapest of the three).
The Cybersecurity Elephant in the Room
GPT-5.4's system card contains a notable first: it is the first general-purpose model that OpenAI has classified as "High Capability" for cybersecurity. That is not a boast. It is a warning label.
The classification means GPT-5.4 is capable enough at offensive cybersecurity tasks that OpenAI built dedicated mitigations directly into the model. A two-tier monitoring system runs in real time: a fast topic classifier identifies whether a query touches cybersecurity territory, and a secondary AI security analyst determines whether the specific response falls within acceptable bounds.
OpenAI trained GPT-5.4 to provide helpful guidance on defensive cybersecurity while refusing operational instructions for malware creation, credential theft, and chained exploitation. The company deployed expanded monitoring, trusted access controls, and asynchronous blocking for high-risk requests.
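The described two-tier design (a cheap first-pass classifier gating a slower, more careful second check) can be sketched in a few lines. This is a hypothetical toy, not OpenAI's implementation: `RISKY_TERMS`, the keyword check, and the stubbed analyst rule are all stand-ins for what would really be learned classifiers.

```python
# Toy two-tier filter mirroring the design described in the system card.
RISKY_TERMS = {"exploit", "credential", "malware"}

def fast_topic_classifier(query: str) -> bool:
    """Cheap first pass: does the query touch cybersecurity at all?
    (Stand-in for a fast learned classifier.)"""
    return any(term in query.lower() for term in RISKY_TERMS)

def security_analyst(query: str, draft_response: str) -> bool:
    """Slower second pass, run only on flagged traffic: is this
    specific response within bounds? Placeholder rule: allow
    defensive guidance, block step-by-step operational content."""
    return "step 1" not in draft_response.lower()

def moderate(query: str, draft_response: str) -> str:
    """Return the draft unless both tiers agree it should be blocked."""
    if fast_topic_classifier(query) and not security_analyst(query, draft_response):
        return "[blocked]"
    return draft_response
```

The point of the layering is cost: the expensive analyst only ever sees the small fraction of traffic the fast classifier flags, so most requests pay almost nothing for the safety check.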
This is the tension at the heart of frontier AI development. The same capability that lets GPT-5.4 autonomously operate a computer also makes it a more potent tool for bad actors. OpenAI's response is to ship the capability with guardrails rather than withhold it entirely.
Developers Are Cautiously Impressed
The developer community's reaction has been measured. On forums and social media, the dominant sentiment is not excitement about raw intelligence gains but appreciation for practical improvements.
"The raw logical reasoning does not feel dramatically smarter with each version," wrote developer Lars Nietvelt on DEV Community. "But what does improve is how much better these models get at understanding what you are asking for. That is useful. But it is not the same as becoming more intelligent."
The million-token context window drew genuine enthusiasm. Developers building AI tooling noted that feeding entire codebases, documentation chains, and log files into a single query "unlocks things that were not feasible before."
Multiple developers reported that GPT-5.4 finally resolved the persistent "lazy model" problem, where earlier versions would stall halfway through complex tasks or skip steps. Execution speed reportedly doubled compared to GPT-5.3.
OpenAI researcher Noam Brown pushed back on any narrative of slowing progress. "We see no wall," Brown stated, "and expect AI capabilities to continue to increase dramatically this year."
A Win OpenAI Desperately Needed
Gizmodo's headline captured the subtext that OpenAI could not say out loud: "OpenAI, in Desperate Need of a Win, Launches GPT-5.4."
The context matters. OpenAI's Pentagon partnership triggered the largest user exodus in the company's history. Anthropic's very public refusal to compromise its safety guardrails for military work created a stark contrast that users rewarded with their wallets and their app downloads. Altman acknowledged to staff that the backlash was "really painful" and that the deal "looked opportunistic and sloppy."
GPT-5.4 does not make any of that go away. But it does give OpenAI something concrete to point to: a model that can operate a computer better than a person, that hallucinates less, that costs less per task despite higher per-token pricing, and that comes bundled with the kind of enterprise financial tools that generate real revenue.
GPT-5.2 Thinking will remain available for three months before being retired on June 5, 2026, giving teams time to migrate.
The Bottom Line
GPT-5.4 is a genuinely strong model wrapped in a genuinely complicated moment. The benchmarks are real. The computer-use capability is a legitimate milestone, one that puts OpenAI ahead of Anthropic and Google in the specific race to build AI agents that can operate software autonomously. The financial integrations signal a company pivoting hard toward enterprise revenue.
But technology does not exist in a vacuum. OpenAI is shipping its most capable model at the exact moment its brand is most damaged. The question is whether performance can outrun reputation, whether developers and enterprises care more about OSWorld scores than Pentagon contracts.
Noam Brown says there is no wall. The users who left suggest there might be one that benchmarks cannot measure.
Sources
- Introducing GPT-5.4 | OpenAI (March 5, 2026)
- OpenAI launches GPT-5.4 with Pro and Thinking versions | TechCrunch (March 5, 2026)
- OpenAI launches GPT-5.4, its most powerful model for enterprise work | Fortune (March 5, 2026)
- OpenAI, in Desperate Need of a Win, Launches GPT-5.4 | Gizmodo (March 5, 2026)
- GPT-5.4 Thinking System Card | OpenAI (March 5, 2026)
- OpenAI launches GPT-5.4 Thinking and Pro | The Decoder (March 5, 2026)
- Introducing ChatGPT for Excel | OpenAI (March 5, 2026)
- GPT-5.4 dropped. The hype is not fully justified, but the shift is real | DEV Community (March 6, 2026)
- OpenAI's Altman takes jabs at Anthropic | CNBC (March 5, 2026)
- OpenAI launches GPT-5.4 with computer vision, tool use enhancements | SiliconANGLE (March 5, 2026)