On Wednesday evening, while most of Silicon Valley was still digesting the week's drama around AI military contracts and user boycotts, OpenAI quietly pushed a button. GPT-5.4 went live across ChatGPT, the API, and Codex simultaneously. No countdown. No keynote. Just a blog post and a new model in the dropdown.
The timing was not accidental.
OpenAI has spent the past two weeks watching 1.5 million users walk out the door. The #CancelChatGPT movement sent ChatGPT uninstalls up 295%. Claude briefly hit number one on the App Store. CEO Sam Altman told staff the Pentagon deal backlash was "really painful," admitting publicly that the whole thing "looked opportunistic and sloppy." GPT-5.4, billed as "our most capable and efficient frontier model for professional work," is OpenAI's answer to all of it.
And on paper, it is a serious answer.
The First AI That Operates a Computer Better Than a Human
The headline number is striking and worth sitting with: 75% on OSWorld-Verified, a benchmark that tests whether an AI agent can actually operate a desktop computer. Open applications, click buttons, fill forms, switch between windows, complete multi-step workflows. The kind of work that every knowledge worker does for eight hours a day.
Human performance on that same benchmark sits at 72.4%.
GPT-5.4 is the first general-purpose large language model to cross that line. Its predecessor, GPT-5.2, scored 47.3%. That is not an incremental improvement. It is a 27.7-point gain, a 58% relative jump, in a single generation, and it signals something the industry has been racing toward since Anthropic introduced computer use with Claude Opus 4.6 last year: AI agents that do not just talk about work, but do the work.
The computer-use capability is native, meaning it is baked into the model's architecture rather than bolted on through external tooling. GPT-5.4 processes screenshots, issues mouse commands, types keyboard inputs, and handles complex multi-application workflows autonomously. No special agent framework required. No custom infrastructure.
On WebArena-Verified, which tests browser-based navigation, GPT-5.4 scored 67.3% using both DOM and screenshot-driven interaction, up from GPT-5.2's 65.4%.
Three Models in a Trench Coat
OpenAI shipped GPT-5.4 in three variants, each targeting a different use case.
The standard GPT-5.4 is the everyday model. It handles general queries, coding, analysis, and conversation. It is priced at $2.50 per million input tokens and $15 per million output tokens, a modest bump from GPT-5.2's $1.75 and $14.
GPT-5.4 Thinking is the reasoning variant. Think of it as the chain-of-thought model that works through problems step by step before answering. It scored 92.8% on GPQA Diamond (scientific reasoning) and 73.3% on ARC-AGI-2 (abstract reasoning), compared to GPT-5.2 Pro's 54.2% on the latter.
GPT-5.4 Pro is the heavy hitter, priced accordingly at $30 per million input tokens and $180 per million output tokens. It is built for sustained, high-stakes professional work: investment banking models, legal analysis, multi-hour research tasks. Pro pushes the reasoning scores even higher: 94.4% on GPQA Diamond, 83.3% on ARC-AGI-2, and 38.0% on FrontierMath Tier 4, a benchmark designed to stump professional mathematicians.
All three share a million-token context window, the largest OpenAI has ever offered. Prompts under 272,000 tokens get standard pricing. Go beyond that, and input costs double while output costs jump 1.5x for the full session.
| Model | Input Cost (per 1M tokens) | Output Cost (per 1M tokens) | Best For |
|---|---|---|---|
| GPT-5.4 | $2.50 | $15.00 | General use, coding, conversation |
| GPT-5.4 Thinking | $2.50 | $15.00 | Complex reasoning, math, science |
| GPT-5.4 Pro | $30.00 | $180.00 | Enterprise workflows, finance, law |
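The long-context surcharge makes per-session cost less obvious than the table suggests. A minimal sketch, assuming the announced multipliers (2x input, 1.5x output past the 272K-token threshold) apply to the whole session as described; `session_cost` is an illustrative helper, not an OpenAI billing API:

```python
def session_cost(input_tokens: int, output_tokens: int,
                 input_rate: float = 2.50, output_rate: float = 15.00,
                 threshold: int = 272_000) -> float:
    """Estimate a GPT-5.4 session's cost in dollars.

    Per the announcement, prompts past the 272K-token threshold
    double the input rate and raise the output rate 1.5x for the
    full session (our reading; actual billing may differ).
    """
    if input_tokens > threshold:
        input_rate *= 2.0
        output_rate *= 1.5
    return (input_tokens / 1_000_000) * input_rate + \
           (output_tokens / 1_000_000) * output_rate

# A 100K-token prompt with a 4K-token answer stays at standard rates.
print(round(session_cost(100_000, 4_000), 2))
# The same answer on a 500K-token prompt pays the surcharge on everything.
print(round(session_cost(500_000, 4_000), 2))
```

The practical takeaway: crossing the threshold by even one token reprices the entire request, so chunking a workload just under 272K tokens can be markedly cheaper than one giant prompt.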
The Benchmark Blitz
Beyond computer use, GPT-5.4's numbers paint a picture of broad improvement.
On GDPval, OpenAI's internal benchmark for knowledge work across 44 professional occupations, GPT-5.4 hit 83.0%. GPT-5.2 scored 70.9%. That 12-point gap matters because GDPval tests the kind of work that fills most white-collar days: drafting reports, analyzing data, summarizing documents, building presentations.
When human evaluators compared GPT-5.4's presentation output against earlier models, they preferred the new version 68% of the time.
Error rates dropped meaningfully. Individual claims in GPT-5.4 responses are 33% less likely to be wrong compared to GPT-5.2. Complete answers are 18% less likely to contain any errors at all.
On coding, GPT-5.4 consolidates the capabilities of GPT-5.3 Codex, the dedicated coding model OpenAI shipped in December. SWE-Bench Pro scores hit 57.7%, a modest tick above GPT-5.3 Codex's 56.8%, but the real story is that coding ability now ships inside the general model rather than as a separate product. Developers no longer need to route between different models for different tasks.
The release also introduces what OpenAI calls a "/fast" mode, which boosts token generation speed by up to 1.5x, addressing the persistent complaint that reasoning models feel slow.
Tool Search Cuts Token Bills by 47%
Buried in the announcement was a feature that will matter more to developers than any benchmark: tool search.
Modern AI applications often define dozens or hundreds of tools (API endpoints, functions, database queries) that a model can call. Until now, every tool's full schema had to be loaded into the prompt on every single request, eating thousands of tokens before the model even read the user's question. For applications with large tool ecosystems, this overhead was brutal.
GPT-5.4's tool search flips the approach. The API receives a lightweight list of available tools, and the model looks up full definitions only when it decides to use one. OpenAI reports a 47% reduction in token consumption for tool-heavy applications.
For teams running thousands of API calls per day, that translates directly to lower bills, even at GPT-5.4's slightly higher per-token pricing.
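The mechanics are easy to see with a back-of-the-envelope sketch. This is generic Python, not the actual OpenAI API: `TOOLS` and the four-characters-per-token estimate are illustrative assumptions, and real savings depend on schema size and how often tools are actually invoked.

```python
import json

# Hypothetical registry of 100 tools, each with a full JSON schema.
TOOLS = {
    f"tool_{i}": {
        "name": f"tool_{i}",
        "description": f"Performs operation number {i}",
        "parameters": {"type": "object",
                       "properties": {"x": {"type": "number"}}},
    }
    for i in range(100)
}

def rough_tokens(text: str) -> int:
    # Crude estimate: roughly 4 characters per token.
    return len(text) // 4

# Old approach: every full schema rides along with every request.
eager = rough_tokens(json.dumps(list(TOOLS.values())))

# Tool-search approach: send only the names up front; a schema is
# fetched only when the model decides to call that tool.
lazy = rough_tokens(json.dumps(list(TOOLS))) \
     + rough_tokens(json.dumps(TOOLS["tool_7"]))  # one looked-up definition

print(eager, lazy, f"{1 - lazy / eager:.0%} fewer prompt tokens")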
Wall Street Gets Its Own AI
Alongside GPT-5.4, OpenAI launched ChatGPT for Excel in beta, an add-in that brings GPT-5.4 directly into Microsoft Excel workbooks. Users can build financial models, run scenario analysis, and generate outputs using natural language commands inside their spreadsheets.
The timing is pointed. OpenAI simultaneously announced financial data integrations with Moody's, Dow Jones Factiva, MSCI, Third Bridge, and MT Newswire, with FactSet coming soon. The message to Wall Street is clear: stop copy-pasting between ChatGPT and your spreadsheets.
On OpenAI's internal investment banking benchmark, which tests tasks like building three-statement financial models with proper formatting and citations, GPT-5.4 Thinking scored 87.3%. The original GPT-5 scored 43.7% on the same test. That is a doubling of capability in seven months.
The Excel add-in is initially available to Plus, Team, Enterprise, and Edu subscribers in the United States, Canada, and Australia.
The Competitive Scoreboard Gets Messier
GPT-5.4 enters a market where no single model dominates across all dimensions.
On abstract reasoning, Google's Gemini 3.1 Pro still leads with 94.3% on GPQA Diamond, edging GPT-5.4's 92.8% and Anthropic's Claude Opus 4.6 at 91.3%, though GPT-5.4 Pro's 94.4% nominally tops it at many times the price. On ARC-AGI-2, Gemini again tops the chart at 77.1%, ahead of Opus 4.6's 75.2% and GPT-5.4's 73.3%.
But GPT-5.4 owns computer use and knowledge work. No other model touches 75% on OSWorld. And the 83% GDPval score puts it firmly ahead of both competitors on professional task completion.
On coding, Claude Opus 4.6 still holds the crown with 80.8% on SWE-Bench Verified and strong marks on production-grade bug fixing. GPT-5.4's 57.7% on SWE-Bench Pro is respectable, but the gap remains real.
| Benchmark | GPT-5.4 | Claude Opus 4.6 | Gemini 3.1 Pro |
|---|---|---|---|
| OSWorld-Verified | 75.0% | N/A | N/A |
| GPQA Diamond | 92.8% | 91.3% | 94.3% |
| ARC-AGI-2 | 73.3% | 75.2% | 77.1% |
| SWE-Bench Verified | N/A | 80.8% | N/A |
| GDPval | 83.0% | N/A | N/A |
The picture that emerges is one of specialization. GPT-5.4 is the best model for autonomous computer operation and professional knowledge work. Claude Opus 4.6 remains the coding leader. Gemini 3.1 Pro offers the strongest pure reasoning at the best price point ($2 per million input tokens, cheapest of the three).
The Cybersecurity Elephant in the Room
GPT-5.4's system card contains a notable first: it is the first general-purpose model that OpenAI has classified as "High Capability" for cybersecurity. That is not a boast. It is a warning label.
The classification means GPT-5.4 is capable enough at offensive cybersecurity tasks that OpenAI built dedicated mitigations directly into the model. A two-tier monitoring system runs in real time: a fast topic classifier identifies whether a query touches cybersecurity territory, and a secondary AI security analyst determines whether the specific response falls within acceptable bounds.
OpenAI trained GPT-5.4 to provide helpful guidance on defensive cybersecurity while refusing operational instructions for malware creation, credential theft, and chained exploitation. The company deployed expanded monitoring, trusted access controls, and asynchronous blocking for high-risk requests.
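The described two-tier design (a cheap first-pass classifier gating a slower, more careful second check) can be sketched in a few lines. This is a hypothetical toy, not OpenAI's implementation: `RISKY_TERMS`, the keyword check, and the stubbed analyst rule are all stand-ins for what would really be learned classifiers.

```python
# Toy two-tier filter mirroring the design described in the system card.
RISKY_TERMS = {"exploit", "credential", "malware"}

def fast_topic_classifier(query: str) -> bool:
    """Cheap first pass: does the query touch cybersecurity at all?
    (Stand-in for a fast learned classifier.)"""
    return any(term in query.lower() for term in RISKY_TERMS)

def security_analyst(query: str, draft_response: str) -> bool:
    """Slower second pass, run only on flagged traffic: is this
    specific response within bounds? Placeholder rule: allow
    defensive guidance, block step-by-step operational content."""
    return "step 1" not in draft_response.lower()

def moderate(query: str, draft_response: str) -> str:
    """Return the draft unless both tiers agree it should be blocked."""
    if fast_topic_classifier(query) and not security_analyst(query, draft_response):
        return "[blocked]"
    return draft_response
```

The point of the layering is cost: the expensive analyst only ever sees the small fraction of traffic the fast classifier flags, so most requests pay almost nothing for the safety check.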
This is the tension at the heart of frontier AI development. The same capability that lets GPT-5.4 autonomously operate a computer also makes it a more potent tool for bad actors. OpenAI's response is to ship the capability with guardrails rather than withhold it entirely.
Developers Are Cautiously Impressed
The developer community's reaction has been measured. On forums and social media, the dominant sentiment is not excitement about raw intelligence gains but appreciation for practical improvements.
"The raw logical reasoning does not feel dramatically smarter with each version," wrote developer Lars Nietvelt on DEV Community. "But what does improve is how much better these models get at understanding what you are asking for. That is useful. But it is not the same as becoming more intelligent."
The million-token context window drew genuine enthusiasm. Developers building AI tooling noted that feeding entire codebases, documentation chains, and log files into a single query "unlocks things that were not feasible before."
Multiple developers reported that GPT-5.4 finally resolved the persistent "lazy model" problem, where earlier versions would stall halfway through complex tasks or skip steps. Execution speed reportedly doubled compared to GPT-5.3.
OpenAI researcher Noam Brown pushed back on any narrative of slowing progress. "We see no wall," Brown stated, "and expect AI capabilities to continue to increase dramatically this year."
A Win OpenAI Desperately Needed
Gizmodo's headline captured the subtext that OpenAI could not say out loud: "OpenAI, in Desperate Need of a Win, Launches GPT-5.4."
The context matters. OpenAI's Pentagon partnership triggered the largest user exodus in the company's history. Anthropic's very public refusal to compromise its safety guardrails for military work created a stark contrast that users rewarded with their wallets and their app downloads. Altman acknowledged to staff that the backlash was "really painful" and that the deal "looked opportunistic and sloppy."
GPT-5.4 does not make any of that go away. But it does give OpenAI something concrete to point to: a model that can operate a computer better than a person, that hallucinates less, that costs less per task despite higher per-token pricing, and that comes bundled with the kind of enterprise financial tools that generate real revenue.
GPT-5.2 Thinking will remain available for three months before being retired on June 5, 2026, giving teams time to migrate.
The Bottom Line
GPT-5.4 is a genuinely strong model wrapped in a genuinely complicated moment. The benchmarks are real. The computer-use capability is a legitimate milestone, one that puts OpenAI ahead of Anthropic and Google in the specific race to build AI agents that can operate software autonomously. The financial integrations signal a company pivoting hard toward enterprise revenue.
But technology does not exist in a vacuum. OpenAI is shipping its most capable model at the exact moment its brand is most damaged. The question is whether performance can outrun reputation, whether developers and enterprises care more about OSWorld scores than Pentagon contracts.
Noam Brown says there is no wall. The users who left suggest there might be one that benchmarks cannot measure.
Sources
- Introducing GPT-5.4 | OpenAI (March 5, 2026)
- OpenAI launches GPT-5.4 with Pro and Thinking versions | TechCrunch (March 5, 2026)
- OpenAI launches GPT-5.4, its most powerful model for enterprise work | Fortune (March 5, 2026)
- OpenAI, in Desperate Need of a Win, Launches GPT-5.4 | Gizmodo (March 5, 2026)
- GPT-5.4 Thinking System Card | OpenAI (March 5, 2026)
- OpenAI launches GPT-5.4 Thinking and Pro | The Decoder (March 5, 2026)
- Introducing ChatGPT for Excel | OpenAI (March 5, 2026)
- GPT-5.4 dropped. The hype is not fully justified, but the shift is real | DEV Community (March 6, 2026)
- OpenAI's Altman takes jabs at Anthropic | CNBC (March 5, 2026)
- OpenAI launches GPT-5.4 with computer vision, tool use enhancements | SiliconANGLE (March 5, 2026)