25% faster, half the tokens, record-breaking benchmarks, and the first model OpenAI classifies as a "high" cybersecurity risk
By LDS Team | February 6, 2026
Twenty minutes. That is how long OpenAI waited after Anthropic unveiled Claude Opus 4.6 before dropping its own bombshell. On February 5, 2026, OpenAI released GPT-5.3 Codex, a model it calls the most capable agentic coding AI ever built.
But the timing is not the story. The story is that GPT-5.3 Codex helped debug its own training, manage its own deployment, and evaluate its own test results. OpenAI's Codex team says they were "blown away by how much Codex was able to accelerate its own development." This is an AI that played a role in creating itself.
It is available now to all paid ChatGPT users across the Codex app, CLI, IDE extension, and web.
What Is GPT-5.3 Codex?
GPT-5.3 Codex is OpenAI's most capable agentic coding model, combining the coding performance of GPT-5.2 Codex with the reasoning and knowledge capabilities of GPT-5.2—in one model that runs 25% faster and uses less than half the tokens.
Where previous Codex models were narrowly focused on code generation, GPT-5.3 Codex marks a shift toward something broader: an AI agent that can take on long-running tasks involving research, tool use, and complex multi-step execution. OpenAI is positioning it not just as a coding assistant, but as a "work-on-a-computer" agent.
| Feature | GPT-5.2 Codex | GPT-5.3 Codex |
|---|---|---|
| Terminal-Bench 2.0 | 64.0% | 77.3% (+13.3 points) |
| OSWorld | 38.2% | 64.7% (+26.5 points) |
| Token efficiency | Baseline | Less than half the tokens |
| Speed | Baseline | 25% faster inference |
| Self-development | No | Yes — helped build itself |
| Cybersecurity risk | Below "High" | First model rated "High" |
The model was co-designed for, trained with, and served on NVIDIA GB200 NVL72 systems—a detail that signals the tight hardware-software co-optimization behind these gains.
The Self-Developing Model: What That Actually Means
The headline that GPT-5.3 Codex "helped build itself" sounds dramatic. Here is what actually happened.
During development, the Codex team began using early versions of the model for three specific tasks:
- Bug detection — Finding errors during the training phase
- Deployment management — Handling operational aspects of the rollout
- Results evaluation — Diagnosing test results and benchmark performance
This is not artificial general intelligence bootstrapping itself into existence. It is a practical demonstration that the model is good enough at coding and debugging that OpenAI's own engineers found it useful for building the next version. The distinction matters—but the implication is real. If a coding model can meaningfully accelerate its own development, the pace of future improvements could compound.
As OpenAI put it: GPT-5.3 Codex is the "first self-developing" AI coding model.
The Benchmarks: Where GPT-5.3 Codex Stands
GPT-5.3 Codex sets new records across multiple coding and agent evaluations.
| Benchmark | What It Measures | GPT-5.3 Codex | Previous Best |
|---|---|---|---|
| Terminal-Bench 2.0 | Agentic coding and system tasks | 77.3% | 65.4% (Claude Opus 4.6) |
| SWE-Bench Pro (Public) | Real-world software engineering | New industry high | GPT-5.2 Codex |
| OSWorld | Computer-use agent tasks | 64.7% | 38.2% (GPT-5.2 Codex) |
| GDPval | Economically valuable knowledge work | Matches GPT-5.2 | — |
The Terminal-Bench result is the standout. At 77.3%, GPT-5.3 Codex outperforms Claude Opus 4.6 (65.4%) by nearly 12 percentage points on the benchmark specifically designed to measure agentic coding—how well a model can independently execute complex coding and system tasks.
The OSWorld jump tells a different but equally important story. This benchmark measures how well an AI can operate a computer—clicking, typing, navigating applications, completing multi-tool tasks. A 26.5 percentage point improvement in a single generation is remarkable.
One caveat worth noting: Anthropic and OpenAI report on different SWE-Bench variants. Anthropic uses SWE-Bench Verified; OpenAI uses SWE-Bench Pro Public. These have different problem sets, so direct score comparisons across variants are not valid. Both companies lead on their respective variants.
The efficiency gains may matter even more than the raw scores. GPT-5.3 Codex uses less than half the tokens of its predecessor for equivalent tasks, with 25% faster inference per token. For developers running agents on long tasks—codebase refactors, multi-file debugging, automated testing pipelines—fewer tokens means lower costs and longer effective context. OpenAI states it plainly: the model "does more with fewer tokens than any prior model."
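The arithmetic behind that claim can be sketched in a few lines. The per-token price and the baseline token count below are hypothetical placeholders (OpenAI has not published API pricing for GPT-5.3 Codex); only the two multipliers, half the tokens and 25% faster per-token inference, come from the announcement.

```python
# Back-of-envelope cost and latency comparison for a long agentic task.
# PRICE_PER_MILLION and BASELINE_TOKENS are ASSUMED for illustration.

PRICE_PER_MILLION = 10.00      # hypothetical USD per 1M tokens
BASELINE_TOKENS = 2_000_000    # hypothetical token spend for a big refactor

def task_cost(tokens: int, price_per_million: float) -> float:
    """Cost in USD for a task that consumes `tokens` tokens."""
    return tokens / 1_000_000 * price_per_million

old_cost = task_cost(BASELINE_TOKENS, PRICE_PER_MILLION)
# "Less than half the tokens": model it as a 0.5x token multiplier.
new_cost = task_cost(int(BASELINE_TOKENS * 0.5), PRICE_PER_MILLION)

# Half the tokens at 1.25x per-token speed: relative wall-clock time
# is 0.5 / 1.25 = 0.4, i.e. the same task in 40% of the time.
relative_time = 0.5 / 1.25

print(f"cost: ${old_cost:.2f} -> ${new_cost:.2f}")
print(f"relative wall-clock time: {relative_time:.0%} of baseline")
```

Whatever the eventual price per token, the multipliers compound: halving tokens halves cost at any price, and combined with the speedup a task that ran for an hour would finish in roughly 24 minutes.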
The Cybersecurity Question: Why OpenAI Is Being Cautious
Here is where the story takes an unusual turn. OpenAI is simultaneously celebrating GPT-5.3 Codex as its best model and flagging it as its most dangerous.
CEO Sam Altman posted on X: "GPT-5.3-Codex is our first model that hits 'high' for cybersecurity on our preparedness framework."
This means GPT-5.3 Codex is the first model OpenAI classifies as capable of meaningfully enabling real-world cyber harm—particularly if automated or used at scale. It is also the first model directly trained to identify software vulnerabilities.
What OpenAI Is Doing About It
Rather than an unrestricted rollout, OpenAI has implemented what it calls its "most comprehensive cybersecurity safety stack to date":
| Safeguard | Details |
|---|---|
| Trusted Access Framework | Advanced capabilities gated behind a vetted access program |
| $10 million in API credits | Committed to cyber defense for open source and critical infrastructure |
| Safety training | Additional protocols for cybersecurity use cases |
| Automated monitoring | Continuous surveillance for misuse patterns |
| Enforcement pipelines | Threat intelligence integrated into real-time response |
| Restricted API access | Full rollout delayed until safety measures are validated |
OpenAI acknowledged it lacks "definitive evidence" the model can fully automate cyberattacks end-to-end. But the precautionary approach—deploying safeguards before confirmed harm rather than after—marks a notable shift in how frontier AI labs handle dual-use capabilities.
Frontier: OpenAI's New Enterprise Agent Platform
Alongside GPT-5.3 Codex, OpenAI introduced Frontier—a platform that lets companies build and manage AI agents without writing code.
Workers create agents by typing natural language descriptions into a ChatGPT-like interface, connect them to enterprise applications (CRMs, data warehouses, internal tools), and deploy. The platform includes custom skills for multi-step tasks, persistent memory so agents improve over time, a monitoring dashboard, and audit logging with quality scoring.
Frontier is currently available to a limited group of enterprise customers including Oracle and HP, with startup partners like Clay Labs and Ambience Healthcare. OpenAI provides "forward deployed engineers" to help customers develop best practices. Broader availability is planned for the coming months.
Availability and Pricing
| Platform | Status |
|---|---|
| Codex App | Available now |
| Codex CLI | Available now |
| IDE Extension | Available now |
| Web (ChatGPT) | Available now |
| API | Coming soon (delayed for safety review) |
You can steer and interact with the model while it works on long-running tasks—much like directing a colleague—without losing context. It handles agentic tasks that span hours or even days.
Pricing: Official API pricing has not been disclosed. Access is included with paid ChatGPT plans (Plus, Pro, Team, Enterprise).
GPT-5.3 Codex vs. Claude Opus 4.6: Who Wins?
These two models launched within 20 minutes of each other, in a week when both companies are also set to air competing Super Bowl ads on February 9. The AI coding wars are officially on.
But neither model dominates across the board:
| Category | Leader | Why |
|---|---|---|
| Agentic coding (Terminal-Bench) | GPT-5.3 Codex (77.3%) | Nearly 12 points ahead of Opus 4.6 |
| Computer-use tasks (OSWorld) | GPT-5.3 Codex (64.7%) | Massive generational leap |
| Knowledge work (GDPval) | Claude Opus 4.6 | +144 Elo over GPT-5.2 |
| Long-context retrieval | Claude Opus 4.6 | 1M token context window |
| Speed and token efficiency | GPT-5.3 Codex | 25% faster, half the tokens |
| Multi-agent collaboration | Claude Opus 4.6 | Agent teams in Claude Code |
| Enterprise agent platform | GPT-5.3 Codex | Frontier platform |
The picture is one of specialization. Codex leads in speed, autonomous code execution, and computer-use tasks. Opus 4.6 leads in deep reasoning, long-context analysis, and multi-agent orchestration. Developers are not choosing between a better and worse model—they are choosing between different strengths.
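To put the GDPval row in perspective, an Elo gap can be translated into an expected head-to-head preference rate with the standard Elo expected-score formula. Treating the reported +144 figure as a conventional Elo rating difference is our simplifying assumption; the evaluation's actual scoring method may differ.

```python
# Standard Elo expected-score formula: the probability that the
# higher-rated side is preferred in a single pairwise comparison.

def expected_win_rate(elo_gap: float) -> float:
    """Expected preference rate for a model rated `elo_gap` points higher."""
    return 1.0 / (1.0 + 10.0 ** (-elo_gap / 400.0))

rate = expected_win_rate(144)
print(f"+144 Elo -> preferred in ~{rate:.0%} of pairwise comparisons")
# A 144-point gap corresponds to roughly a 70% preference rate.
```

Under that reading, a +144 gap means Opus 4.6's output would be preferred roughly 7 times out of 10, a substantial but not overwhelming edge, which fits the article's picture of specialization rather than dominance.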
The Bottom Line
GPT-5.3 Codex is not just a faster coding model. It is an AI that helped build itself, that operates a computer well enough to score 64.7% on OSWorld, and that OpenAI considers dangerous enough to delay its own API rollout.
The self-developing aspect is the most forward-looking detail. In this generation, it debugged training runs and evaluated benchmark results. The question the industry is now asking: what does the next version debug?
OpenAI is betting that the future of software engineering is agentic—AI systems that do not just suggest code but autonomously execute complex, multi-step tasks across tools and environments. GPT-5.3 Codex is the strongest evidence yet that this future is arriving faster than most predicted.
Sources
- OpenAI Official Announcement: Introducing GPT-5.3-Codex
- OpenAI GPT-5.3-Codex System Card
- Fortune: OpenAI's new model leaps ahead in coding—but raises unprecedented cybersecurity risks
- TechCrunch: OpenAI launches new agentic coding model only minutes after Anthropic
- VentureBeat: AI coding wars heat up ahead of Super Bowl ads
- NBC News: OpenAI says new Codex coding model helped build itself
- The Decoder: GPT-5.3-Codex helped build itself during training and deployment
- SiliconANGLE: OpenAI introduces Frontier platform and GPT-5.3-Codex
- Neowin: GPT-5.3-Codex — 25% faster and setting new coding benchmark records
- WinBuzzer: OpenAI Launches GPT-5.3-Codex as the "First Self-Developing" AI Coding Model