GPT-5.3 Codex: OpenAI Just Released an AI That Helped Build Itself

25% faster, half the tokens, record-breaking benchmarks, and the first model OpenAI classifies as a "high" cybersecurity risk

By LDS Team February 6, 2026

Twenty minutes. That is how long OpenAI waited after Anthropic unveiled Claude Opus 4.6 before dropping its own bombshell. On February 5, 2026, OpenAI released GPT-5.3 Codex, a model it calls the most capable agentic coding AI ever built.

But the timing is not the story. The story is that GPT-5.3 Codex helped debug its own training, manage its own deployment, and evaluate its own test results. OpenAI's Codex team says they were "blown away by how much Codex was able to accelerate its own development." This is an AI that played a role in creating itself.

It is available now to all paid ChatGPT users across the Codex app, CLI, IDE extension, and web.

What Is GPT-5.3 Codex?

GPT-5.3 Codex is OpenAI's most capable agentic coding model, combining the coding performance of GPT-5.2 Codex with the reasoning and knowledge capabilities of GPT-5.2—in one model that runs 25% faster and uses less than half the tokens.

Where previous Codex models were narrowly focused on code generation, GPT-5.3 Codex marks a shift toward something broader: an AI agent that can take on long-running tasks involving research, tool use, and complex multi-step execution. OpenAI is positioning it not just as a coding assistant, but as a "work-on-a-computer" agent.

| Feature | GPT-5.2 Codex | GPT-5.3 Codex |
| --- | --- | --- |
| Terminal-Bench 2.0 | 64.0% | 77.3% (+13.3 points) |
| OSWorld | 38.2% | 64.7% (+26.5 points) |
| Token efficiency | Baseline | Less than half the tokens |
| Speed | Baseline | 25% faster inference |
| Self-development | No | Yes (helped build itself) |
| Cybersecurity risk | Below "High" | First model rated "High" |

The model was co-designed for, trained with, and served on NVIDIA GB200 NVL72 systems—a detail that signals the tight hardware-software co-optimization behind these gains.

The Self-Developing Model: What That Actually Means

The headline that GPT-5.3 Codex "helped build itself" sounds dramatic. Here is what actually happened.

During development, the Codex team began using early versions of the model for three specific tasks:

  1. Bug detection — Finding errors during the training phase
  2. Deployment management — Handling operational aspects of the rollout
  3. Results evaluation — Diagnosing test results and benchmark performance

This is not artificial general intelligence bootstrapping itself into existence. It is a practical demonstration that the model is good enough at coding and debugging that OpenAI's own engineers found it useful for building the next version. The distinction matters—but the implication is real. If a coding model can meaningfully accelerate its own development, the pace of future improvements could compound.
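
To make that compounding claim concrete, here is a purely illustrative calculation; the 20% per-generation speedup below is an invented assumption for the example, not a figure OpenAI has reported.

```python
# Illustrative only: models how self-acceleration could compound.
# The 20% per-generation speedup is an invented assumption, not an
# OpenAI figure.

def cumulative_speedup(per_gen_speedup: float, generations: int) -> float:
    """Total speedup if each generation cuts the next one's dev time."""
    return (1 / (1 - per_gen_speedup)) ** generations

for gen in range(1, 6):
    print(f"After {gen} generation(s): {cumulative_speedup(0.20, gen):.2f}x")
# After 1 generation(s): 1.25x
# ...
# After 5 generation(s): 3.05x
```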

As OpenAI put it: GPT-5.3 Codex is the "first self-developing" AI coding model.

The Benchmarks: Where GPT-5.3 Codex Stands

GPT-5.3 Codex sets new records across multiple coding and agent evaluations.

| Benchmark | What It Measures | GPT-5.3 Codex | Previous Best |
| --- | --- | --- | --- |
| Terminal-Bench 2.0 | Agentic coding and system tasks | 77.3% | 65.4% (Claude Opus 4.6) |
| SWE-Bench Pro (Public) | Real-world software engineering | New industry high | GPT-5.2 Codex |
| OSWorld | Computer-use agent tasks | 64.7% | 38.2% (GPT-5.2 Codex) |
| GDPval | Economically valuable knowledge work | Matches GPT-5.2 | |

The Terminal-Bench result is the standout. At 77.3%, GPT-5.3 Codex outperforms Claude Opus 4.6 (65.4%) by nearly 12 percentage points on the benchmark specifically designed to measure agentic coding—how well a model can independently execute complex coding and system tasks.

The OSWorld jump tells a different but equally important story. This benchmark measures how well an AI can operate a computer—clicking, typing, navigating applications, completing multi-tool tasks. A 26.5 percentage point improvement in a single generation is remarkable.

One caveat worth noting: Anthropic and OpenAI report on different SWE-Bench variants. Anthropic uses SWE-Bench Verified; OpenAI uses SWE-Bench Pro Public. These have different problem sets, so direct score comparisons across variants are not valid. Both companies lead on their respective variants.

The efficiency gains may matter even more than the raw scores. GPT-5.3 Codex uses less than half the tokens of its predecessor for equivalent tasks, with 25% faster inference per token. For developers running agents on long tasks—codebase refactors, multi-file debugging, automated testing pipelines—fewer tokens means lower costs and longer effective context. OpenAI states it plainly: the model "does more with fewer tokens than any prior model."
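
Some back-of-the-envelope arithmetic shows why. The sketch below uses placeholder prices and token counts, since official API pricing has not been disclosed; only the "half the tokens" and "25% faster" ratios come from OpenAI's claims.

```python
# Hypothetical cost/latency comparison. The price and token counts are
# placeholders; only the "half the tokens" and "25% faster" ratios come
# from OpenAI's stated claims.

PRICE_PER_1M_TOKENS = 10.00   # USD, placeholder rate

old_tokens = 2_000_000        # tokens a long agentic run might use on GPT-5.2 Codex
new_tokens = old_tokens / 2   # "less than half the tokens"

old_cost = old_tokens / 1e6 * PRICE_PER_1M_TOKENS
new_cost = new_tokens / 1e6 * PRICE_PER_1M_TOKENS
print(f"Cost: ${old_cost:.2f} -> ${new_cost:.2f}")   # $20.00 -> $10.00

# Half the tokens at 25% faster per-token inference means the same task
# finishes in roughly 40% of the wall-clock time.
time_ratio = (new_tokens / old_tokens) * (1 / 1.25)
print(f"Time ratio: {time_ratio:.2f}")               # 0.40
```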

The Cybersecurity Question: Why OpenAI Is Being Cautious

Here is where the story takes an unusual turn. OpenAI is simultaneously celebrating GPT-5.3 Codex as its best model and flagging it as its most dangerous.

CEO Sam Altman posted on X: "GPT-5.3-Codex is our first model that hits 'high' for cybersecurity on our preparedness framework."

This means GPT-5.3 Codex is the first model OpenAI classifies as capable of meaningfully enabling real-world cyber harm—particularly if automated or used at scale. It is also the first model directly trained to identify software vulnerabilities.

What OpenAI Is Doing About It

Rather than an unrestricted rollout, OpenAI has implemented what it calls its "most comprehensive cybersecurity safety stack to date":

| Safeguard | Details |
| --- | --- |
| Trusted Access Framework | Advanced capabilities gated behind a vetted access program |
| $10 million in API credits | Committed to cyber defense for open source and critical infrastructure |
| Safety training | Additional protocols for cybersecurity use cases |
| Automated monitoring | Continuous surveillance for misuse patterns |
| Enforcement pipelines | Threat intelligence integrated into real-time response |
| Restricted API access | Full rollout delayed until safety measures are validated |

OpenAI acknowledged it lacks "definitive evidence" the model can fully automate cyberattacks end-to-end. But the precautionary approach—deploying safeguards before confirmed harm rather than after—marks a notable shift in how frontier AI labs handle dual-use capabilities.

Frontier: OpenAI's New Enterprise Agent Platform

Alongside GPT-5.3 Codex, OpenAI introduced Frontier—a platform that lets companies build and manage AI agents without writing code.

Workers create agents by typing natural language descriptions into a ChatGPT-like interface, connect them to enterprise applications (CRMs, data warehouses, internal tools), and deploy. The platform includes custom skills for multi-step tasks, persistent memory so agents improve over time, a monitoring dashboard, and audit logging with quality scoring.

Frontier is currently available to a limited group of enterprise customers including Oracle and HP, with startup partners like Clay Labs and Ambience Healthcare. OpenAI provides "forward deployed engineers" to help customers develop best practices. Broader availability is planned for the coming months.

Availability and Pricing

| Platform | Status |
| --- | --- |
| Codex App | Available now |
| Codex CLI | Available now |
| IDE Extension | Available now |
| Web (ChatGPT) | Available now |
| API | Coming soon (delayed for safety review) |

You can steer and interact with the model while it works on long-running tasks—much like directing a colleague—without losing context. It handles agentic tasks that span hours or even days.

Pricing: Official API pricing has not been disclosed. Access is included with paid ChatGPT plans (Plus, Pro, Team, Enterprise).
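
When API access does open, calls will presumably look like any other model in OpenAI's Python SDK. Here is a minimal sketch using the SDK's existing Responses API; the model slug "gpt-5.3-codex" is a guess, since the official identifier has not been published.

```python
# Sketch only: GPT-5.3 Codex is not yet available via API, and the
# model slug below is an assumption, not a confirmed identifier.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.responses.create(
    model="gpt-5.3-codex",  # hypothetical slug
    input="Find and fix the off-by-one error in utils/pagination.py",
)
print(response.output_text)
```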

GPT-5.3 Codex vs. Claude Opus 4.6: Who Wins?

These two models launched within 20 minutes of each other, in a week when both companies are also set to air competing Super Bowl ads on February 9. The AI coding wars are officially on.

But neither model dominates across the board:

| Category | Leader | Why |
| --- | --- | --- |
| Agentic coding (Terminal-Bench) | GPT-5.3 Codex (77.3%) | 12 points ahead of Opus 4.6 |
| Computer-use tasks (OSWorld) | GPT-5.3 Codex (64.7%) | Massive generational leap |
| Knowledge work (GDPval) | Claude Opus 4.6 | +144 Elo over GPT-5.2 |
| Long-context retrieval | Claude Opus 4.6 | 1M token context window |
| Speed and token efficiency | GPT-5.3 Codex | 25% faster, half the tokens |
| Multi-agent collaboration | Claude Opus 4.6 | Agent teams in Claude Code |
| Enterprise agent platform | GPT-5.3 Codex | Frontier platform |
The picture is one of specialization. Codex leads in speed, autonomous code execution, and computer-use tasks. Opus 4.6 leads in deep reasoning, long-context analysis, and multi-agent orchestration. Developers are not choosing between a better and worse model—they are choosing between different strengths.

The Bottom Line

GPT-5.3 Codex is not just a faster coding model. It is an AI that helped build itself, that operates a computer well enough to score 64.7% on OSWorld, and that OpenAI considers dangerous enough to delay its own API rollout.

The self-developing aspect is the most forward-looking detail. This generation debugged training runs and evaluated benchmark results. The question the industry is now asking: what will the next version debug?

OpenAI is betting that the future of software engineering is agentic—AI systems that do not just suggest code but autonomously execute complex, multi-step tasks across tools and environments. GPT-5.3 Codex is the strongest evidence yet that this future is arriving faster than most predicted.
