OpenAI Previews GPT-5.6 Sol: A New Naming System, Subagent Ultra Mode, and a Government-Coordinated Limited Release

Article Overview

OpenAI just previewed its next generation of models, and there is a lot packed into this one announcement. Three new models arrive at once (Sol, Terra, and Luna), built on a new naming structure that separates "generation" from "capability tier" for the first time. Sol sets a new state-of-the-art score on a major coding benchmark and, according to OpenAI, performs competitively with a previous frontier cybersecurity model while using roughly a third of the output tokens. There is also a new "Ultra" mode that runs multiple AI subagents in parallel to tackle harder problems faster.

But the part of this announcement that will get the most attention is not a benchmark number. OpenAI is rolling out GPT-5.6 in a limited preview specifically because the US government asked them to, before expanding to everyone else in the coming weeks. OpenAI says this directly and clearly states it does not want this kind of government-coordinated access to become permanent.

This article breaks down everything in the announcement — what Sol, Terra, and Luna actually are, how they perform against GPT-5.5 and against Anthropic's Claude models, what the new four-layer safety system actually does, and why a 700,000-GPU-hour red-teaming effort is the largest of its kind OpenAI has disclosed.

Introduction

Most AI model previews follow a familiar script: a new model arrives, benchmark charts go up, pricing gets announced, and the rollout begins. GPT-5.6 Sol follows that script in parts — but it also breaks from it in a way that is genuinely unusual for the industry.

OpenAI is introducing not one model but three, organized under a naming system the company has never used before. It is launching with what it describes as its most robust safety stack to date, built specifically because the model's capabilities took a meaningful step forward in cybersecurity and biology. And it is doing all of this through a limited preview that OpenAI has explicitly tied to coordination with the US government — a detail buried in the middle of the announcement that is arguably more newsworthy than any benchmark score in the piece.

Here is the complete picture, broken into what actually matters.

Meet the Family: Sol, Terra, and Luna

GPT-5.6 introduces three models at once, each occupying a distinct position on the capability and cost spectrum.

Sol is the flagship — OpenAI's most capable model in this generation, designed for the hardest tasks across coding, biology, and cybersecurity.

Terra is built for everyday work. According to OpenAI, it delivers performance competitive with GPT-5.5 while costing half as much to run.

Luna is the fast and affordable option, designed to bring meaningful capability at the lowest price point in the lineup.

What makes this release structurally different from previous OpenAI launches is the naming logic behind it. In every prior generation, the number identified the model's overall capability, and any modifier — mini, pro, turbo — simply scaled that capability up or down. With GPT-5.6, OpenAI has split these two ideas apart. The number 5.6 identifies the generation. Sol, Terra, and Luna identify capability tiers that, going forward, can be updated and improved on their own independent schedule rather than waiting for the entire generation to move together.

In practice, this means a future improvement to Luna's efficiency does not require waiting for a full version bump across the entire GPT-5.6 family. The tiers move at their own pace. OpenAI describes the goal as giving people and developers clearer choices across intelligence, speed, and cost — three variables that used to be bundled together in a single model number and are now explicitly separated.
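One way to picture the split is as two independent fields rather than one bundled label. The sketch below is purely illustrative: the identifier format `gpt-5.6-luna` and the `ModelName` and `parse_model_name` names are assumptions made for this example, not identifiers confirmed by the announcement.

```python
from dataclasses import dataclass

# Tier names come from the announcement; everything else here is illustrative.
TIERS = ("sol", "terra", "luna")


@dataclass(frozen=True)
class ModelName:
    generation: str  # e.g. "5.6": identifies the model generation
    tier: str        # "sol", "terra", or "luna": identifies the capability tier


def parse_model_name(name: str) -> ModelName:
    """Split a hypothetical identifier such as 'gpt-5.6-sol' into its parts."""
    prefix, generation, tier = name.lower().split("-")
    if prefix != "gpt" or tier not in TIERS:
        raise ValueError(f"unrecognized model name: {name!r}")
    return ModelName(generation=generation, tier=tier)


parsed = parse_model_name("gpt-5.6-luna")
print(parsed.generation, parsed.tier)
```

Because the two fields are stored separately, a tier could in principle be revised without touching the generation field, which is the property the new naming scheme is meant to express.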

What Sol Can Actually Do

A New Mode for Deeper Thinking

GPT-5.6 introduces a max reasoning effort setting, giving Sol the most time it has ever had to reason through a problem before responding. This sits at the top of OpenAI's existing reasoning effort scale, reserved for the hardest problems where taking longer produces meaningfully better answers.

Ultra Mode: Multiple Agents Working Together

The more structurally interesting addition is Ultra mode. Rather than relying on a single model instance working through a task step by step, Ultra mode deploys subagents — additional AI instances that work in parallel on different parts of a complex problem — to accelerate the overall task. This shows up in benchmark results as a distinct entry, "GPT-5.6 Sol Ultra," which consistently outperforms the standard Sol configuration.

The practical idea behind subagents is straightforward: some problems are not best solved by one model thinking longer, but by several models working on different angles of the same problem simultaneously and combining their findings. Ultra mode is OpenAI's first broad implementation of that approach at the model tier level rather than as a separate orchestration tool developers build themselves.
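The general fan-out and fan-in pattern behind subagents can be sketched in a few lines. This is not OpenAI's implementation, which the announcement does not describe. The `run_subagent` stub and the `solve_with_subagents` helper are hypothetical names, and the stub only echoes its input where a real system would call a model.

```python
from concurrent.futures import ThreadPoolExecutor


def run_subagent(subtask: str) -> str:
    """Stand-in for one model instance working on one part of the problem.

    A real system would call a model here; this stub only echoes the subtask.
    """
    return f"finding for: {subtask}"


def solve_with_subagents(subtasks: list[str], max_workers: int = 4) -> list[str]:
    """Fan the subtasks out to parallel workers, then gather the findings."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() preserves input order, so findings line up with subtasks.
        return list(pool.map(run_subagent, subtasks))


findings = solve_with_subagents(["reproduce the bug", "read the logs", "draft a patch"])
print(findings)
```

The sketch leaves out the hard parts, namely how a task is split into subtasks and how the findings are reconciled into one answer.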

The Benchmarks: How Sol Actually Performs

Terminal-Bench 2.1 — Command-Line Coding Workflows

Terminal-Bench 2.1 tests something specific and practically demanding: can an AI agent plan, iterate, and coordinate tools correctly while working through command-line based software tasks. This is closer to how a real engineer works inside a terminal than most coding benchmarks, which tend to test isolated code snippets.

GPT-5.6 Sol sets a new state-of-the-art result on this benchmark. The full competitive picture across models:

Model	Terminal-Bench 2.1 Score
GPT-5.6 Sol Ultra	91.9%
GPT-5.6 Sol	88.8%
GPT-5.5	88.0%
Claude Mythos 5	84.3%
GPT-5.6 Luna	84.3%
Claude Fable 5	83.4%
GPT-5.6 Terra	82.5%
Claude Opus 4.8	78.9%
Gemini 3.1 Pro Preview	70.7%

A few things stand out here. GPT-5.6 Sol Ultra, using subagents, posts the clear top score at 91.9% — a meaningful gap above every other model tested. Standard GPT-5.6 Sol at 88.8% is only a modest improvement over the prior-generation GPT-5.5 at 88.0%, which suggests that for this specific benchmark, the bigger leap comes from Ultra mode's subagent architecture rather than the base model upgrade alone. Notably, GPT-5.6 Luna — the cheapest model in the new lineup — matches Claude Mythos 5's score of 84.3%, which is a striking result for what is positioned as OpenAI's budget-tier offering.
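The gaps described above are easy to check directly from the reported scores. The snippet below only restates the table's numbers and subtracts them. The `gap` helper is a name introduced for this example.

```python
# Terminal-Bench 2.1 scores as reported in the table above (percent).
scores = {
    "GPT-5.6 Sol Ultra": 91.9,
    "GPT-5.6 Sol": 88.8,
    "GPT-5.5": 88.0,
    "Claude Mythos 5": 84.3,
    "GPT-5.6 Luna": 84.3,
    "Claude Fable 5": 83.4,
    "GPT-5.6 Terra": 82.5,
    "Claude Opus 4.8": 78.9,
    "Gemini 3.1 Pro Preview": 70.7,
}


def gap(model_a: str, model_b: str) -> float:
    """Difference in percentage points, rounded to one decimal place."""
    return round(scores[model_a] - scores[model_b], 1)


ultra_vs_sol = gap("GPT-5.6 Sol Ultra", "GPT-5.6 Sol")
sol_vs_previous = gap("GPT-5.6 Sol", "GPT-5.5")
print(ultra_vs_sol, sol_vs_previous)
```

Ultra mode adds 3.1 points over standard Sol, while the base model adds 0.8 points over GPT-5.5, which is the basis for attributing most of the gain on this benchmark to the subagent setup.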

GeneBench v1 — Long-Horizon Biology Analysis

GeneBench v1 evaluates genomics and quantitative-biology analyses that require sustained reasoning across long, complex tasks rather than single-step answers. The headline result here is about efficiency as much as capability: GPT-5.6 Sol achieves stronger results than GPT-5.5 while using fewer output tokens to get there.

This distinction matters practically. A model that scores higher but needs significantly more output tokens to do so is a mixed result: better output, higher cost. A model that scores higher while using fewer tokens is an efficiency gain on top of a capability gain, which is the more valuable kind of improvement for anyone paying per token.

ExploitBench — Vulnerability Research and Exploitation

ExploitBench measures long-horizon cybersecurity tasks: finding vulnerabilities and developing them into working exploits, the kind of sustained technical work that real security research requires. The most notable claim from OpenAI here is that GPT-5.6 Sol performs competitively with Mythos Preview, Anthropic's earlier Mythos-class model, while using only about one-third of the output tokens.

This is an efficiency claim worth sitting with. If accurate, it means OpenAI reached cybersecurity capability comparable to a previous frontier-tier Anthropic model with roughly a third of the output tokens. The benchmark chart includes a wide field for comparison: Mythos Preview, Claude Opus 4.7, Claude Mythos 5, Claude Opus 4.8, and GPT-5.6 Sol, Terra, and Luna, plus GPT-5.5 and GPT-5.4 from OpenAI's own prior generations. Together they give a broad cross-company picture of where cybersecurity capability currently stands.

ExploitGym — Independent Academic Validation

ExploitGym was built by researchers at UC Berkeley in collaboration with OpenAI and other frontier AI labs. Because it was not designed solely by the company being evaluated, it is one of the more independently credible cybersecurity benchmarks currently in use. Across both 2-hour and 6-hour time limits, GPT-5.6 Sol, Terra, and Luna all show strong improvements in cyber capability as reasoning effort increases, which suggests that the relationship between thinking time and cybersecurity task performance holds across the entire new model family, not just the flagship.

The Safety Side: Why This Launch Comes With So Many Guardrails

The Core Tension OpenAI Is Managing

Every capability jump in cybersecurity creates the same fundamental problem: the same skill that helps a defender find and patch a vulnerability is the skill that helps an attacker exploit one. OpenAI states its position on this directly — GPT-5.6 Sol is better at helping people find and fix vulnerabilities than it is at reliably carrying out complete, end-to-end attacks. That distinction is the foundation the entire safety approach is built on.

Where Sol Sits on OpenAI's Risk Scale

Under OpenAI's Preparedness Framework — the company's internal system for classifying how dangerous a model's capabilities are — GPT-5.6 Sol does not cross the "Cyber Critical" threshold. In testing against real software including Chromium and Firefox, the model identified bugs and exploitation primitives — the foundational building blocks an exploit would need — but did not autonomously produce a complete, functional exploit chain under the conditions tested.

OpenAI is careful to note the limits of that finding. Benchmark thresholds, by design, cannot capture every possible way a model might be used, combined with other tools, or deployed in a context the testing did not anticipate. That uncertainty, layered on top of the genuine step-change in capability this model represents, is the stated reason for pairing increased capability with stronger safeguards and a deliberately phased rollout rather than an immediate full release.

The Four-Layer Safeguard Stack

OpenAI describes its approach as layered specifically because no single safeguard holds up against a determined or adaptive attacker on its own. The four layers work together:

Layer one — model-level training. GPT-5.6 is trained to refuse prohibited cyber assistance as a default behavior, including in cases where a user tries to disguise their intent or attempt a jailbreak. This is the first line of defense, built into the model's behavior itself rather than added afterward.

Layer two — real-time classifiers. As the model generates a response, separate classifiers evaluate the output for cyber and biology misuse signals in real time. In higher-risk cases, generation can pause while a larger, more capable reasoning model reviews the full conversation and its context. If that review determines the output should not be allowed, it is withheld before the user ever sees it.

Layer three — account-level review. Flagged activity can trigger a broader review across a user's relevant conversations and risk signals, not just the single message that triggered the flag. This wider view matters because dual-use security concepts — the same technical knowledge used defensively and offensively — can look identical in isolation but reveal a clear pattern of intent when viewed across a longer history.

Layer four — differentiated access. The most sensitive capabilities are not made broadly available by default. Access is calibrated, preserving the ability of legitimate security researchers and defenders to do their work while keeping the highest-risk capabilities away from general availability.

OpenAI is direct about the tradeoff this creates during the preview period: users may encounter requests that get blocked or refused, and some requests may take longer because generation has been paused for review. The company describes this explicitly as expected and, in fact, part of what the preview is designed to test — not just whether the safeguards stop misuse, but whether legitimate users can still get their normal work done reliably.
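The layered idea can be illustrated with a toy pipeline in which each layer may return a decision and a request is allowed only if no layer objects. Everything in this sketch is hypothetical: the marker strings, the `account_flags` threshold of 3, and the strict one-after-another ordering are simplifications chosen for the example and say nothing about how OpenAI's actual systems work.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Request:
    text: str
    account_flags: int = 0        # hypothetical count of earlier flagged conversations
    trusted_access: bool = False  # hypothetical marker for vetted defenders


def model_training(req: Request) -> Optional[str]:
    # Layer one: the model itself declines prohibited requests.
    return "refuse" if "PROHIBITED" in req.text else None


def realtime_classifier(req: Request) -> Optional[str]:
    # Layer two: a classifier withholds output that looks like misuse.
    return "withhold" if "RISKY_OUTPUT" in req.text else None


def account_review(req: Request) -> Optional[str]:
    # Layer three: repeated flags trigger a wider review of the account.
    return "review_account" if req.account_flags >= 3 else None


def differentiated_access(req: Request) -> Optional[str]:
    # Layer four: sensitive capabilities require vetted access.
    if "SENSITIVE_CAPABILITY" in req.text and not req.trusted_access:
        return "restrict"
    return None


LAYERS: list[Callable[[Request], Optional[str]]] = [
    model_training,
    realtime_classifier,
    account_review,
    differentiated_access,
]


def evaluate(req: Request) -> str:
    """Return the first layer's objection, or 'allow' if none objects."""
    for layer in LAYERS:
        decision = layer(req)
        if decision is not None:
            return decision
    return "allow"


print(evaluate(Request("summarize this changelog")))
print(evaluate(Request("SENSITIVE_CAPABILITY", trusted_access=True)))
```

The point of the structure is that a request has to pass every layer, so a bypass of one check does not by itself produce an allowed response.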

The Government Coordination — The Detail Worth Paying Attention To

This is the part of the announcement that carries the most weight beyond the technical specifications.

OpenAI states plainly that, as part of its ongoing engagement with the US government, it previewed both its release plans and the new models' capabilities to government officials ahead of today's announcement. At the government's request, OpenAI is beginning with a limited preview restricted to a small group of trusted partners — and the identities of those partners have been shared with the government.

What makes this notable is not just that it happened, but how OpenAI chose to frame it. The company explicitly states that it does not believe this kind of government access process should become a permanent feature of how AI models get released. Their stated concern is direct: a government-gated release process keeps capable tools out of the hands of users, developers, enterprises, and cyber defenders who could benefit from them. OpenAI frames the current limited preview as a short-term step they believe is the fastest path toward broader availability, while working with the current Administration to build out a cyber Executive Order framework and a more repeatable process that could apply to future model releases without requiring this same ad hoc coordination each time.

It is genuinely unusual for a company to disclose this so clearly. The disclosure confirms that a sitting government had direct input into the rollout timeline of a major commercial AI product, and it shows a company pushing back, on the record, against the precedent that interaction could set going forward.

Hardening Against Real-World Attacks: The Scale of the Red-Teaming Effort

Safety systems that only work against attacks they were specifically designed to catch are not safety systems that will hold up in the real world. OpenAI's response to that problem with GPT-5.6 is the largest automated red-teaming effort the company has disclosed.

OpenAI dedicated more than 700,000 A100-equivalent GPU hours specifically to automated red-teaming aimed at finding universal jailbreaks — attacks designed to work across many different prompts and contexts rather than exploiting one narrow, specific weakness. The reasoning behind targeting universal jailbreaks specifically is that they represent the more dangerous and more general category of failure. A model that uses its own capabilities to attack itself at this scale can explore vastly more attack patterns than human red-teamers could cover manually, surfacing failure patterns earlier and shortening the time between discovering a weakness and fixing it.

This automated effort runs alongside, not instead of, human expert red-teaming. OpenAI worked with third-party testers to conduct extensive human-led adversarial testing, which continues throughout the preview period specifically to catch the kind of creative, unanticipated misuse attempts that automated systems might not think to try.

When a new jailbreak is discovered through either method, OpenAI maintains a rapid-response process: reproduce the failure, assess its severity, prioritize the fix, remediate it, and then fold that specific failure into the ongoing evaluation suite so similar attacks are caught automatically in the future.

Availability and Pricing

During the preview period, GPT-5.6 models are available through the API and through Codex, limited to the select group of trusted partners and organizations mentioned earlier. OpenAI states the plan is to expand to ChatGPT, Codex, and general API access for everyone "soon," though no specific date beyond "coming weeks" has been disclosed.

Pricing, per one million tokens, is structured clearly across the three tiers:

Model	Input Price	Output Price
Sol	$5	$30
Terra	$2.50	$15
Luna	$1	$6

For context, Terra's pricing reflects the "2x cheaper than GPT-5.5 at competitive performance" positioning OpenAI describes — a genuinely aggressive pricing move if the performance claims hold up under broader real-world use.
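To see what the tiers mean for a concrete workload, the per-million-token prices can be turned into a per-request cost. The prices below are the ones in the table above. The `request_cost` helper and the example token counts are assumptions made for illustration.

```python
# USD per one million tokens (input, output), from the pricing table above.
PRICES = {
    "sol": (5.00, 30.00),
    "terra": (2.50, 15.00),
    "luna": (1.00, 6.00),
}


def request_cost(tier: str, input_tokens: int, output_tokens: int) -> float:
    """Cost in USD of one request at the listed uncached rates."""
    input_price, output_price = PRICES[tier]
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000


# Hypothetical workload: 200,000 input tokens and 50,000 output tokens.
for tier in PRICES:
    print(tier, request_cost(tier, 200_000, 50_000))
```

For this workload the result is $2.50 on Sol, $1.25 on Terra, and $0.50 on Luna, so Terra costs exactly half of Sol and Luna one-fifth.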

New Caching Improvements

GPT-5.6 also introduces more predictable prompt caching behavior, including support for explicit cache breakpoints that developers can define and a guaranteed 30-minute minimum cache life. The economics shift slightly here too: cache writes are now billed at 1.25 times the model's standard uncached input rate, while cache reads continue to receive the existing 90% discount on cached input tokens. For high-volume applications that reuse the same prompt structure repeatedly, this caching behavior can meaningfully affect total cost.
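The effect of these multipliers is easiest to see with a worked example. The sketch below assumes a shared prompt prefix that is written to the cache once and then read on every later call, with all calls landing inside the cache lifetime. The `prefix_cost` helper and the example numbers are illustrative, and actual billing rules may differ in details the announcement does not cover.

```python
CACHE_WRITE_MULTIPLIER = 1.25  # cache writes: 1.25x the uncached input rate
CACHE_READ_MULTIPLIER = 0.10   # cache reads: 90% discount on the input rate


def prefix_cost(input_price_per_m: float, prefix_tokens: int,
                calls: int, cached: bool) -> float:
    """Cost in USD of sending the same prompt prefix `calls` times.

    Assumes one cache write on the first call and cache reads afterwards.
    """
    if calls < 1:
        raise ValueError("calls must be at least 1")
    base = prefix_tokens * input_price_per_m / 1_000_000
    if not cached:
        return base * calls
    return base * (CACHE_WRITE_MULTIPLIER + CACHE_READ_MULTIPLIER * (calls - 1))


# Hypothetical workload: a 100,000-token prefix sent 10 times at Sol's $5 input rate.
print(round(prefix_cost(5.00, 100_000, 10, cached=False), 3))
print(round(prefix_cost(5.00, 100_000, 10, cached=True), 3))
```

Under these assumptions a single call costs more with caching (1.25 times the base rate instead of 1), but from the second call onward caching is cheaper, since one write plus one read costs 1.35 times the base rate against 2 times for two uncached calls.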

Cerebras: Frontier Speed

Separately, OpenAI announced that GPT-5.6 Sol will become available on Cerebras hardware starting in July, running at speeds of up to 750 tokens per second — a substantial leap in raw generation speed compared to standard API delivery. Access will initially be limited to select customers as Cerebras and OpenAI work to expand available capacity. For latency-sensitive applications where response speed is the binding constraint rather than cost, this represents a meaningfully faster way to access frontier-level capability.
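A quick back-of-the-envelope calculation shows what that throughput means for response time. Because 750 tokens per second is an "up to" figure, the result is a lower bound on generation time. It ignores network latency and time to first token, and the `min_generation_seconds` helper is a name introduced for this example.

```python
CEREBRAS_MAX_TOKENS_PER_SECOND = 750  # "up to" figure from the announcement


def min_generation_seconds(
    output_tokens: int,
    tokens_per_second: float = CEREBRAS_MAX_TOKENS_PER_SECOND,
) -> float:
    """Lower bound on the time needed to stream the given number of tokens."""
    return output_tokens / tokens_per_second


print(min_generation_seconds(3_000))  # 4.0
```

At the peak rate, a 3,000-token response would take at least 4 seconds to generate.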

What This Means in Context

GPT-5.6 Sol's preview lands at a moment when the entire frontier AI industry is grappling with the same underlying question: how do you release genuinely more capable models without also releasing more capable tools for causing serious harm? The benchmark comparisons in this announcement, which set Sol against Claude Mythos 5, Claude Fable 5, and Gemini 3.1 Pro Preview, suggest that the leading labs are operating in roughly the same capability neighborhood on the hardest cybersecurity and coding tasks, even as each company makes different choices about how cautiously to release that capability.

OpenAI's approach this time combines two structural changes, a naming system that splits generation from capability tier and a subagent-based Ultra mode, with the most extensive safety investment the company has publicly disclosed for a single release. The government coordination adds a layer that goes beyond OpenAI's own internal decisions, reflecting a broader reality that frontier AI capability has reached a point where governments are actively involved in release timing, not just policy discussion after the fact.

Final Takeaway

GPT-5.6 Sol, Terra, and Luna represent OpenAI's most structurally significant model release in some time — not just because of what the models can do, but because of how OpenAI chose to talk about the process of releasing them. A new naming architecture that separates generation from capability tier. A subagent-powered Ultra mode that meaningfully outperforms single-agent reasoning on hard coding tasks. A four-layer safety stack built specifically to handle the tension between defensive and offensive cybersecurity use. And a limited preview shaped directly by conversations with the US government, which OpenAI has gone on record saying it hopes will not become the standard pattern for future releases.

The benchmark numbers are genuinely strong, particularly Sol Ultra's lead on Terminal-Bench 2.1 and the token-efficiency gains on GeneBench v1 and ExploitBench. But the more consequential story in this announcement may be the precedent being set around how frontier AI labs and governments coordinate the pace of capability releases — a question that will likely resurface with every major model launch from here forward.