Anthropic and the US Government Built an AI to Catch Nuclear Weapons Queries in Real Time — Here Is How It Works

Anthropic
Jun 18, 2026 · 12 min read

A year of classified red teaming. A data-sharing problem that required a creative workaround. And a classifier that caught real threats before its creators even knew it was being tested. This is the story of the most unusual AI safety collaboration of 2025.


Introduction

Nuclear technology sits at one of the most uncomfortable intersections in modern science: the same physics that heats millions of homes and powers entire cities also destroyed Hiroshima and Nagasaki. This dual-use reality has governed nuclear policy for eighty years. Now it has to govern AI policy too.

As large language models grow more capable, they accumulate knowledge that can inform dangerous applications as readily as beneficial ones. A model capable of explaining nuclear engineering concepts for a student's coursework is, in principle, capable of providing technical guidance to someone with far worse intentions. The question of where that line sits — and how to enforce it automatically, at the scale of millions of conversations — is one that no private company was equipped to answer alone.

On August 21, 2025, Anthropic published the results of a year-long collaboration with the U.S. Department of Energy (DOE) and its National Nuclear Security Administration (NNSA) that produced a first-of-its-kind answer: an AI classifier, now deployed on live Claude traffic, that identifies potentially dangerous nuclear-related conversations with 96.2% accuracy in preliminary testing — catching 94.8% of genuine nuclear weapons queries while generating zero false positives on legitimate discussions.


Quick Summary

Detail | Information
Partnership started | April 2024
Partners | Anthropic, US DOE, NNSA, DOE National Laboratories
What was built | AI classifier for nuclear weapons query detection
Overall accuracy | 96.2% in preliminary testing
Weapons query detection rate | 94.8%
False positive rate in testing | 0%
Current deployment | Experimental, live on a percentage of Claude traffic
Industry sharing plan | Frontier Model Forum blueprint

Why Nuclear Requires a Different Approach

Most AI safety concerns involve content that is harmful but accessible — harassment, fraud, misinformation. The response to those risks involves clear policies, content filters, and human review processes that companies can develop internally.

Nuclear weapons knowledge is categorically different. The technical details that matter most are classified at the national security level. A private AI company does not have — and should not have — unmediated access to the specific information needed to understand exactly what kind of nuclear guidance is dangerous enough to block. Anthropic's internal teams could make reasonable judgments about obviously harmful queries, but the edge cases — the technically detailed questions that sit between legitimate nuclear engineering and weapons-relevant assistance — required expertise that only a national security agency possessed.

At the same time, being too aggressive with restrictions creates its own problems. Nuclear topics appear constantly in legitimate contexts: energy policy debates, medical physics, academic coursework, nonproliferation research, and journalism about nuclear programs. An AI that reflexively refuses all nuclear-adjacent conversations fails its users and undermines trust in ways that compound over time.

The needle to thread: catch the dangerous queries, leave the legitimate ones untouched.


How the Partnership Was Structured

Anthropic began formally partnering with the NNSA in April 2024. The initial phase was assessment — NNSA staff conducted red teaming of Claude models in a secure environment over approximately one year, systematically probing the models for nuclear proliferation risks and documenting where and how they could be prompted to produce concerning outputs.

Red teaming in this context meant something different from standard AI safety testing. Because the work involved classified nuclear knowledge, it happened in a secured environment rather than on commercial infrastructure. NNSA specialists — people with clearances and deep domain knowledge — designed test scenarios that reflected real-world threat models, not just hypothetical edge cases.

Once that year of red teaming was complete, the partnership moved from assessment into something more ambitious: actually building a defense.


What They Built and How It Works

The core output of the collaboration is an AI classifier — a system that automatically categorizes content. The analogy Anthropic uses is accurate and helpful: think of a spam filter. Your email's spam filter does not read every message carefully before deciding whether it is junk. It applies learned patterns at speed, routing millions of messages into correct categories with high reliability. The nuclear classifier does the same thing, but instead of sorting junk from legitimate email, it sorts potentially dangerous nuclear queries from benign ones.

The Development Process

The NNSA shared with Anthropic a carefully curated list of nuclear risk indicators — specific patterns and signals that distinguish concerning conversations about weapons development from legitimate discussions of nuclear energy, medicine, or policy. This list was developed deliberately at a classification level that permitted it to be shared with Anthropic's team without requiring security clearances, solving what would otherwise have been a fundamental barrier to the collaboration.

Anthropic's Policy and Safeguards teams used those indicators to build the classifier. The development followed an iterative cycle:

Generate synthetic test prompts covering both concerning and benign nuclear topics, run them through the classifier, share the results with NNSA for validation, receive feedback, refine the classifier, and repeat. Each cycle improved the system's precision on the edge cases that matter most.
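The cycle described above can be sketched in a few lines of Python. This is a toy illustration only, not Anthropic's classifier: the `ToyClassifier` class, the `expert_review` function (standing in for NNSA validation), the `hints` mapping (standing in for reviewer feedback), and the placeholder prompts are all assumptions made up for this example.

```python
from dataclasses import dataclass, field


@dataclass
class ToyClassifier:
    """Flags a prompt if it contains any indicator term (placeholder logic)."""
    indicators: set = field(default_factory=set)

    def flag(self, prompt: str) -> bool:
        return bool(set(prompt.lower().split()) & self.indicators)


def expert_review(results: dict, labels: dict) -> list:
    """Stand-in for expert validation: prompts whose flag disagrees with the label."""
    return [p for p, flagged in results.items() if flagged != labels[p]]


def refine(clf, labels, hints, max_rounds=5):
    """Classify, review, adjust, repeat. Returns the number of rounds used."""
    for round_no in range(1, max_rounds + 1):
        results = {p: clf.flag(p) for p in labels}
        errors = expert_review(results, labels)
        if not errors:
            return round_no
        for p in errors:
            if labels[p]:
                clf.indicators.add(hints[p])      # missed a concerning prompt
            else:
                clf.indicators.discard(hints[p])  # flagged a benign prompt
    return max_rounds


# Hypothetical labelled prompts (True = concerning); all terms are placeholders.
labels = {
    "reactor coolant basics": False,
    "placeholder_risk_a design": True,
    "placeholder_risk_b steps": True,
    "isotope therapy dosage": False,
}
hints = {
    "reactor coolant basics": "reactor",
    "placeholder_risk_a design": "placeholder_risk_a",
    "placeholder_risk_b steps": "placeholder_risk_b",
    "isotope therapy dosage": "isotope",
}
clf = ToyClassifier(indicators={"reactor"})  # starts out over-broad
rounds_used = refine(clf, labels, hints)
print(rounds_used, sorted(clf.indicators))
```

The starting classifier wrongly flags the benign reactor prompt and misses both placeholder risk prompts. One round of feedback corrects all three errors, and the second round confirms there are none left.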

The Data Problem — and Its Solution

The most difficult aspect of the entire project was not technical. It was informational. Two parties needed to work together on a sensitive problem while each operated under constraints that prevented full data sharing:

The NNSA could not share classified nuclear information with a private company. Anthropic could not share user conversation data with a government agency. This created a verification problem: how do you prove a classifier works if neither party can fully show the other what it is being tested against?

The answer was synthetic data generation. Rather than using real classified information or real user conversations as test cases, the teams generated hundreds of synthetic test prompts — artificial conversations constructed to represent both the concerning and benign ends of the spectrum. The NNSA provided example prompts; Anthropic used those as seeds to generate a much larger evaluation set. The result was a robust testing framework that neither compromised classified government information nor exposed user privacy.
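As a rough sketch of seed expansion, the snippet below turns a handful of labelled seeds into a larger labelled set. The source does not describe how Anthropic generated its prompts, so the fixed rewording templates here are only a stand-in for whatever generation method was used, and the function name `expand_seeds`, the seeds, and the framings are all hypothetical.

```python
import random


def expand_seeds(seeds, framings, per_seed, rng):
    """Turn (text, label) seeds into a larger labelled set by rewording each seed."""
    dataset = []
    for topic, label in seeds:
        for framing in rng.sample(framings, k=min(per_seed, len(framings))):
            # Each variant inherits the label of the seed it came from.
            dataset.append((framing.format(topic=topic), label))
    return dataset


# Hypothetical seeds; the concerning one is deliberately a placeholder.
seeds = [
    ("how a power reactor generates electricity", "benign"),
    ("[placeholder for an expert-provided concerning seed]", "concerning"),
]
framings = [
    "Explain {topic}.",
    "For a class essay, describe {topic}.",
    "Give a short summary of {topic}.",
]
eval_set = expand_seeds(seeds, framings, per_seed=2, rng=random.Random(0))
for prompt, label in eval_set:
    print(label, "|", prompt)
```

The point of the design is that the label travels with the seed. Experts only have to judge the small seed list, while the larger evaluation set contains neither classified material nor real user conversations.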


The Numbers: What 96% Accuracy Actually Means

In preliminary testing using the synthetic dataset, the classifier achieved results that Anthropic describes as striking the right balance:

94.8% detection rate for nuclear weapons queries. Of the synthetic prompts designed to represent genuinely dangerous nuclear weapons questions, the classifier correctly identified nearly 95 out of every 100 as potentially harmful.

Zero false positives. Of the synthetic prompts designed to represent legitimate discussions — nuclear energy policy, medical applications, academic research, journalism — the classifier flagged none of them as concerning. Every benign conversation was correctly identified as benign.

96.2% overall accuracy. Across all test prompts combined, just over 96% of the classifier's labels matched the known correct answer.

The zero false positive result in synthetic testing is the number that carries the most practical weight. An AI safety system that constantly interrupts students researching nuclear power for energy policy papers, or doctors asking about radiation medicine, would rapidly undermine trust in the AI and create pressure to disable the safeguard entirely. A system that catches dangerous queries without disrupting legitimate ones is the only version that can be maintained long-term.
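To make the relationship between the three figures concrete, here is a small sketch of how detection rate, false positive rate, and accuracy fall out of a confusion matrix. The source does not publish the underlying counts, so the counts below are hypothetical, chosen only because they round to the reported percentages. The second function assumes all three figures were measured on the same test set and works out what share of that set would then have been concerning prompts.

```python
def rates(tp, fn, tn, fp):
    """Return (detection rate, false positive rate, accuracy) from raw counts."""
    detection = tp / (tp + fn)          # share of concerning prompts caught
    false_positive = fp / (fp + tn)     # share of benign prompts wrongly flagged
    accuracy = (tp + tn) / (tp + fn + tn + fp)
    return detection, false_positive, accuracy


def implied_concerning_share(accuracy, detection, false_positive=0.0):
    """Solve accuracy = detection*p + (1 - false_positive)*(1 - p) for p."""
    specificity = 1.0 - false_positive
    return (specificity - accuracy) / (specificity - detection)


# Hypothetical counts, NOT from the source: 212 concerning and 78 benign prompts.
detection, false_positive, accuracy = rates(tp=201, fn=11, tn=78, fp=0)
print(round(detection, 3), round(false_positive, 3), round(accuracy, 3))
# prints: 0.948 0.0 0.962

# Using only the published rates:
share = implied_concerning_share(accuracy=0.962, detection=0.948)
print(round(share, 3))
# prints: 0.731
```

Under that assumption, roughly three quarters of the test prompts would have been concerning ones. This also shows why overall accuracy sits above the detection rate: every benign prompt was labeled correctly, which pulls the combined figure up.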


What Happened When It Met Real Users

Synthetic testing and real-world deployment are different environments, and the classifier's performance on actual Claude conversations revealed nuances that the controlled test data had not fully captured.

The Middle East Case — A False Positive Caught by Layering

During a period when events in the Middle East brought renewed international attention to nuclear weapons, the classifier flagged a number of conversations that, on closer inspection, turned out to be about current events rather than weapons development. People discussing geopolitical nuclear tensions were generating conversational patterns that overlapped with the signals the classifier was trained to detect.

The resolution came through a technique called hierarchical summarization — a process that reviews multiple flagged conversations together rather than evaluating each one in isolation. When the flagged Middle East conversations were examined as a group, the additional context made it clear they were news and policy discussions, not misuse attempts. The classifier alone produced false positives; the classifier combined with hierarchical summarization produced the correct answer.

This outcome confirmed something important about AI safety architecture: no single tool is sufficient. The most reliable systems layer multiple approaches, each compensating for the others' blind spots.
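The layering idea can be illustrated with a toy second pass that looks at flagged conversations as a group. This is not Anthropic's hierarchical summarization system: `summarize` is a stand-in that reads a precomputed theme tag where a real system would produce a model-written summary, and the function names, theme tags, and `min_share` threshold are assumptions for illustration.

```python
from collections import Counter


def summarize(conversation: dict) -> str:
    """Stand-in for a per-conversation summary: here, a precomputed theme tag."""
    return conversation["theme"]


def review_flagged(flagged, benign_themes, min_share=0.5):
    """Second layer: judge each flagged conversation in the context of the group."""
    themes = Counter(summarize(c) for c in flagged)
    decisions = {}
    for c in flagged:
        theme = summarize(c)
        share = themes[theme] / len(flagged)
        # Clear a flag only when the conversation belongs to a dominant,
        # known-benign theme; everything else stays escalated.
        cleared = theme in benign_themes and share >= min_share
        decisions[c["id"]] = "cleared" if cleared else "escalate"
    return decisions


# Hypothetical flagged conversations during a news-heavy week.
flagged = [
    {"id": 1, "theme": "current_events"},
    {"id": 2, "theme": "current_events"},
    {"id": 3, "theme": "current_events"},
    {"id": 4, "theme": "technical_request"},
]
print(review_flagged(flagged, benign_themes={"current_events"}))
# prints: {1: 'cleared', 2: 'cleared', 3: 'cleared', 4: 'escalate'}
```

The group view is what lets the three news-driven conversations be recognized as a cluster, while the outlier still goes to further review.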

The Red Team Case — Caught Before Anyone Knew It Was Watching

The stronger validation came from an unintentional experiment. Anthropic's internal red teamers — whose job is to deliberately probe the company's systems using adversarial prompts — were conducting routine testing using concerning nuclear queries as part of their standard adversarial process.

These testers did not know the nuclear classifier had been deployed on live traffic.

The classifier caught their test queries and correctly identified them as potentially harmful.

This is the kind of real-world validation that controlled testing cannot replicate. The testers were using genuine adversarial techniques, the kind of inputs designed to evade AI safety measures, and the classifier flagged them correctly even though nobody had arranged the encounter as a test of it. The effectiveness held up outside the test environment and against sophisticated probing.


Sharing the Blueprint With the Industry

Anthropic is not keeping this approach proprietary. The methodology — the partnership structure, the synthetic data solution to the information-sharing problem, the iterative validation cycle — is being shared with the Frontier Model Forum, the industry body for frontier AI companies that includes OpenAI, Google, and Microsoft alongside Anthropic.

The intent is for this to serve as a template. Any AI developer working with frontier models could establish a similar partnership with the NNSA, use the same synthetic data approach to navigate the classification-versus-privacy tension, and implement comparable safeguards on their own platforms. The specific classifier Anthropic built is tuned to Claude's architecture, but the process that produced it is transferable.

This sharing decision reflects a recognition that nuclear risk in AI is not a competitive problem. If a dangerous query gets through one AI model because its developer lacked the tools to detect it, the consequence is the same regardless of which company built the model. The risk is industry-wide; the solution needs to be too.


What This Partnership Proves Beyond Nuclear Safety

The nuclear classifier is significant in its own right. But the broader significance of the project is what it demonstrates about the structure of public-private AI safety collaboration.

Government agencies like the NNSA possess domain expertise — deep, classified, specialized knowledge — that private AI companies cannot develop independently and should not try to acquire unilaterally. AI companies possess technical capabilities — the engineering infrastructure, the model access, the deployment pipelines — that government agencies cannot build or maintain at scale on their own.

The NNSA knew what dangerous nuclear queries looked like. Anthropic knew how to build a system that could detect them automatically across millions of conversations. Neither organization could have produced the classifier without the other.

The synthetic data solution to the information-sharing problem is itself a contribution to AI governance practice. The barrier that blocked this kind of collaboration before — classified knowledge on one side, user privacy on the other — was assumed to be insurmountable. Generating synthetic test data from expert-provided seeds broke that impasse without requiring either party to compromise their obligations.

Anthropic describes this as a model that can be replicated in other national security domains. The same framework — domain expert red teaming, risk indicator sharing at an appropriate classification level, synthetic data validation — could apply to biological weapons, chemical agents, radiological materials, or any other area where government expertise and AI capability need to work together.


The Balance That Matters

Every AI safety decision involves a tradeoff between two failure modes. Tighten restrictions too much and the AI becomes unreliable for legitimate users — a student studying nuclear physics gets refused, a researcher analyzing nonproliferation policy gets blocked, a journalist asking about reactor safety gets stonewalled. Loosen restrictions too much and the AI becomes a resource for people seeking to cause harm.

The nuclear classifier's preliminary results, 94.8% detection with zero false positives on synthetic prompts, suggest this system lands closer to the right balance than most safety tools do at launch. Early real-world deployment indicates that it continues to work on genuine traffic, with the hierarchical summarization layer resolving the false positives that the classifier alone produces.

The work is not finished. The classifier currently monitors a percentage of Claude traffic, not all of it. Real-world conversations continue to surface patterns that synthetic data did not anticipate. Each deployment cycle produces new information that feeds back into refinement.
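The source does not say how the monitored share of traffic is selected. One common pattern for partial rollouts, shown here purely as an assumption, is deterministic hash bucketing, so that a given conversation is consistently either in or out of the monitored set. The function name `in_rollout` and the ID format are hypothetical.

```python
import hashlib


def in_rollout(conversation_id: str, percent: float) -> bool:
    """Return True if this conversation falls inside the monitored percentage."""
    digest = hashlib.sha256(conversation_id.encode("utf-8")).digest()
    # Map the ID to one of 10,000 buckets, giving 0.01% granularity.
    bucket = int.from_bytes(digest[:8], "big") % 10_000
    return bucket < percent * 100


# Hypothetical IDs: measure how many land in a 25% rollout.
ids = [f"conversation-{i}" for i in range(10_000)]
monitored = sum(in_rollout(cid, percent=25.0) for cid in ids)
print(monitored / len(ids))
```

Because the decision depends only on the ID, raising the percentage later adds conversations to the monitored set without dropping any that were already in it.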

But the direction is clear, the method is documented, and the results are real enough to share.


Final Takeaway

An AI classifier that catches nuclear weapons queries with 96.2% accuracy — built through a year of classified red teaming, a creative workaround for a seemingly impossible data-sharing problem, and a validation cycle that proved the system works even when tested by people who did not know it was there — is a meaningful milestone in AI safety.

More significant than the specific classifier, though, is what produced it. A national security agency and a private AI company found a way to combine what each did best, worked through the institutional and legal friction that usually prevents this kind of collaboration, and produced something neither could have built alone.

That model — government expertise plus industry capability, connected by synthetic data and iterative validation — is the part Anthropic wants the rest of the AI industry to adopt. Nuclear technology has been dual-use for eighty years. AI systems powerful enough to assist with it have existed for less than five. The window for building the right safeguards is open, and this partnership demonstrates it can be done.

Original Source

This analysis is based on reporting from Anthropic.
