Most AI benchmarks ask models to recall facts or answer structured questions. LifeSciBench asks them to do what researchers actually do — interpret incomplete evidence, pressure-test clinical data, and make decisions under uncertainty. The results reveal both real progress and significant gaps.
Introduction
There is a version of AI benchmarking that has been running for years: give a model a multiple-choice question, check whether it picked the right letter, count how many it got right, declare a winner. This approach produces clean numbers and clear leaderboards. It also tells you very little about whether an AI system is actually useful to a working scientist.
A researcher preparing for an FDA meeting does not answer multiple-choice questions. They interpret a package of clinical data, identify where the evidence fails to support the conclusions, anticipate regulatory objections, and recommend specific additional studies needed to close the gaps. A researcher troubleshooting a failed assay reads gel images, cross-references protocol notes, considers experimental variables, and reasons through what went wrong and how to fix it.
OpenAI's LifeSciBench is designed to measure exactly this kind of capability: not biology knowledge in isolation, but the full arc of judgment, reasoning, and communication that practicing life scientists exercise every day. It was built by 173 expert scientists with PhD training and direct industry experience and validated by 453 independent expert reviewers, which makes it one of the more rigorously constructed AI benchmarks for a single scientific domain.
The results show that frontier AI has made real progress. They also show, with uncomfortable specificity, exactly where it still falls short.
Quick Summary
| Metric | Value |
|---|---|
| Total tasks | 750 |
| Expert scientist contributors | 173 |
| Task artifacts included | 1,062 |
| Total rubric criteria | 19,020 |
| Average rubric criteria per task | 25 |
| Independent expert reviewers | 453 |
| Reviewer PhD rate | 97% |
| Average reviewer field experience | 12 years |
| Tasks requiring multi-step reasoning | 79% |
| Tasks requiring artifact interpretation | 53% |
| Best model overall pass rate | 36.1% (GPT-Rosalind) |
Why Existing Benchmarks Are Not Enough
The problem LifeSciBench addresses is not that existing life science benchmarks are poorly designed. Many are well constructed for what they measure. The issue is what they do and do not measure.
Most life science evaluations concentrate on narrow domains or isolated skills. Questions tend to have structured formats and clean reference answers. These characteristics make benchmarks easy to administer and grade, but they create a gap between tested capability and real-world usefulness. A model that scores well on a fact-recall biology test has demonstrated memory and pattern recognition. It has not demonstrated whether it can contribute to a drug discovery program.
Real research tasks look different. A scientist evaluating a gene therapy clinical package does not ask "which antibody is used to detect dystrophin?" They ask "does this antibody actually measure what we think it measures, and what would a skeptical FDA reviewer say about our reliance on it?" A researcher analyzing a failed experiment does not ask "what is the melting temperature of this primer?" They ask "which of the eight variables we changed could explain this unexpected result, and how do we isolate it?"
LifeSciBench was designed around the second category — the kind of scientific judgment that determines whether an AI is genuinely useful to a research team or merely able to pass a quiz.
How LifeSciBench Was Built
The People Behind It
Every task in LifeSciBench was written by a practicing life scientist. The 173 contributors all held PhD-level training and had direct experience advancing drug discovery programs in biotech or pharmaceutical settings. These were not academics writing hypothetical scenarios — they were people who had spent careers doing exactly the kind of work the benchmark is designed to evaluate.
Tasks were structured to resemble a request a scientist might make of a knowledgeable collaborator: a scientific prompt, relevant context or attached files, and a free-response answer. The tasks avoided multiple-choice and other constrained formats, instead requiring models to produce the kind of open-ended expert response that a research team would actually use.
The Revision Process
No task entered the benchmark after a single draft. Each submitted task could undergo as many revision cycles as needed before acceptance — there was no cap on the number of rounds. Accepted tasks averaged six self-directed automated review cycles before reaching human review, then completed at least two rounds of expert evaluation. The standard for acceptance was either a verifiable correct answer or strong expert consensus — defined as at least 90% agreement among domain-relevant reviewers. Tasks that did not reach that threshold were revised further or rejected.
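The acceptance rule described above can be sketched in a few lines. This is an illustration only: the function name, the representation of reviewer votes, and the handling of an empty reviewer panel are assumptions, and only the 90% threshold comes from the text.

```python
def accept_task(has_verifiable_answer: bool, votes: list[bool], min_percent: int = 90) -> bool:
    """Hypothetical acceptance rule: a verifiable answer OR strong reviewer consensus.

    votes holds one True/False agreement per domain-relevant reviewer.
    """
    if has_verifiable_answer:
        return True
    if not votes:
        # Assumption: with no reviewers, consensus cannot be established.
        return False
    agree = sum(votes)
    # Integer comparison avoids floating-point edge cases right at the threshold.
    return agree * 100 >= min_percent * len(votes)


print(accept_task(False, [True] * 9 + [False]))      # 9 of 10 agree: meets 90%
print(accept_task(False, [True] * 8 + [False] * 2))  # 8 of 10 agree: falls short
```

A task that fails this check would go back for revision or be rejected, as the text describes.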
The Rubrics
This is where LifeSciBench most significantly differs from standard benchmarks. Each task is graded not by checking a final answer but by evaluating the reasoning, evidence handling, caveats, calculations, and conclusions that lead to it.
The 750 tasks collectively contain 19,020 rubric criteria — an average of 25 per task. A response that reaches the correct high-level conclusion but overlooks a key assay limitation is marked incomplete. A response that fails to fully solve a task but contains high-quality partial reasoning earns partial credit on the relevant criteria. This design reflects how scientific work is actually evaluated: a research update that gets the conclusion right but misses a critical confound is not a good research update.
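To make the difference between partial credit and a pass concrete, here is a minimal sketch of rubric-style grading. The source does not describe criterion weights or the exact pass rule, so the equal weights, the `threshold` parameter, and all names below are assumptions made purely for illustration.

```python
from dataclasses import dataclass


@dataclass
class Criterion:
    description: str
    points: float  # hypothetical weight; the source gives no weighting scheme
    met: bool      # grader's judgment for one response


def rubric_fraction(criteria):
    """Fraction of available rubric points the response earned."""
    total = sum(c.points for c in criteria)
    earned = sum(c.points for c in criteria if c.met)
    return earned / total if total else 0.0


def passes(criteria, threshold=1.0):
    """Hypothetical pass rule: the earned fraction must reach the threshold."""
    return rubric_fraction(criteria) >= threshold


# A response that gets the conclusion right but overlooks a key assay limitation.
response = [
    Criterion("Reaches the correct high-level conclusion", 1.0, True),
    Criterion("Flags the key assay limitation", 1.0, False),
    Criterion("Recommends a follow-up study", 1.0, True),
    Criterion("States what cannot be concluded", 1.0, True),
]

print(rubric_fraction(response))  # 0.75
print(passes(response))           # False
```

Under these assumptions the response keeps three quarters of the credit but still does not pass, which mirrors the "correct conclusion, missed limitation" case described above.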
The Validation
Four hundred and fifty-three independent expert reviewers — none of whom were involved in writing the tasks — evaluated the benchmark. Of those reviewers, 97% held a PhD or equivalent doctorate, with an average of 12 years of field experience and 14 peer-reviewed publications. Eighty-eight percent had received at least one research award or fellowship.
Each reviewer assessed tasks across four dimensions:
| Validation Category | Strongly Agree | Agree (Overall) |
|---|---|---|
| Reflects realistic real-world life science work | 90.4% | 98.3% |
| Tests the right scientific reasoning and domain skills | 86.4% | 98.1% |
| Scientifically grounded and answerable | 77.1% | 96.5% |
| Overall strong evaluation task | 79.1% | 96.6% |
Overall agreement exceeded 96% in every category, and strong agreement ranged from 77.1% to 90.4%. That level of validator consensus lends the benchmark's construction considerable credibility.
What LifeSciBench Covers
Seven Scientific Workflows
The benchmark's taxonomy emerged from surveys of practicing life scientists about their most frequently used workflows in applied research settings. The seven resulting categories cover the full span of research activity:
Evidence Handling — Extracting, reconciling, and auditing scientific evidence from published papers, experimental figures, data tables, and lab records. A model must identify not just what a paper reports but whether the reporting methodology supports the conclusions drawn.
Analysis — Interpreting experimental data, including complex figures and multimodal files, to reach scientifically defensible conclusions.
Design, Optimization, and Prediction — Designing experiments, optimizing protocols, and predicting outcomes from incomplete information — one of the hardest categories in the benchmark.
Scientific Reasoning — Multi-step logical reasoning through ambiguous or conflicting evidence toward a research decision.
Validation and Operations — Troubleshooting assays, evaluating protocols, and managing the practical challenges of laboratory work.
Translation — The bench-to-bedside process: connecting preclinical findings to clinical implications, evaluating translational risk, and making go or no-go recommendations for drug development programs.
Scientific Communication — Organizing evidence and producing expert-facing explanations, summaries, and presentations that would be genuinely useful to a research audience.
Task Complexity
The numbers here matter for understanding what LifeSciBench actually requires of AI systems. Seventy-nine percent of tasks require multiple reasoning or decision-making steps, with an average of four steps per task. This is not a benchmark of single-inference questions.
More demanding still: 53% of tasks require models to interpret or synthesize information from at least one attached artifact. Those artifacts include figures, PDFs, data tables, genetic sequence files, molecular structure files, chemical data files, and web references. A model that can only reason over text is immediately disadvantaged on more than half the benchmark.
A Real Example: What LifeSciBench Looks Like in Practice
The benchmark includes an example task from the Evidence Handling workflow that illustrates the depth of reasoning required.
The scenario: a team preparing for an FDA meeting on an AAV9-based micro-dystrophin gene therapy for Duchenne muscular dystrophy wants a hard-nosed critique of whether their clinical data package supports accelerated approval using micro-dystrophin expression as a surrogate endpoint.
The package includes Western blot data showing 38% of healthy-control dystrophin expression in post-treatment biopsies, immunofluorescence results showing sarcolemmal signal in 75–95% of fibers, a 48-week functional score improvement of +1.4 NSAA points versus an external natural history cohort, and a safety profile that includes transaminitis in 8 of 12 patients and one case of myocarditis.
A complete expert response identifies twelve distinct failure modes in this package — each with a specific scientific reason it fails to support the conclusion and a specific recommendation for what additional data or analysis would close the gap. Among those failures: the antibody used for Western blot quantification detects an epitope shared by both the transgene and endogenous dystrophin, making it impossible to distinguish micro-dystrophin from residual full-length protein. The immunofluorescence antibody targets a C-terminal region that the truncated 138 kDa construct does not contain, meaning the signal likely reflects revertant fibers rather than transgene expression. The functional comparison against an external natural history cohort is not a randomized controlled design, and the +1.4 NSAA improvement falls within test-retest variability for the 4-to-7 age group. The construct itself deletes the spectrin repeats that contain nNOS-binding sites, creating a mechanistic ceiling on functional rescue that expression level cannot address.
This is what a research-level AI evaluation looks like. Not "what is micro-dystrophin?" but "is this specific clinical package adequate, and exactly where does it fail?"
How Frontier Models Performed
The Overall Picture
Five models appear in the benchmark results: GPT-Rosalind, GPT-5.5, Gemini 3.1 Pro, GPT-5.4, and Grok 4.3. GPT-Rosalind leads the field with an overall pass rate of 36.1%, compared to 25.7% for GPT-5.5, a difference of 10.4 percentage points.
The absolute numbers are worth sitting with. The best-performing model passes just over one-third of tasks. This is not a benchmark that frontier AI has solved. It is a benchmark that reveals how much runway remains.
Where AI Is Making Progress
The clearest gains appear in the two categories that rely most heavily on language and synthesis rather than precise calculation or artifact parsing.
Scientific Communication (organizing evidence into expert-facing explanations) rose from a 56.3% pass rate on GPT-5.5 to 71.1% on GPT-Rosalind. The category is small at nine tasks, so this result requires cautious interpretation, but the direction is consistent with the broader pattern: AI systems are improving fastest where the task rewards structured articulation of well-bounded evidence.
Translation (connecting preclinical findings to clinical implications) improved from 36.8% to 57.7%, a gain of 20.9 percentage points that suggests frontier models are developing a meaningfully better ability to reason across the bench-to-bedside gap. For drug development teams evaluating whether to use AI to assist with translational risk assessment, this trend is worth watching.
On tasks requiring expert-useful or actionable outputs, GPT-Rosalind scores 44.7% versus GPT-5.5's 29.1%. On tasks requiring uncertainty and caveat handling — identifying the limits of evidence and flagging what conclusions cannot be drawn — GPT-Rosalind scores 44.8% versus 29.3%. The pattern holds: AI systems perform best when the task has a clear evidence boundary and calls for structured scientific judgment rather than exact numerical or structural outputs.
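The gains quoted in this section are easier to compare side by side. The short sketch below recomputes them from the pass rates given above; the dictionary layout and function names are illustrative choices, not part of the benchmark.

```python
# Pass rates (%) quoted in the article: (GPT-5.5, GPT-Rosalind).
pass_rates = {
    "Scientific Communication": (56.3, 71.1),
    "Translation": (36.8, 57.7),
    "Expert-useful or actionable outputs": (29.1, 44.7),
    "Uncertainty and caveat handling": (29.3, 44.8),
}


def point_gain(before: float, after: float) -> float:
    """Absolute gain in percentage points, rounded to one decimal."""
    return round(after - before, 1)


def relative_gain(before: float, after: float) -> float:
    """Gain relative to the starting pass rate, as a percentage."""
    return round(100 * (after - before) / before, 1)


for name, (before, after) in pass_rates.items():
    print(f"{name}: +{point_gain(before, after)} points, "
          f"+{relative_gain(before, after)}% relative")
```

Reporting both figures helps avoid confusion between a change in percentage points and a relative percent change, which can differ substantially when the starting rate is low.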
Where AI Still Falls Short
The gaps are equally consistent and worth examining in detail.
Design, Optimization, and Prediction sits at a 30.7% pass rate for GPT-Rosalind — one of the hardest categories in the benchmark. Designing experiments and predicting outcomes from incomplete information requires the kind of generative scientific reasoning that current models have not mastered.
Analysis reaches only 30.3%. Despite improvements in language quality, models still struggle to extract, integrate, and reason over experimental data in ways that produce reliably correct analytical conclusions.
The artifact gap is the starkest finding. On text-only tasks, GPT-Rosalind achieves a 45.1% pass rate. On tasks that include artifacts (figures, sequence files, data tables, structural files), that drops to 28.1%, a 17-percentage-point decline once scientific data moves from prose into a file. GPT-5.5 shows the same pattern at a smaller scale: 29.9% text-only and 21.9% with artifacts, an 8-point decline. Because 53% of LifeSciBench tasks require artifact interpretation, frontier models are substantially weaker on more than half of the benchmark than their text-only performance would suggest.
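A quick arithmetic check shows how the artifact gap feeds into the headline numbers. The sketch below assumes that the text-only and with-artifact subsets split the benchmark 47% to 53%, using the article's figure for tasks requiring artifact interpretation; that split is an assumption, and the names are illustrative.

```python
ARTIFACT_SHARE = 0.53  # share of tasks requiring artifact interpretation (from the article)

# Pass rates (%) quoted in the article.
reported = {
    "GPT-Rosalind": {"text_only": 45.1, "with_artifacts": 28.1, "overall": 36.1},
    "GPT-5.5": {"text_only": 29.9, "with_artifacts": 21.9, "overall": 25.7},
}


def artifact_gap(rates):
    """Percentage-point drop from text-only tasks to tasks with artifacts."""
    return rates["text_only"] - rates["with_artifacts"]


def blended_rate(rates, artifact_share=ARTIFACT_SHARE):
    """Weighted average of the two subsets, assuming they partition the benchmark."""
    return (1 - artifact_share) * rates["text_only"] + artifact_share * rates["with_artifacts"]


for model, rates in reported.items():
    print(f"{model}: gap {artifact_gap(rates):.1f} points, "
          f"blended {blended_rate(rates):.1f}% vs reported {rates['overall']}%")
```

Under that assumption the blended figures land within a tenth of a point of the reported overall pass rates, so the weaker artifact subset accounts for much of the distance between text-only and overall performance. That reading is an inference from the quoted numbers, not something the source states.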
Exact scientific outputs expose specific brittleness. Numeric tasks, which require precise calculations, reach only 14.8% for GPT-Rosalind. Sequence and structure outputs reach 24.0%. Construct generation tasks sit at 27.3%, with little improvement over GPT-5.5. These failures are scientifically meaningful because many real research workflows need outputs that are exact enough to use directly, such as a CRISPR donor construct, an siRNA sequence, or a docking score. An approximately correct output is not usable in those settings.
The partial answer problem. On approximately 14% of tasks, models earn substantial rubric credit while still failing the exact-pass threshold. For GPT-Rosalind, 109 tasks had pass rates below 20% but still earned at least 50% of the available rubric reward. In these cases, models identify the relevant evidence and reach a plausible partial answer, but then miss a key constraint, apply evidence to the wrong step, leave a calculation incomplete, or fail to connect their reasoning to a final decision a scientist could act on. The gap between "partly right" and "scientifically useful" is where a significant portion of frontier AI failure lives.
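The "partly right" category can be expressed as a simple filter over per-task results. In this sketch the record fields, the idea that each task has a pass rate across repeated attempts, and the sample records are all assumptions; only the 20% and 50% cutoffs and the 109-task count come from the text.

```python
from dataclasses import dataclass


@dataclass
class TaskResult:
    task_id: str
    pass_rate: float      # fraction of attempts that met the pass threshold
    rubric_reward: float  # mean fraction of rubric credit earned


def near_misses(results, max_pass_rate=0.20, min_reward=0.50):
    """Tasks that rarely pass but still earn substantial rubric credit."""
    return [
        r for r in results
        if r.pass_rate < max_pass_rate and r.rubric_reward >= min_reward
    ]


# Made-up records for illustration only; these are not benchmark data.
results = [
    TaskResult("task-a", pass_rate=0.00, rubric_reward=0.62),
    TaskResult("task-b", pass_rate=0.75, rubric_reward=0.90),
    TaskResult("task-c", pass_rate=0.10, rubric_reward=0.35),
    TaskResult("task-d", pass_rate=0.15, rubric_reward=0.50),
]

print([r.task_id for r in near_misses(results)])  # ['task-a', 'task-d']

# Share of the 750-task benchmark represented by the 109 tasks cited above.
print(f"{109 / 750:.1%}")  # 14.5%
```

A filter like this is a useful way to find responses that look convincing on a rubric but would still need expert correction before use.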
What These Results Mean for Research Teams
The LifeSciBench results carry a few practical implications for scientists and organizations evaluating AI tools for research.
For tasks involving scientific synthesis, communication, and structured interpretation of text-based evidence, frontier models are showing genuine utility. A researcher who needs a first-pass critique of a literature review, a summary of translational risks from a set of preclinical papers, or a structured explanation of a mechanism for a non-expert audience is working in a domain where AI assistance is increasingly meaningful.
For tasks that require precise artifact interpretation, exact sequence or structure outputs, experimental design, or quantitative analysis, the current pass rates are low enough that AI output requires careful expert validation before it can influence a research decision. A 14.8% pass rate on numeric tasks is not a foundation for autonomous AI-assisted calculation in a drug discovery program.
The partial-answer finding matters practically: a model that earns 50% rubric credit on a task may still produce output that is useless or misleading in a research context. The fact that a response contains some high-quality reasoning does not guarantee that the decision it supports is correct.
What LifeSciBench Does Not Measure
The benchmark's creators are direct about its limitations, and those limitations are worth understanding.
LifeSciBench measures performance on self-contained tasks. Real research is not self-contained. Scientists gather new evidence, revise hypotheses, run follow-up experiments, adapt to unexpected results, and revisit decisions weeks or months after they were made. The iterative, temporally extended nature of research programs is outside what any task-level benchmark can capture.
Strong LifeSciBench performance should therefore be read as evidence of realistic task-level capability — not as a prediction of how much AI will actually accelerate a research program, reduce time-to-IND, or improve clinical success rates. Those outcomes depend on factors that unfold over time, across teams, in environments that no benchmark simulates.
What Comes Next
The explicit next step is connecting benchmark performance to live research environments. While LifeSciBench was developed with practicing scientists, measuring whether AI actually accelerates discovery requires studying model deployment in real research settings, over longer time horizons, and across the iterative reasoning and experimental feedback cycles that constitute actual scientific progress.
That transition — from benchmarked performance to measured real-world impact — is the work that makes LifeSciBench relevant beyond an academic exercise. It is also significantly harder to do well than building the benchmark itself.
Final Takeaway
LifeSciBench is among the most rigorous attempts yet to measure whether AI can contribute to life science research as it is actually practiced, rather than as a quiz subject. Seven hundred and fifty tasks written by 173 PhD scientists, graded against 19,020 rubric criteria, and validated by 453 independent expert reviewers with an average of 12 years of field experience: the construction alone is a significant undertaking.
The results tell a story in two parts. Frontier AI is improving meaningfully on the parts of research work that reward synthesis, structured judgment, and evidence communication — the translation workflow improving from 36.8% to 57.7% is a real gain. And frontier AI still struggles significantly on the parts that require exact numerical outputs, artifact interpretation, experimental design, and the kind of multi-step precision that makes a response directly usable in a lab.
A 36.1% overall pass rate from the best available model means that the majority of life science research tasks in this benchmark remain beyond what current AI can reliably handle. That is not a dismissal of progress — it is a clear-eyed map of where progress still needs to go.
