The Express Gazette
Friday, December 26, 2025

AI Advances in Science Tested by OpenAI Benchmark; GPT-5.2 Tops FrontierScience Scores

The FrontierScience benchmark probes AI reasoning in physics, chemistry, and biology with two difficulty tiers, highlighting progress and persistent limits as researchers weigh future research roles for large language models.

Technology & AI

A new benchmark aimed at testing AI on scientific reasoning shows steady progress in the field, with OpenAI’s GPT-5.2 posting the top scores on the latest FrontierScience evaluation. The benchmark comprises two tiers of questions drawn from physics, chemistry, and biology: an Olympiad-style tier designed to probe high-level problem solving, and a more demanding Research tier built from prompts written by Ph.D. scientists to examine open-ended reasoning and the ability to support real-world research. In the publicly reported results, GPT-5.2 scored 77.1% on the Olympiad tier and 25.3% on the Research tier, a gap that reflects how hard it remains to translate raw reasoning ability into broadly applicable scientific insight. Supporters say the improvement trajectory over the past year, driven in part by reinforcement learning and enhanced reasoning capabilities, suggests AI is rapidly encroaching on tasks scientists have long treated as human work, at least in the realm of computation and textual reasoning.

The benchmark is meant to illuminate how far current AI can extend beyond straightforward recall into structured problem solving in science. The questions run the gamut from physics and chemistry to biology and materials science. Some prompts resemble real-world research planning, including requests for theoretical derivations or analyses that would normally require weeks of work from a team of researchers. In one sample prompt, researchers asked for a derivation of electrostatic wave modes in plasma, while another challenged the model to reason about meso-nitrogen atoms in nickel(II) phthalocyanine, a topic that would typically demand extensive specialized knowledge. Producing high-fidelity solutions to such prompts can require days of simulation and heavy computation, according to participants.
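As an illustration only, and not a question taken from the benchmark itself, the kind of result such a plasma prompt points toward is the textbook dispersion relation for electrostatic (Langmuir) waves in a warm, unmagnetized plasma, obtained by linearizing the electron fluid and Poisson equations:

    \omega^{2} = \omega_{pe}^{2} + 3\,k^{2} v_{\mathrm{th},e}^{2},
    \qquad
    \omega_{pe}^{2} = \frac{n_{e} e^{2}}{\varepsilon_{0} m_{e}},
    \qquad
    v_{\mathrm{th},e}^{2} = \frac{k_{B} T_{e}}{m_{e}}

Here \omega_{pe} is the electron plasma frequency and v_{\mathrm{th},e} the electron thermal speed; a full answer would be expected to derive a relation of this kind from first principles rather than quote it.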

While the numbers are encouraging for AI researchers, FrontierScience organizers stress that the benchmark captures only a slice of scientific work. The questions are text-only, so they do not test an AI’s ability to perform experiments, analyze images from laboratory instruments, or interpret complex video data. The sample size is also comparatively small, about 100 Olympiad questions and 60 Research questions, which makes direct comparisons among closely performing models statistically shaky and leaves human baselines to shape how the scores are interpreted. “I expect the benchmark to be highly correlated with existing work … and not that informative about when the models will be actually useful to assist research, but it’s very hard to do otherwise with a benchmark,” Jaime Sevilla, director of Epoch AI, said in an email. Still, many participants view FrontierScience as a meaningful addition to the benchmarking ecosystem, especially as it begins to show how AI capabilities might carry over to real discovery tasks.
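For a rough sense of why the small question count matters, the sketch below is a back-of-the-envelope estimate (a hypothetical illustration, assuming each question is scored independently as pass/fail) of the sampling error on a benchmark score:

    import math

    def score_standard_error(score: float, n_questions: int) -> float:
        """Approximate sampling error of a benchmark score, treating each
        question as an independent pass/fail trial (binomial model)."""
        return math.sqrt(score * (1.0 - score) / n_questions)

    # Figures from the article: roughly 100 Olympiad questions, a 77.1% score.
    se = score_standard_error(0.771, 100)
    print(f"Standard error: {se:.1%}")  # about 4 percentage points

On that rough model, two systems separated by only a few percentage points on the Olympiad tier are hard to rank with confidence.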

These developments sit against ongoing debates about the proper role of AI in science. Assembling benchmarks for highly specialized fields requires domain-expert input, and the process is labor-intensive and costly. External firms that supply experts to design questions and grading rubrics, such as Mercor and Surge AI, are central to this ecosystem, calibrating how difficult a problem should be and how model responses should be judged. Edwin Chen, founder and CEO of Surge AI, has argued that the long-term payoff could be training AI to collaborate with humans on breakthroughs like proving complex conjectures or validating new hypotheses. Yet even as AI makes strides, many in the field remain cautious about how broadly these gains will apply.

Beyond benchmarks, AI already touches science in concrete ways. Google DeepMind’s AlphaFold has predicted more than 200 million protein structures, a feat that would have taken far longer through traditional experimental means. Other efforts aim to simulate and control plasma behavior in fusion devices or to generate high-resolution weather forecasts. Yet, as researchers note, AlphaFold reveals protein structure but not necessarily the electronic properties or reaction dynamics that scientists also study. The frontier remains narrow in some respects, even as general-purpose language models show promise in math, coding, and data analysis. One prominent example cited by researchers is a problem that OpenAI researchers and mathematician Sebastien Bubeck tackled with GPT-5: given extended “thinking time,” the model identified a previously missed identity that resolved a long-standing puzzle, illustrating the potential for AI to contribute novel insights when it is allowed to work through problems unguided.

Still, skepticism remains. Some prominent scientists warn that the volume of output from large language models can overwhelm traditional channels such as journals, and that not all AI-generated results will be reliable. Carlo Rovelli, a theoretical physicist and editor at Foundations of Physics, has criticized the noise created by AI-assisted submissions and cautions that many outputs may reflect conversational fluency more than solid science. Others acknowledge tangible benefits in coding and automation; a U.K. chemistry professor, for example, reported substantial time savings in coding tasks thanks to advances in LLMs, while noting that the reliability of automated hypotheses is not yet assured. As the field evolves, researchers are weighing the trade-offs between speed, accuracy, and the kinds of reasoning where humans still outperform machines.

The trajectory is clear: if AI performance on structured scientific reasoning continues to improve, researchers anticipate more capable and consistent AI collaborators in the lab and in the field. But many note that a cautious, incremental approach remains prudent. While some researchers describe feelings of “loss” or mixed emotions about the pace of progress, others emphasize pragmatic uses, such as AI-assisted literature review, hypothesis generation, and data analysis, that could accelerate discovery without replacing human judgment. For now, FrontierScience offers a snapshot of where AI stands on the path toward versatile scientific assistants, while the broader scientific community continues to test, critique, and refine how best to harness these tools.

