Proceedings of the National Academy of Sciences (2026)
Preview abstract
Large Language Models (LLMs) show great promise as a powerful tool for scientific literature exploration. However, their effectiveness in providing scientifically accurate and comprehensive answers to complex questions within specialized domains remains an active area of research. This work evaluates the performance of six different LLM-based systems for answering scientific literature questions, including commercially available closed models and a custom retrieval-augmented generation (RAG) system capable of retrieving images alongside text. We conduct a rigorous expert evaluation of the systems in the domain of high-temperature cuprate superconductors, a research area that involves material science, experimental physics, computation, and theoretical physics. We use an expert-curated database of 1726 scientific papers and a set of 67 expert-formulated questions. The evaluation employs a multi-faceted rubric assessing balanced perspectives, factual comprehensiveness, succinctness, evidentiary support, and image relevance. Our results demonstrate that RAG-based systems, powered by curated data and multimodal retrieval, outperform existing closed models across key metrics, particularly in providing comprehensive and well-supported answers, and in retrieving relevant visual information. This study provides valuable insights into designing and evaluating specialized scientific literature understanding systems, particularly with expert involvement, while also highlighting the importance of rich, domain-specific data in such systems.View details
Preview abstract
This paper demonstrates that artificial intelligence can accelerate mathematical discovery by autonomously solving an open problem in theoretical physics. We present a neuro-symbolic system, combining the Gemini Deep Think large language model with a systematic Tree Search (TS) framework and automated numerical feedback, that successfully derived novel, exact analytical solutions for the power spectrum of gravitational radiation emitted by cosmic strings. Specifically, the agent evaluated the core integral for arbitrary loop geometries, directly improving upon recent AI-assisted attempts that only yielded partial asymptotic solutions. To substantiate our methodological claims regarding AI-accelerated discovery and to ensure transparency, we detail system prompts, search constraints, and intermittent feedback loops that guided the model. The agent identified a suite of 6 different analytical methods, the most elegant of which expands the kernel in Gegenbauer polynomials to naturally absorb the integrand's singularities. The methods lead to an asymptotic result for at large that both agrees with numerical results and also connects to the continuous Feynman parameterization of Quantum Field Theory. We detail both the algorithmic methodology that enabled this discovery and the resulting mathematical derivations.View details
Preview abstract
The cycle of scientific discovery is frequently bottlenecked by the slow, manual creation of software to support computational experiments. To address this, we present Empirical Research Assistance (ERA), an AI system that creates expert-level scientific software whose goal is to maximize a quality metric. The system uses a Large Language Model (LLM) and Tree Search (TS) to systematically improve the quality metric and intelligently navigate the large space of possible solutions. ERA achieves expert-level results when it explores and integrates complex research ideas from external sources. The effectiveness of tree search is demonstrated across a diverse range of tasks. In bioinformatics, ERA discovered 40 novel methods for single-cell data analysis that outperformed the top human-developed methods on a public leaderboard. In epidemiology, ERA generated 14 models that outperformed the CDC ensemble and all other individual models for forecasting COVID-19 hospitalizations. ERA also produced expert-level software for geospatial analysis, neural activity prediction in zebrafish, and numerical solution of integrals, and a novel rule-based construction for time series forecasting. By devising and implementing novel solutions to diverse tasks, ERA represents a significant step towards accelerating scientific progress.
Keywords: Tree Search, Generative AI, Scorable Scientific Tasks, Empirical SoftwareView details
Preview abstract
Large language models (LLMs) have demonstrated abilities to perform complex tasks in multiple domains, including mathematical and scientific reasoning. We demonstrate that with carefully designed prompts, LLMs can accurately carry out key calculations in research papers in theoretical physics. We focus on a broadly-used approximation method in quantum physics: the Hartree-Fock method, requiring an analytic multi-step calculation deriving approximate Hamiltonian and corresponding self-consistency equations. To carry out the calculations using LLMs, we design multi-step prompt templates that break down the analytic calculation into standardized steps with placeholders for problem-specific information. We evaluate GPT-4’s performance in executing the calculation for 15 papers from the past decade, demonstrating that, with the correction of intermediate steps, it can correctly derive the final Hartree-Fock Hamiltonian in 13 cases. Aggregating across all research papers, we find an average score of 87.5 (out of 100) on the execution of individual calculation steps. We further use LLMs to mitigate the two primary bottlenecks in this evaluation process: (i) extracting information from papers to fill in templates and (ii) automatic scoring of the calculation steps, demonstrating good results in both cases.
View details
Preview abstract
This study introduces a unified computational framework connecting acoustic, speech and word-level linguistic structures to study the neural basis of everyday conversations in the human brain. We used electrocorticography to record neural signals across 100 h of speech production and comprehension as participants engaged in open-ended real-life conversations. We extracted low-level acoustic, mid-level speech and contextual word embeddings from a multimodal speech-to-text model (Whisper). We developed encoding models that linearly map these embeddings onto brain activity during speech production and comprehension. Remarkably, this model accurately predicts neural activity at each level of the language processing hierarchy across hours of new conversations not used in training the model. The internal processing hierarchy in the model is aligned with the cortical hierarchy for speech and language processing, where sensory and motor regions better align with the model’s speech embeddings, and higher-level language areas better align with the model’s language embeddings. The Whisper model captures the temporal sequence of language-to-speech encoding before word articulation (speech production) and speech-to-language encoding post articulation (speech comprehension). The embeddings learned by this model outperform symbolic models in capturing neural activity supporting natural speech and language. These findings support a paradigm shift towards unified computational models that capture the entire processing hierarchy for speech comprehension and production in real-world conversations.View details
International Conference on Learning Representations (ICLR) (2025)
Preview abstract
The core of the scientific problem-solving process involves synthesizing information while applying expert knowledge. Large Language Models (LLMs) have the potential to accelerate this process due to their extensive knowledge across a variety of domains. Recent advancements have also made it possible for LLMs to handle very long "in-context" content. However, existing evaluations of long-context LLMs have focused on assessing their ability to summarize or retrieve information within the given context, primarily in generalist tasks that do not require deep scientific expertise. To facilitate analogous assessments of domain-specific tasks, we introduce the scientific long-Context Understanding and Reasoning Inference Evaluations (CURIE) benchmark. This benchmark provides a set of 8 challenging tasks, derived from around 250 scientific research papers, requiring domain expertise, comprehension of long in-context information, and multi-step reasoning that tests the ability of LLMs to assist scientists in realistic workflows. Tasks in CURIE have been collected from experts in six disciplines - materials science, theoretical condensed matter physics, quantum computing, geospatial analysis, biodiversity, and protein sequencing - covering both experimental and theoretical workflows in science. We evaluate a range of closed and open LLMs on these tasks. Additionally, we propose strategies for task decomposition, which allow for a more nuanced evaluation of the models and facilitate staged multi-step assessments. We hope that insights gained from CURIE can guide the future development of LLMs.View details
Proceedings of the 1st Workshop on AI and Scientific Discovery: Directions and Opportunities, Association for Computational Linguistics (2025), pp. 31-45
Preview abstract
We present components of an AI-assisted academic writing system including citation recommendation and introduction writing. The system recommends citations by considering the user’s current document context to provide relevant suggestions. It generates introductions in a structured fashion, situating the contributions of the research relative to prior work. We demonstrate the effectiveness of the components through quantitative evaluations. Finally, the paper presents qualitative research exploring how researchers incorporate citations into their writing workflows. Our findings indicate that there is demand for precise AI-assisted writing systems and simple, effective methods for meeting those needs.View details
Preview abstract
Building precise simulations of the real world and using numerical methods to solve quantitative problems is an essential task in engineering and physics. We present FEABench, a benchmark to evaluate the ability of language models (LLMs) and LLM agents to simulate and solve physics, mathematics and engineering problems using finite element analysis (FEA) software. We introduce a multipronged evaluation scheme to investigate the ability of LLMs to solve these problems using COMSOL Multiphysics. We further design an LLM agent equipped with the ability to interact with the software through its Application Programming Interface (API), examine its outputs and use tools to improve its solution over several iterations. Our best performing strategy generates executable API calls 88% of the time. However, this benchmark still proves to be challenging enough that the LLMs and agents we tested were not able to completely and correctly solve any problem. LLMs that can successfully interact with and operate FEA software to solve problems such as those in our benchmark would push the frontiers of automation in engineering. Acquiring this capability would augment LLMs' reasoning skills with the precision of numerical solvers and advance the development of autonomous systems that can tackle complex problems in the real world.View details
Preview abstract
The ionosphere is a layer of weakly ionized plasma bathed in Earth’s geomagnetic field extending about 50–1,500 kilometres above Earth1. The ionospheric total electron content varies in response to Earth’s space environment, interfering with Global Satellite Navigation System (GNSS) signals, resulting in one of the largest sources of error for position, navigation and timing services2. Networks of high-quality ground-based GNSS stations provide maps of ionospheric total electron content to correct these errors, but large spatiotemporal gaps in data from these stations mean that these maps may contain errors3. Here we demonstrate that a distributed network of noisy sensors—in the form of millions of Android phones—can fill in many of these gaps and double the measurement coverage, providing an accurate picture of the ionosphere in areas of the world underserved by conventional infrastructure. Using smartphone measurements, we resolve features such as plasma bubbles over India and South America, solar-storm-enhanced density over North America and a mid-latitude ionospheric trough over Europe. We also show that the resulting ionosphere maps can improve location accuracy, which is our primary aim. This work demonstrates the potential of using a large distributed network of smartphones as a powerful scientific instrument for monitoring Earth.View details
Preview abstract
Accelerating text input in augmentative and alternative communication (AAC) is a long-standing area of research with bearings on the quality of life in individuals with profound motor impairments. Recent advances in large language models (LLMs) pose opportunities for re-thinking strategies for enhanced text entry in AAC. In this paper, we present SpeakFaster, consisting of an LLM-powered user interface for text entry in a highly-abbreviated form, saving 57% more motor actions than traditional predictive keyboards in offline simulation. A pilot study on a mobile device with 19 non-AAC participants demonstrated motor savings in line with simulation and relatively small changes in typing speed. Lab and field testing on two eye-gaze AAC users with amyotrophic lateral sclerosis demonstrated text-entry rates 29–60% above baselines, due to significant saving of expensive keystrokes based on LLM predictions. These findings form a foundation for further exploration of LLM-assisted text entry in AAC and other user interfaces.View details