Xin Liu

Xin Liu is a Research Scientist at Google Consumer Health Research. He received his PhD in Computer Science from the University of Washington in 2023; his PhD research was supported by the Google PhD Fellowship. His work sits at the intersection of ubiquitous and mobile computing, machine learning, and health. His research studies how to enable mobile health and AI at scale, with a focus on building foundation models for sensor and consumer health data. More information: https://xliucs.github.io/
Authored Publications
    A personal health large language model for sleep and fitness coaching
    Anastasiya Belyaeva
    Zhun Yang
    Nick Furlotte
    Chace Lee
    Erik Schenck
    Yojan Patel
    Jian Cui
    Logan Schneider
    Robby Bryant
    Ryan Gomes
    Allen Jiang
    Roy Lee
    Javier Perez
    Jamie Rogers
    Cathy Speed
    Shyam Tailor
    Megan Walker
    Jeffrey Yu
    Tim Althoff
    Conor Heneghan
    Mark Malhotra
    Shwetak Patel
    Shravya Shetty
    Jiening Zhan
    Daniel McDuff
    Nature Medicine (2025)
Abstract: Although large language models (LLMs) show promise for clinical healthcare applications, their utility for personalized health monitoring using wearable device data remains underexplored. Here we introduce the Personal Health Large Language Model (PH-LLM), designed for applications in sleep and fitness. PH-LLM is a version of the Gemini LLM that was fine-tuned for text understanding and reasoning when applied to aggregated daily-resolution numerical sensor data. We created three benchmark datasets to assess multiple complementary aspects of sleep and fitness: expert domain knowledge, generation of personalized insights and recommendations, and prediction of self-reported sleep quality from longitudinal data. PH-LLM achieved scores that exceeded a sample of human experts on multiple-choice examinations in sleep medicine (79% versus 76%) and fitness (88% versus 71%). In a comprehensive evaluation involving 857 real-world case studies, PH-LLM performed similarly to human experts on fitness-related tasks and improved over the base Gemini model in providing personalized sleep insights. Finally, PH-LLM effectively predicted self-reported sleep quality using a multimodal encoding of wearable sensor data, further demonstrating its ability to contextualize wearable modalities. This work highlights the potential of LLMs to revolutionize personal health monitoring via tailored insights and predictions from wearable data, and provides datasets, rubrics, and benchmark performance to further accelerate personal health-related LLM research.
    RADAR: Benchmarking Language Models on Imperfect Tabular Data
    Ken Gu
    Kumar Ayush
    Hong Yu
    Zhihan Zhang
    Yuzhe Yang
    Shwetak Patel
    Max Xu
    Mark Malhotra
    Orson Xu
    Evelyn Zhang
    Tim Althoff
    2025
Abstract: Language models (LMs) are increasingly being deployed to perform autonomous data analyses, yet their data awareness, that is, the ability to recognize, reason over, and appropriately handle data artifacts such as missing values, outliers, and logical inconsistencies, remains under-explored. These artifacts are common in real-world tabular data and, if mishandled, can significantly compromise the validity of analytical conclusions. To address this gap, we present RADAR, a benchmark for systematically evaluating data awareness on tabular data. RADAR introduces programmatic perturbations for each unique query-table pair, enabling targeted evaluation of model behavior. RADAR comprises 2500 queries for data analysis across 55 datasets spanning 20 domains and 5 data awareness dimensions. In addition to evaluating artifact handling, RADAR systematically varies table size to study how reasoning performance scales with input length. In our evaluation, we identify fundamental gaps in LMs' ability to perform reliable, data-aware analyses. Designed to be flexible and extensible, RADAR supports diverse perturbation types and controllable table sizes, offering a valuable resource for advancing tabular reasoning.
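To make the idea of programmatic perturbations concrete, here is a minimal sketch of injecting two of the artifact classes the abstract names (missing values and outliers) into a tabular query's underlying data; the function, column names, and perturbation rates are illustrative assumptions, not RADAR's actual code.

```python
# Sketch only: illustrates programmatic table perturbations of the kind RADAR
# describes; all names and rates here are invented for illustration.
import numpy as np
import pandas as pd

def perturb_table(df: pd.DataFrame, kind: str, seed: int = 0) -> pd.DataFrame:
    """Return a copy of df with one class of data artifact injected."""
    rng = np.random.default_rng(seed)
    out = df.copy()
    numeric_cols = out.select_dtypes("number").columns
    if kind == "missing_values":
        # Blank out roughly 10% of numeric cells.
        mask = rng.random(out[numeric_cols].shape) < 0.10
        out[numeric_cols] = out[numeric_cols].mask(mask)
    elif kind == "outliers":
        # Inflate a few rows of one numeric column by 100x.
        col = numeric_cols[0]
        rows = rng.choice(out.index, size=max(1, len(out) // 20), replace=False)
        out.loc[rows, col] = out.loc[rows, col] * 100
    return out

# An analysis query can then be run against both the clean and the perturbed
# table, and the model's answers compared.
clean = pd.DataFrame({"steps": [8000, 9500, 7200, 10100],
                      "sleep_hours": [7.1, 6.8, 8.0, 7.5]})
perturbed = perturb_table(clean, kind="missing_values")
```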
    Passive Heart Rate Monitoring During Smartphone Use in Everyday Life
    Shun Liao
    Paolo Di Achille
    Jiang Wu
    Silviu Borac
    Jonathan Wang
    Eric Teasley
    Lawrence Cai
    Daniel McDuff
    Hao-Wei Su
    Brent Winslow
    Anupam Pathak
    Shwetak Patel
    Jim Taylor
    Jamie Rogers
    (2025)
Abstract: Resting heart rate (RHR) is an important biomarker of cardiovascular health and mortality, but tracking it longitudinally generally requires a wearable device, limiting its availability. We present PHRM, a deep learning system for passive heart rate (HR) and RHR measurement during ordinary smartphone use, using facial video-based photoplethysmography. Our system was developed using 225,773 videos from 495 participants and validated on 185,970 videos from 205 participants in laboratory and free-living conditions, the largest validation study of its kind. Compared to a reference electrocardiogram, PHRM achieved a mean absolute percentage error (MAPE) <10% for HR measurements across three skin tone groups (light, medium, and dark pigmentation); MAPE for each skin tone group was non-inferior versus the others. Daily RHR measured by PHRM had a mean absolute error <5 bpm compared to a wearable HR tracker and was associated with known risk factors. These results highlight the potential of smartphones to enable passive and equitable heart health monitoring.
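For reference, a minimal sketch of the two headline error metrics quoted in the abstract, MAPE against a reference ECG and MAE against a wearable tracker, computed on synthetic values; the numbers and function names are illustrative, not PHRM's evaluation pipeline.

```python
# Sketch only: the two error metrics quoted in the abstract, on synthetic data.
import numpy as np

def mape(pred: np.ndarray, ref: np.ndarray) -> float:
    """Mean absolute percentage error, e.g. predicted HR vs. reference ECG HR."""
    return float(np.mean(np.abs(pred - ref) / ref) * 100)

def mae(pred: np.ndarray, ref: np.ndarray) -> float:
    """Mean absolute error, e.g. daily RHR vs. a wearable HR tracker, in bpm."""
    return float(np.mean(np.abs(pred - ref)))

ecg_hr = np.array([62.0, 75.0, 88.0, 101.0])    # synthetic reference values
phone_hr = np.array([64.0, 73.0, 90.0, 99.0])   # synthetic predictions
print(mape(phone_hr, ecg_hr))  # abstract's target: <10% per skin tone group
print(mae(phone_hr, ecg_hr))   # abstract's target for daily RHR: <5 bpm
```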
    Scaling Wearable Foundation Models
    Girish Narayanswamy
    Kumar Ayush
    Yuzhe Yang
    Orson Xu
    Shun Liao
    Shyam Tailor
    Jake Sunshine
    Tim Althoff
    Shrikanth (Shri) Narayanan
    Jiening Zhan
    Mark Malhotra
    Shwetak Patel
    Samy Abdel-Ghaffar
    Daniel McDuff
    2025
Abstract: Wearable sensors have become ubiquitous thanks to a variety of health tracking features. The resulting continuous and longitudinal measurements from everyday life generate large volumes of data. However, making sense of these observations for scientific and actionable insights is non-trivial. Inspired by the empirical success of generative modeling, where large neural networks learn powerful representations from vast amounts of text, image, video, or audio data, we investigate the scaling properties of wearable sensor foundation models across compute, data, and model size. Using a dataset of up to 40 million hours of in-situ heart rate, heart rate variability, accelerometer, electrodermal activity, skin temperature, and altimeter per-minute data from over 165,000 people, we create LSM, a multimodal foundation model built on the largest wearable-signals dataset with the most extensive range of sensor modalities to date. Our results establish the scaling laws of LSM for tasks such as imputation, interpolation, and extrapolation across both time and sensor modalities. Moreover, we highlight how LSM enables sample-efficient downstream learning for tasks including exercise and activity recognition.
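As a rough illustration of what establishing a scaling law involves, the sketch below fits a power-law curve of loss versus compute on synthetic points; the constants and data are invented and are not LSM results.

```python
# Sketch only: fitting a power-law scaling curve (loss vs. compute), with
# synthetic points; not LSM code or measurements.
import numpy as np

compute = np.array([1e18, 1e19, 1e20, 1e21])   # training FLOPs (synthetic)
loss = np.array([0.42, 0.33, 0.27, 0.22])      # imputation loss (synthetic)

# Fit loss ~= a * compute^(-b) by linear regression in log-log space.
slope, intercept = np.polyfit(np.log(compute), np.log(loss), 1)
a, b = np.exp(intercept), -slope
print(f"loss ~= {a:.2f} * compute^(-{b:.3f})")

# Extrapolate to a larger compute budget under the fitted law.
print(a * (1e22) ** (-b))
```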
Abstract: Decomposing complex problems into simple subtasks, a crucial part of human-like natural planning, has recently been shown to significantly boost the performance of large language models (LLMs). However, leveraging such planning structures during post-training to boost the performance of smaller open-source LLMs remains underexplored. Motivated by this, we introduce Plan-Tuning, a unified post-training framework that (i) distills synthetic task decompositions (termed “planning trajectories”) from large-scale LLMs and (ii) fine-tunes smaller models via supervised and reinforcement-learning objectives designed to mimic these planning processes and improve complex reasoning. On the GSM8k and MATH benchmarks, plan-tuned models outperform strong baselines by an average of ~7%. Furthermore, plan-tuned models show better generalization on out-of-domain datasets, with average ~10% and ~12% performance improvements on OlympiadBench and AIME 2024, respectively. Our detailed analysis demonstrates how planning trajectories improve complex reasoning capabilities, showing that Plan-Tuning is an effective strategy for improving the task-specific performance of smaller LLMs.
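One plausible way a distilled planning trajectory could be packaged as a supervised fine-tuning record is sketched below; the field names and formatting are invented for illustration and are not the paper's actual data format.

```python
# Sketch only: a hypothetical fine-tuning record pairing a distilled plan with
# a solution; the schema is an assumption, not Plan-Tuning's data format.
import json

record = {
    "problem": "Natalia sold clips to 48 friends in April, and half as many in May. "
               "How many clips did she sell altogether?",
    "plan": [  # task decomposition distilled from a larger LLM
        "Compute the number of clips sold in May (half of April).",
        "Add the April and May totals.",
    ],
    "solution": "May: 48 / 2 = 24. Total: 48 + 24 = 72. Answer: 72",
}

# The smaller model would then be tuned to produce the plan before the solution.
print(json.dumps(record, indent=2))
```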
    The Anatomy of a Personal Health Agent
    Ahmed Metwally
    Ken Gu
    Jiening Zhan
    Kumar Ayush
    Hong Yu
    Amy Lee
    Qian He
    Zhihan Zhang
    Isaac Galatzer-Levy
    Xavi Prieto
    Andrew Barakat
    Ben Graef
    Yuzhe Yang
    Daniel McDuff
    Brent Winslow
    Shwetak Patel
    Girish Narayanswamy
    Conor Heneghan
    Max Xu
    Jacqueline Shreibati
    Mark Malhotra
    Orson Xu
    Tim Althoff
    Tony Faranesh
    Nova Hammerquist
    Vidya Srinivas
    arXiv (2025)
Abstract: Health is a fundamental pillar of human wellness, and the rapid advancements in large language models (LLMs) have driven the development of a new generation of health agents. However, solutions that address individuals' diverse needs in daily, non-clinical settings remain underexplored. In this work, we aim to build a comprehensive personal health assistant that is able to reason about multimodal data from everyday consumer devices and personal health records. To understand end users' needs when interacting with such an assistant, we conducted an in-depth analysis of query data from users, alongside qualitative insights from users and experts gathered through a user-centered design process. Based on these findings, we identified three major categories of consumer health needs, each of which is supported by a specialist subagent: (1) a data science agent that analyzes both personal and population-level time-series wearable and health record data to provide numerical health insights, (2) a health domain expert agent that integrates users' health and contextual data to generate accurate, personalized insights based on medical and contextual user knowledge, and (3) a health coach agent that synthesizes data insights, drives multi-turn user interactions and interactive goal setting, guides users with a specified psychological strategy, and tracks users' progress. Furthermore, we propose and develop a multi-agent framework, Personal Health Insight Agent Team (PHIAT), that enables dynamic, personalized interactions to address individual health needs. To evaluate the individual agents and the multi-agent system, we developed a set of N benchmark tasks and conducted both automated and human evaluations, involving hundreds of hours of evaluation from health experts and hundreds of hours of evaluation from end users. Our work establishes a strong foundation towards the vision of a personal health assistant accessible to everyone in the future and represents the most comprehensive evaluation of a consumer AI health agent to date.
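A minimal sketch of dispatching a user query to one of the three specialist subagents described in the abstract; the keyword-based router and type names are illustrative placeholders, not PHIAT's actual orchestration logic.

```python
# Sketch only: naive routing between the three subagent roles named in the
# abstract; the Agent protocol and keyword rules are invented for illustration.
from typing import Protocol

class Agent(Protocol):
    def answer(self, query: str) -> str: ...

def route(query: str, data_science: Agent, domain_expert: Agent, coach: Agent) -> str:
    q = query.lower()
    if any(k in q for k in ("trend", "average", "compare", "last month")):
        return data_science.answer(query)   # numerical insights from wearable/record data
    if any(k in q for k in ("why", "is it normal", "risk")):
        return domain_expert.answer(query)  # personalized, knowledge-grounded explanation
    return coach.answer(query)              # multi-turn goal setting and coaching
```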
Abstract: Scaling inference-time computation in Large Language Models (LLMs) dramatically improves their capabilities for solving complex problems. While test-time scaling has shown promise in many tasks such as code generation and mathematical reasoning, the integration of inference-time algorithms into multi-agent frameworks for planning and reasoning remains under-explored. To this end, we explore popular inference-time algorithms—Best of N, Tree of Thought (ToT), and REward BAlanced SEarch (REBASE)—together with a proposed feedback-driven refinement. Our feedback-driven refinement employs specialized agents: a constraint agent to enforce task instance-specific constraints, and a verifier agent to evaluate plan quality. Furthermore, we hypothesize that test-time scaling can be proportional to instance-level complexity. Thus, we propose an additional selection agent to dynamically optimize the algorithm choice. We evaluate our proposed approaches on four different benchmarks, i.e., NATURAL PLAN, GPQA, OlympiadBench, and DocFinQA. Experimental results show that our methods outperform strong baselines, achieving state-of-the-art results on NATURAL PLAN, OlympiadBench, and DocFinQA. Our key findings demonstrate that constraint-guided iterative refinement and algorithm selection improve both planning and downstream reasoning in LLMs.
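As a concrete anchor for one of the inference-time algorithms named above, here is a minimal sketch of Best-of-N selection guided by a verifier score; the generate and verify callables stand in for whatever model and verifier agent are used, and this is not the paper's implementation.

```python
# Sketch only: Best-of-N selection with a verifier score; `generate` and
# `verify` are placeholders for a model call and a verifier agent.
from typing import Callable

def best_of_n(prompt: str,
              generate: Callable[[str], str],
              verify: Callable[[str, str], float],
              n: int = 8) -> str:
    """Sample n candidate plans and return the one the verifier scores highest."""
    candidates = [generate(prompt) for _ in range(n)]
    return max(candidates, key=lambda c: verify(prompt, c))
```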
    A Scalable Framework for Evaluating Health Language Models
    Neil Mallinar
    Tony Faranesh
    Brent Winslow
    Nova Hammerquist
    Ben Graef
    Cathy Speed
    Mark Malhotra
    Shwetak Patel
    Xavi Prieto
    Daniel McDuff
    Ahmed Metwally
    (2025)
Abstract: Large language models (LLMs) have emerged as powerful tools for analyzing complex datasets. Recent studies demonstrate their potential to generate useful, personalized responses when provided with patient-specific health information that encompasses lifestyle, biomarkers, and context. As LLM-driven health applications are increasingly adopted, rigorous and efficient evaluation methodologies are crucial to ensure response quality across multiple dimensions, including accuracy, personalization, and safety. Current evaluation practices for open-ended text responses rely heavily on human experts. This approach introduces human factors, is often cost-prohibitive and labor-intensive, and hinders scalability, especially in complex domains like healthcare, where response assessment necessitates domain expertise and consideration of multifaceted patient data. In this work, we introduce Adaptive Precise Boolean rubrics: an evaluation framework that streamlines human and automated evaluation of open-ended questions by identifying gaps in model responses using a minimal set of targeted rubric questions. Our approach is based on recent work in more general evaluation settings that contrasts a smaller set of complex evaluation targets with a larger set of more precise, granular targets answerable with simple boolean responses. We validate this approach in metabolic health, a domain encompassing diabetes, cardiovascular disease, and obesity. Our results demonstrate that Adaptive Precise Boolean rubrics yield higher inter-rater agreement among expert and non-expert human evaluators, and in automated assessments, compared to traditional Likert scales, while requiring approximately half the evaluation time of Likert-based methods. This enhanced efficiency, particularly in automated evaluation and non-expert contributions, paves the way for more extensive and cost-effective evaluation of LLMs in health.
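A minimal sketch of how a response might be scored against a set of precise boolean rubric items and aggregated into a pass rate; the rubric items and the judge callable are invented placeholders, not the framework's actual rubrics.

```python
# Sketch only: boolean-rubric scoring of one model response; the rubric items
# are invented examples and `judge` may be a human rater or an LLM judge.
from typing import Callable

rubric = [
    "Does the response reference the user's recorded glucose values?",
    "Does the response avoid recommending medication changes?",
    "Does the response state the limitations of its recommendation?",
]

def score(response: str, judge: Callable[[str, str], bool]) -> float:
    """Fraction of boolean rubric items the response satisfies."""
    answers = [judge(item, response) for item in rubric]
    return sum(answers) / len(answers)
```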
Abstract: Language models (LMs) are capable of remarkably complex linguistic tasks; however, numerical reasoning is an area in which they frequently struggle. An important but rarely evaluated form of reasoning is understanding probability distributions. In this paper we focus on evaluating the probabilistic reasoning capabilities of LMs using idealized and real-world statistical distributions. We perform a systematic evaluation of state-of-the-art LMs on three tasks: estimating percentiles, drawing samples, and calculating probabilities. We find that zero-shot performance varies dramatically across different families of distributions and that performance can be improved significantly by using anchoring examples (shots) from within a distribution, or to a lesser extent across distributions within the same family. For real-world distributions, the absence of in-context examples can be substituted with context from which the LM can retrieve some statistics. Finally, we show that simply providing the mean and standard deviation of real-world distributions improves performance. To conduct this work, we developed a comprehensive benchmark distribution dataset with associated question-answer pairs that we release publicly, including questions about population health, climate, and finance.
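As an illustration of the three task types, the sketch below generates question-answer pairs from an idealized normal distribution using scipy.stats; the distribution and question wording are assumptions and this is not the released benchmark.

```python
# Sketch only: question-answer pairs for percentile estimation, probability
# calculation, and sampling from an idealized distribution; illustrative only.
from scipy import stats

dist = stats.norm(loc=170, scale=10)   # e.g. adult height in cm (illustrative)

percentile_q = ("What is the 90th percentile of this distribution?",
                round(float(dist.ppf(0.90)), 1))
probability_q = ("What is the probability that a value exceeds 185?",
                 round(float(1 - dist.cdf(185)), 3))
samples_q = ("Draw three samples from this distribution.",
             dist.rvs(size=3, random_state=0).round(1).tolist())

for question, answer in (percentile_q, probability_q, samples_q):
    print(question, "->", answer)
```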
    Towards a Personal Health Large Language Model
    Anastasiya Belyaeva
    Nick Furlotte
    Zhun Yang
    Chace Lee
    Erik Schenck
    Yojan Patel
    Jian Cui
    Logan Schneider
    Robby Bryant
    Ryan Gomes
    Allen Jiang
    Roy Lee
    Javier Perez
    Jamie Rogers
    Cathy Speed
    Shyam Tailor
    Megan Walker
    Jeffrey Yu
    Tim Althoff
    Conor Heneghan
    Mark Malhotra
    Shwetak Patel
    Shravya Shetty
    Jiening Zhan
    Yeswanth Subramanian
    Daniel McDuff
    arXiv (2024)
Abstract: Large language models (LLMs) can retrieve, reason over, and make inferences about a wide range of information. In health, most LLM efforts to date have focused on clinical tasks. However, mobile and wearable devices, which are rarely integrated into clinical tasks, provide a rich, continuous, and longitudinal source of data relevant for personal health monitoring. Here we present a new model, Personal Health Large Language Model (PH-LLM), a version of Gemini fine-tuned for text understanding and reasoning over numerical time-series personal health data for applications in sleep and fitness. To systematically evaluate PH-LLM, we created and curated three novel benchmark datasets that test 1) production of personalized insights and recommendations from measured sleep patterns, physical activity, and physiological responses, 2) expert domain knowledge, and 3) prediction of self-reported sleep quality outcomes. For the insights and recommendations tasks we created 857 case studies in sleep and fitness. These case studies, designed in collaboration with domain experts, represent real-world scenarios and highlight the model's capabilities in understanding and coaching. Through comprehensive human and automatic evaluation of domain-specific rubrics, we observed that both Gemini Ultra 1.0 and PH-LLM are not statistically different from expert performance in fitness and, while experts remain superior for sleep, fine-tuning PH-LLM provided significant improvements in using relevant domain knowledge and personalizing information for sleep insights. To further assess expert domain knowledge, we evaluated PH-LLM performance on multiple choice question examinations in sleep medicine and fitness. PH-LLM achieved 79% on sleep (N=629 questions) and 88% on fitness (N=99 questions), both of which exceed average scores from a sample of human experts as well as benchmarks for receiving continuing credit in those domains. To enable PH-LLM to predict self-reported assessments of sleep quality, we trained the model to predict self-reported sleep disruption and sleep impairment outcomes from textual and multimodal encoding representations of wearable sensor data. We demonstrate that multimodal encoding is both necessary and sufficient to match the performance of a suite of discriminative models in predicting these outcomes. Although further development and evaluation are necessary in the safety-critical personal health domain, these results demonstrate both the broad knowledge base and capabilities of Gemini models and the benefit of contextualizing physiological data for personal health applications as done with PH-LLM.