Although large language models (LLMs) show promise for clinical healthcare applications, their utility for personalized health monitoring using wearable device data remains underexplored. Here we introduce the Personal Health Large Language Model (PH-LLM), designed for applications in sleep and fitness. PH-LLM is a version of the Gemini LLM that was finetuned for text understanding and reasoning when applied to aggregated daily-resolution numerical sensor data. We created three benchmark datasets to assess multiple complementary aspects of sleep and fitness: expert domain knowledge, generation of personalized insights and recommendations and prediction of self-reported sleep quality from longitudinal data. PH-LLM achieved scores that exceeded a sample of human experts on multiple-choice examinations in sleep medicine (79% versus 76%) and fitness (88% versus 71%). In a comprehensive evaluation involving 857 real-world case studies, PH-LLM performed similarly to human experts for fitness-related tasks and improved over the base Gemini model in providing personalized sleep insights. Finally, PH-LLM effectively predicted self-reported sleep quality using a multimodal encoding of wearable sensor data, further demonstrating its ability to effectively contextualize wearable modalities. This work highlights the potential of LLMs to revolutionize personal health monitoring via tailored insights and predictions from wearable data and provides datasets, rubrics and benchmark performance to further accelerate personal health-related LLM research.
Health is a fundamental pillar of human wellness, and rapid advances in large language models (LLMs) have driven the development of a new generation of health agents. However, how to meet individuals' diverse needs in everyday non-clinical settings remains underexplored. In this work, we aim to build a comprehensive personal health assistant that can reason about multimodal data from everyday consumer devices and personal health records. To understand end users’ needs when interacting with such an assistant, we conducted an in-depth analysis of user query data, alongside qualitative insights from users and experts gathered through a user-centered design process. Based on these findings, we identified three major categories of consumer health needs, each of which is supported by a specialist subagent: (1) a data science agent that analyzes both personal and population-level time-series wearable and health record data to provide numerical health insights, (2) a health domain expert agent that integrates users’ health and contextual data to generate accurate, personalized insights based on medical and contextual user knowledge, and (3) a health coach agent that synthesizes data insights, drives multi-turn user interactions and interactive goal setting, guides users with a specified psychological strategy, and tracks their progress. Furthermore, we propose and develop a multi-agent framework, Personal Health Insight Agent Team (PHIAT), that enables dynamic, personalized interactions to address individual health needs. To evaluate these individual agents and the multi-agent system, we develop a set of N benchmark tasks and conduct both automated and human evaluations, involving hundreds of hours of evaluation from health experts and hundreds of hours of evaluation from end users.
Our work establishes a strong foundation for the vision of a personal health assistant accessible to everyone and represents the most comprehensive evaluation of a consumer AI health agent to date.
Smartphones are a vital tool for most people. They facilitate many everyday tasks, and as a result they have become ubiquitous and indispensable. There are concerns about how the use of these devices may impact mental health and wellbeing. Yet few studies have reported objective data about phone usage from large and diverse cohorts, and studies have found low correlations between subjective and objective smartphone use. To better elucidate these complex interactions, it is important to understand and characterize what resembles “normative” smartphone use behavior. In this paper, we present normative patterns of objectively measured phone usage from a large prospective observational study. We analyze a quarter of a million days of phone usage data from 10,099 adult subjects, providing objective longitudinal data over a four-week period in the US general population. Contrary to popular belief, our model shows little support for the conclusion that smartphone use predicts mood the following week or that mood predicts smartphone use the following week, with some results differing depending on whether the effects are within-person or between-person. Lastly, while some findings are statistically significant, the effect sizes of these results are minimal, suggesting little to no impact in real-world settings and therefore a lack of clinical significance.
Background:
Commercial wearables such as Fitbit quantify sleep metrics using fixed calendar times as default measurement periods, which may not adequately account for individual variations in sleep patterns. To address this limitation, experts in sleep medicine and wearable technology developed a user-centric algorithm designed to more accurately reflect actual sleep behaviors and improve the validity of wearable-derived sleep metrics.
Objective:
This study aims to describe the development of a new user-centric algorithm, compare its performance with the default calendar-relative algorithm, and provide a practical guide for analyzing All of Us Fitbit sleep data on a cloud-based platform.
Methods:
The default and user-centric algorithms were implemented to preprocess and compute sleep metrics related to schedule, duration, and disturbances using high-resolution Fitbit sleep data from 8563 participants (median age 58.1 years, 6002/8341, 71.96%, female) in the All of Us Research Program (version 7 Controlled Tier). Variations in typical sleep patterns were calculated by examining the differences in the mean number of primary sleep logs classified by each algorithm. Linear mixed-effects models were used to compare differences in sleep metrics across quartiles of variation in typical sleep patterns.
Results:
Out of 8,452,630 total sleep logs collected over a median of 4.2 years of Fitbit monitoring, 401,777 (4.75%) nonprimary sleep logs identified by the default algorithm were reclassified as primary sleep by the user-centric algorithm. Variation in typical sleep patterns ranged from –0.08 to 1. Among participants with the greatest variation in typical sleep patterns, the user-centric algorithm identified significantly more total sleep time (by 17.6 minutes; P<.001), more wake after sleep onset (by 13.9 minutes; P<.001), and lower sleep efficiency (by 2.0%; P<.001), on average. Differences in sleep stage metrics between the 2 algorithms were modest.
Conclusions:
The user-centric algorithm captures the natural variability in sleep schedules, providing an alternative approach to preprocess and evaluate sleep metrics related to schedule, duration, and disturbances. A publicly available R package facilitates the implementation of this algorithm for clinical and translational research.
Introduction:
Wearables offer a scalable, passive, and objective measure of sleep health. However, previously reported Spearman correlations between subjective and wearable-derived sleep measures have been modest (rS = 0.3–0.46). We set out to determine whether wearables adequately capture subjective feelings of sleep disturbance and impairment in a large, diverse, ecologically valid sleep study.
Methods:
Subject data (n = 2922, mean age = 45.4 (12.6), 74% female) came from the Digital Wellbeing Study, a joint study between the University of Oregon and Google investigating how smartphone usage impacts well-being. Wearable (Fitbit) derived sleep metrics were summarized across the week prior to the administration of the PROMIS Sleep Disturbance (SD) and Sleep Related Impairment (SI) Short Form surveys. A series of stepwise OLS regressions was used to test the predictive power of each sleep metric over a baseline model of age and sex.
Results:
Total sleep time, resting heart rate, variability in total sleep time, and restlessness (an accelerometer-based metric) improved model fit for both SI and SD over a baseline model (SI baseline adjusted R² = 0.087; SD baseline adjusted R² = 0.024). Deep sleep (e.g., N3) minutes uniquely improved SI model fit, while longest wake length and total wake minutes improved SD fit. REM percent and normalized nightly heart rate did not improve model fit. The final model explained 12.9% of the variance in SI and 8.4% of the variance in SD. The most predictive single sleep metric was variability in total sleep time for SI (adjusted R² = 0.104) and total sleep time for SD (age and sex included). Fitbit’s composite “Sleep Score” was the single best predictor of SD when included in the analysis (age and sex excluded).
Conclusion: As demonstrated in previous studies, wearable-derived sleep metrics are modest predictors of perceived sleep disturbance or sleep-related impairment. Composite metrics that include measures of sleep variability are recommended.
Support: This research was funded by Google Inc.
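For readers less familiar with the adjusted R² values quoted in the abstract above, the standard adjustment penalizes R² for the number of predictors. The sketch below is illustrative only: the function name `adjusted_r2` and the raw R² value are hypothetical, and it assumes the baseline model has two predictors (age and sex) fitted on the stated n = 2922.

```python
def adjusted_r2(r2: float, n: int, p: int) -> float:
    """Adjusted R^2 for a linear model with n observations and
    p predictors (the intercept is not counted in p)."""
    if n - p - 1 <= 0:
        raise ValueError("need n > p + 1 observations")
    # Shrink the explained-variance share by the degrees-of-freedom ratio.
    return 1.0 - (1.0 - r2) * (n - 1) / (n - p - 1)


# Hypothetical raw R^2 for a two-predictor baseline (age, sex); the
# abstract reports only adjusted values, so this input is made up.
baseline_adj = adjusted_r2(0.0876, n=2922, p=2)

# Adding a predictor helps only if the gain in R^2 outweighs the penalty.
with_extra_predictor = adjusted_r2(0.0876, n=2922, p=3)
print(baseline_adj > with_extra_predictor)
```

With a sample this large the penalty per added predictor is small, which is why stepwise comparisons on adjusted R² mostly track the raw gain in fit.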
Language models (LMs) are capable of remarkably complex linguistic tasks; however, numerical reasoning is an area in which they frequently struggle. An important but rarely evaluated form of reasoning is understanding probability distributions. In this paper we focus on evaluating the probabilistic reasoning capabilities of LMs using idealized and real-world statistical distributions. We perform a systematic evaluation of state-of-the-art LMs on three tasks: estimating percentiles, drawing samples, and calculating probabilities. We find that zero-shot performance varies dramatically across different families of distributions and that performance can be improved significantly by using anchoring examples (shots) from within a distribution, or to a lesser extent across distributions within the same family. For real-world distributions, the absence of in-context examples can be substituted with context from which the LM can retrieve some statistics. Finally, we show that simply providing the mean and standard deviation of real-world distributions improves performance. To conduct this work, we developed a comprehensive benchmark distribution dataset with associated question-answer pairs that we release publicly, including questions about population health, climate, and finance.
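The three tasks named in the abstract above (estimating percentiles, drawing samples, calculating probabilities) have exact reference answers once a distribution is specified. The sketch below shows how such ground truth could be computed for a normal distribution from only a mean and standard deviation. It is not the paper's benchmark code: the helper names and the parameter values (mean 70, standard deviation 10) are hypothetical, chosen for illustration.

```python
from statistics import NormalDist

# Hypothetical distribution parameters, for illustration only.
dist = NormalDist(mu=70.0, sigma=10.0)


def percentile_of(value: float, d: NormalDist) -> float:
    """Percentile (0-100) at which `value` falls under distribution d."""
    return 100.0 * d.cdf(value)


def prob_between(lo: float, hi: float, d: NormalDist) -> float:
    """Probability that a draw from d lands in the interval [lo, hi]."""
    return d.cdf(hi) - d.cdf(lo)


def draw_samples(n: int, d: NormalDist, seed: int = 0) -> list[float]:
    """Draw n reproducible samples from d."""
    return d.samples(n, seed=seed)


print(percentile_of(80.0, dist))       # percentile of a value one SD above the mean
print(prob_between(60.0, 80.0, dist))  # mass within one SD of the mean
print(draw_samples(3, dist))
```

An LM's answer to a percentile or probability question could then be scored by its absolute error against these values, and its sampled values compared with the reference distribution.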
Poor sleep health is associated with increased all-cause mortality and incidence of many chronic conditions. Previous studies have relied on cross-sectional and self-reported survey data or polysomnograms, which have limitations with respect to data granularity, sample size and longitudinal information. Here, using objectively measured, longitudinal sleep data from commercial wearable devices linked to electronic health record data from the All of Us Research Program, we show that sleep patterns, including sleep stages, duration and regularity, are associated with chronic disease incidence. Of the 6,785 participants included in this study, 71% were female, 84% self-identified as white and 71% had a college degree; the median age was 50.2 years (interquartile range = 35.7, 61.5) and the median sleep monitoring period was 4.5 years (2.5, 6.5). We found that rapid eye movement sleep and deep sleep were inversely associated with the odds of incident atrial fibrillation and that increased sleep irregularity was associated with increased odds of incident obesity, hyperlipidemia, hypertension, major depressive disorder and generalized anxiety disorder. Moreover, J-shaped associations were observed between average daily sleep duration and hypertension, major depressive disorder and generalized anxiety disorder. These findings show that sleep stages, duration and regularity are all important factors associated with chronic disease development and may inform evidence-based recommendations on healthy sleeping habits.
Large language models (LLMs) can retrieve, reason over, and make inferences about a wide range of information. In health, most LLM efforts to date have focused on clinical tasks. However, mobile and wearable devices, which are rarely integrated into clinical tasks, provide a rich, continuous, and longitudinal source of data relevant for personal health monitoring. Here we present a new model, Personal Health Large Language Model (PH-LLM), a version of Gemini fine-tuned for text understanding and reasoning over numerical time-series personal health data for applications in sleep and fitness. To systematically evaluate PH-LLM, we created and curated three novel benchmark datasets that test 1) production of personalized insights and recommendations from measured sleep patterns, physical activity, and physiological responses, 2) expert domain knowledge, and 3) prediction of self-reported sleep quality outcomes. For the insights and recommendations tasks we created 857 case studies in sleep and fitness. These case studies, designed in collaboration with domain experts, represent real-world scenarios and highlight the model’s capabilities in understanding and coaching. Through comprehensive human and automatic evaluation of domain-specific rubrics, we observed that both Gemini Ultra 1.0 and PH-LLM are not statistically different from expert performance in fitness and, while experts remain superior for sleep, fine-tuning PH-LLM provided significant improvements in using relevant domain knowledge and personalizing information for sleep insights. To further assess expert domain knowledge, we evaluated PH-LLM performance on multiple choice question examinations in sleep medicine and fitness. PH-LLM achieved 79% on sleep (N=629 questions) and 88% on fitness (N=99 questions), both of which exceed average scores from a sample of human experts as well as benchmarks for receiving continuing credit in those domains. 
To enable PH-LLM to predict self-reported assessments of sleep quality, we trained the model to predict self-reported sleep disruption and sleep impairment outcomes from textual and multimodal encoding representations of wearable sensor data. We demonstrate that multimodal encoding is both necessary and sufficient to match performance of a suite of discriminative models to predict these outcomes. Although further development and evaluation are necessary in the safety-critical personal health domain, these results demonstrate both the broad knowledge base and capabilities of Gemini models and the benefit of contextualizing physiological data for personal health applications as done with PH-LLM.
Study protocol for the PEARL Study, which evaluates the impact on user step count of two personalized nudging strategies delivered as pop-up notifications via the Fitbit app. Specifically, the study personalizes two parameters of the pop-up notification system: message content and timing (hour of the day).
This Op-ed is by leaders from the American Heart Association, Digital Medicine Society and Google involved in a Digital Medicine Society-sponsored project on digital measures for physical activity. The Op-ed summarizes evidence that the technology exists today to digitally measure physical activity in the broad population – and, by measuring it the right way, we can embrace it as the ‘6th vital sign’ and enter a new era of healthcare centered on proactive patient care.