
Greg Corrado
Greg Corrado is a senior research scientist interested in biological neuroscience, artificial intelligence, and scalable machine learning. He has published in fields ranging across behavioral economics, neuromorphic device physics, systems neuroscience, and deep learning. At Google he has worked for some time on brain inspired computing, and most recently has served as one of the founding members and the co-technical lead of Google's large scale deep neural networks project.
Authored Publications
Sort By
Performance of a Deep Learning Diabetic Retinopathy Algorithm in India
Arthur Brant
Xiang Yin
Lu Yang
Jay Nayar
Divleen Jeji
Sunny Virmani
Anchintha Meenu
Naresh Babu Kannan
Florence Thng
Lily Peng
Ramasamy Kim
JAMA Network Open (2025)
Preview abstract
Importance: While prospective studies have investigated the accuracy of artificial intelligence (AI) for detection of diabetic retinopathy (DR) and diabetic macular edema (DME), to date, little published data exist on the clinical performance of these algorithms.
Objective: To evaluate the clinical performance of an automated retinal disease assessment (ARDA) algorithm in the postdeployment setting at Aravind Eye Hospital in India.
Design, Setting, and Participants: This cross-sectional analysis involved an approximate 1% sample of fundus photographs from patients screened using ARDA. Images were graded via adjudication by US ophthalmologists for DR and DME, and ARDA’s output was compared against the adjudicated grades at 45 sites in Southern India. Patients were randomly selected between January 1, 2019, and July 31, 2023.
Main Outcomes and Measures: Primary analyses were the sensitivity and specificity of ARDA for severe nonproliferative DR (NPDR) or proliferative DR (PDR). Secondary analyses focused on sensitivity and specificity for sight-threatening DR (STDR) (DME or severe NPDR or PDR).
Results: Among the 4537 patients with 4537 images with adjudicated grades, mean (SD) age was 55.2 (11.9) years and 2272 (50.1%) were male. Among the 3941 patients with gradable photographs, 683 (17.3%) had any DR, 146 (3.7%) had severe NPDR or PDR, 109 (2.8%) had PDR, and 398 (10.1%) had STDR. ARDA’s sensitivity and specificity for severe NPDR or PDR were 97.0% (95% CI, 92.6%-99.2%) and 96.4% (95% CI, 95.7%-97.0%), respectively. Positive predictive value (PPV) was 50.7% and negative predictive value (NPV) was 99.9%. The clinically important miss rate for severe NPDR or PDR was 0% (eg, some patients with severe NPDR or PDR were interpreted as having moderate DR and referred to clinic). ARDA’s sensitivity for STDR was 95.9% (95% CI, 93.0%-97.4%) and specificity was 94.9% (95% CI, 94.1%-95.7%); PPV and NPV were 67.9% and 99.5%, respectively.
Conclusions and Relevance: In this cross-sectional study investigating the clinical performance of ARDA, sensitivity and specificity for severe NPDR and PDR exceeded 96% and caught 100% of patients with severe NPDR and PDR for ophthalmology referral. This preliminary large-scale postmarketing report of the performance of ARDA after screening 600 000 patients in India underscores the importance of monitoring and publication an algorithm's clinical performance, consistent with recommendations by regulatory bodies.
View details
Triaging mammography with artificial intelligence: an implementation study
Sarah M. Friedewald
Sunny Jansen
Fereshteh Mahvar
Timo Kohlberger
David V. Schacht
Sonya Bhole
Dipti Gupta
Scott Mayer McKinney
Stacey Caron
David Melnick
Mozziyar Etemadi
Samantha Winter
Alejandra Maciel
Luca Speroni
Martha Sevenich
Arnav Agharwal
Rubin Zhang
Gavin Duggan
Shiro Kadowaki
Atilla Kiraly
Jie Yang
Basil Mustafa
Krish Eswaran
Shravya Shetty
Breast Cancer Research and Treatment (2025)
Preview abstract
Purpose
Many breast centers are unable to provide immediate results at the time of screening mammography which results in delayed patient care. Implementing artificial intelligence (AI) could identify patients who may have breast cancer and accelerate the time to diagnostic imaging and biopsy diagnosis.
Methods
In this prospective randomized, unblinded, controlled implementation study we enrolled 1000 screening participants between March 2021 and May 2022. The experimental group used an AI system to prioritize a subset of cases for same-visit radiologist evaluation, and same-visit diagnostic workup if necessary. The control group followed the standard of care. The primary operational endpoints were time to additional imaging (TA) and time to biopsy diagnosis (TB).
Results
The final cohort included 463 experimental and 392 control participants. The one-sided Mann-Whitney U test was employed for analysis of TA and TB. In the control group, the TA was 25.6 days [95% CI 22.0–29.9] and TB was 55.9 days [95% CI 45.5–69.6]. In comparison, the experimental group's mean TA was reduced by 25% (6.4 fewer days [one-sided 95% CI > 0.3], p<0.001) and mean TB was reduced by 30% (16.8 fewer days; 95% CI > 5.1], p=0.003). The time reduction was more pronounced for AI-prioritized participants in the experimental group. All participants eventually diagnosed with breast cancer were prioritized by the AI.
Conclusions
Implementing AI prioritization can accelerate care timelines for patients requiring additional workup, while maintaining the efficiency of delayed interpretation for most participants. Reducing diagnostic delays could contribute to improved patient adherence, decreased anxiety and addressing disparities in access to timely care.
View details
Closing the AI generalisation gap by adjusting for dermatology condition distribution differences across clinical settings
Rajeev Rikhye
Aaron Loh
Grace Hong
Margaret Ann Smith
Vijaytha Muralidharan
Doris Wong
Michelle Phung
Nicolas Betancourt
Bradley Fong
Rachna Sahasrabudhe
Khoban Nasim
Alec Eschholz
Basil Mustafa
Jan Freyberg
Terry Spitz
Kat Chou
Peggy Bui
Justin Ko
Steven Lin
The Lancet eBioMedicine (2025)
Preview abstract
Background: Generalisation of artificial intelligence (AI) models to a new setting is challenging. In this study, we seek to understand the robustness of a dermatology (AI) model and whether it generalises from telemedicine cases to a new setting including both patient-submitted photographs (“PAT”) and clinician-taken photographs in-clinic (“CLIN”).
Methods: A retrospective cohort study involving 2500 cases previously unseen by the AI model, including both PAT and CLIN cases, from 22 clinics in the San Francisco Bay Area, spanning November 2015 to January 2021. The primary outcome measure for the AI model and dermatologists was the top-3 accuracy, defined as whether their top 3 differential diagnoses contained the top reference diagnosis from a panel of dermatologists per case.
Findings: The AI performed similarly between PAT and CLIN images (74% top-3 accuracy in CLIN vs. 71% in PAT), however, dermatologists were more accurate in PAT images (79% in CLIN vs. 87% in PAT). We demonstrate that demographic factors were not associated with AI or dermatologist errors; instead several categories of conditions were associated with AI model errors (p < 0.05). Resampling CLIN and PAT to match skin condition distributions to the AI development dataset reduced the observed differences (AI: 84% CLIN vs. 79% PAT; dermatologists: 77% CLIN vs. 89% PAT). We demonstrate a series of steps to close the generalisation gap, requiring progressively more information about the new dataset, ranging from the condition distribution to additional training data for rarer conditions. When using additional training data and testing on the dataset without resampling to match AI development, we observed comparable performance from end-to-end AI model fine tuning (85% in CLIN vs. 83% in PAT) vs. fine tuning solely the classification layer on top of a frozen embedding model (86% in CLIN vs. 84% in PAT).
Interpretation: AI algorithms can be efficiently adapted to new settings without additional training data by recalibrating the existing model, or with targeted data acquisition for rarer conditions and retraining just the final layer.
View details
A Toolbox for Surfacing Health Equity Harms and Biases in Large Language Models
Heather Cole-Lewis
Nenad Tomašev
Liam McCoy
Leo Anthony Celi
Alanna Walton
Chirag Nagpal
Akeiylah DeWitt
Philip Mansfield
Sushant Prakash
Joelle Barral
Ivor Horn
Karan Singhal
Nature Medicine (2024)
Preview abstract
Large language models (LLMs) hold promise to serve complex health information needs but also have the potential to introduce harm and exacerbate health disparities. Reliably evaluating equity-related model failures is a critical step toward developing systems that promote health equity. We present resources and methodologies for surfacing biases with potential to precipitate equity-related harms in long-form, LLM-generated answers to medical questions and conduct a large-scale empirical case study with the Med-PaLM 2 LLM. Our contributions include a multifactorial framework for human assessment of LLM-generated answers for biases and EquityMedQA, a collection of seven datasets enriched for adversarial queries. Both our human assessment framework and our dataset design process are grounded in an iterative participatory approach and review of Med-PaLM 2 answers. Through our empirical study, we find that our approach surfaces biases that may be missed by narrower evaluation approaches. Our experience underscores the importance of using diverse assessment methodologies and involving raters of varying backgrounds and expertise. While our approach is not sufficient to holistically assess whether the deployment of an artificial intelligence (AI) system promotes equitable health outcomes, we hope that it can be leveraged and built upon toward a shared goal of LLMs that promote accessible and equitable healthcare.
View details
Predicting Cardiovascular Disease Risk using Photoplethysmography and Deep Learning
Sebastien Baur
Mayank Daswani
Christina Chen
Mariam Jabara
Babak Behsaz
Shravya Shetty
Goodarz Danaei
Diego Ardila
PLOS Global Public Health, 4(6) (2024), e0003204
Preview abstract
Cardiovascular diseases (CVDs) are responsible for a large proportion of premature deaths in low- and middle-income countries. Early CVD detection and intervention is critical in these populations, yet many existing CVD risk scores require a physical examination or lab measurements, which can be challenging in such health systems due to limited accessibility. We investigated the potential to use photoplethysmography (PPG), a sensing technology available on most smartphones that can potentially enable large-scale screening at low cost, for CVD risk prediction. We developed a deep learning PPG-based CVD risk score (DLS) to predict the probability of having major adverse cardiovascular events (MACE: non-fatal myocardial infarction, stroke, and cardiovascular death) within ten years, given only age, sex, smoking status and PPG as predictors. We compare the DLS with the office-based refit-WHO score, which adopts the shared predictors from WHO and Globorisk scores (age, sex, smoking status, height, weight and systolic blood pressure) but refitted on the UK Biobank (UKB) cohort. All models were trained on a development dataset (141,509 participants) and evaluated on a geographically separate test (54,856 participants) dataset, both from UKB. DLS’s C-statistic (71.1%, 95% CI 69.9–72.4) is non-inferior to office-based refit-WHO score (70.9%, 95% CI 69.7–72.2; non-inferiority margin of 2.5%, p<0.01) in the test dataset. The calibration of the DLS is satisfactory, with a 1.8% mean absolute calibration error. Adding DLS features to the office-based score increases the C-statistic by 1.0% (95% CI 0.6–1.4). DLS predicts ten-year MACE risk comparable with the office-based refit-WHO score. Interpretability analyses suggest that the DLS-extracted features are related to PPG waveform morphology and are independent of heart rate. Our study provides a proof-of-concept and suggests the potential of a PPG-based approach strategies for community-based primary prevention in resource-limited regions.
View details
Differences between Patient and Clinician Submitted Images: Implications for Virtual Care of Skin Conditions
Rajeev Rikhye
Grace Eunhae Hong
Margaret Ann Smith
Aaron Loh
Vijaytha Muralidharan
Doris Wong
Michelle Phung
Nicolas Betancourt
Bradley Fong
Rachna Sahasrabudhe
Khoban Nasim
Alec Eschholz
Kat Chou
Peggy Bui
Justin Ko
Steven Lin
Mayo Clinic Proceedings: Digital Health (2024)
Preview abstract
Objective: To understand and highlight the differences in clinical, demographic, and image quality characteristics between patient-taken (PAT) and clinic-taken (CLIN) photographs of skin conditions.
Patients and Methods: This retrospective study applied logistic regression to data from 2500 deidentified cases in Stanford Health Care’s eConsult system, from November 2015 to January 2021. Cases with undiagnosable or multiple conditions or cases with both patient and clinician image sources were excluded, leaving 628 PAT cases and 1719 CLIN cases. Demographic characteristic factors, such as age and sex were self-reported, whereas anatomic location, estimated skin type, clinical signs and symptoms, condition duration, and condition frequency were summarized from patient health records. Image quality variables such as blur, lighting issues and whether the image contained skin, hair, or nails were estimated through a deep learning model.
Results: Factors that were positively associated with CLIN photographs, post-2020 were as follows: age 60 years or older, darker skin types (eFST V/VI), and presence of skin growths. By contrast, factors that were positively associated with PAT photographs include conditions appearing intermittently, cases with blurry photographs, photographs with substantial nonskin (or nail/hair) regions and cases with more than 3 photographs. Within the PAT cohort, older age was associated with blurry photographs.
Conclusion: There are various demographic, clinical, and image quality characteristic differences between PAT and CLIN photographs of skin concerns. The demographic characteristic differences present important considerations for improving digital literacy or access, whereas the image quality differences point to the need for improved patient education and better image capture workflows, particularly among elderly patients.
View details
Searching for Dermatology Information Online using Images vs Text: a Randomized Study
Jay Hartford
Natalie Salaets
Kimberley Raiford
Jay Nayar
Dounia Berrada
Harsh Kharbanda
Lou Wang
Peggy Bui
medRxiv (2024)
Preview abstract
Background Skin conditions are extremely common worldwide, and are an important cause of both anxiety and morbidity. Since the advent of the internet, individuals have used text-based search (eg, “red rash on arm”) to learn more about concerns on their skin, but this process is often hindered by the inability to accurately describe the lesion’s morphology. In the study, we surveyed respondents’ experiences with an image-based search, compared to the traditional text-based search experience.
Methods An internet-based survey was conducted to evaluate the experience of text-based vs image-based search for skin conditions. We recruited respondents from an existing cohort of volunteers in a commercial survey panel; survey respondents that met inclusion/exclusion criteria, including willingness to take photos of a visible concern on their body, were enrolled. Respondents were asked to use the Google mobile app to conduct both regular text-based search (Google Search) and image-based search (Google Lens) for their concern, with the order of text vs. image search randomized. Satisfaction for each search experience along six different dimensions were recorded and compared, and respondents’ preferences for the different search types along these same six dimensions were recorded.
Results 372 respondents were enrolled in the study, with 44% self-identifying as women, 86% as White and 41% over age 45. The rate of respondents who were at least moderately familiar with searching for skin conditions using text-based search versus image-based search were 81.5% and 63.5%, respectively. After using both search modalities, respondents were highly satisfied with both image-based and text-based search, with >90% at least somewhat satisfied in each dimension and no significant differences seen between text-based and image-based search when examining the responses on an absolute scale per search modality. When asked to directly rate their preferences in a comparative way, survey respondents preferred image-based search over text-based search in 5 out of 6 dimensions, with an absolute 9.9% more preferring image-based search over text-based search overall (p=0.004). 82.5% (95% CI 78.2 - 86.3) reported a preference to leverage image-based search (alone or in combination with text-based search) in future searches. Of those who would prefer to use a combination of both, 64% indicated they would like to start with image-based search, indicating that image-based search may be the preferred entry point for skin-related searches.
Conclusion Despite being less familiar with image-based search upon study inception, survey respondents generally preferred image-based search to text-based search and overwhelmingly wanted to include this in future searches. These results suggest the potential for image-based search to play a key role in people searching for information regarding skin concerns.
View details
Preview abstract
Importance: Interest in artificial intelligence (AI) has reached an all-time high, and health care leaders across the ecosystem are faced with questions about where, when, and how to deploy AI and how to understand its risks, problems, and possibilities.
Observations: While AI as a concept has existed since the 1950s, all AI is not the same. Capabilities and risks of various kinds of AI differ markedly, and on examination 3 epochs of AI emerge. AI 1.0 includes symbolic AI, which attempts to encode human knowledge into computational rules, as well as probabilistic models. The era of AI 2.0 began with deep learning, in which models learn from examples labeled with ground truth. This era brought about many advances both in people’s daily lives and in health care. Deep learning models are task-specific, meaning they do one thing at a time, and they primarily focus on classification and prediction. AI 3.0 is the era of foundation models and generative AI. Models in AI 3.0 have fundamentally new (and potentially transformative) capabilities, as well as new kinds of risks, such as hallucinations. These models can do many different kinds of tasks without being retrained on a new dataset. For example, a simple text instruction will change the model’s behavior. Prompts such as “Write this note for a specialist consultant” and “Write this note for the patient’s mother” will produce markedly different content.
Conclusions and Relevance: Foundation models and generative AI represent a major revolution in AI’s capabilities, ffering tremendous potential to improve care. Health care leaders are making decisions about AI today. While any heuristic omits details and loses nuance, the framework of AI 1.0, 2.0, and 3.0 may be helpful to decision-makers because each epoch has fundamentally different capabilities and risks.
View details
Conversational AI in health: Design considerations from a Wizard-of-Oz dermatology case study with users, clinicians and a medical LLM
Brenna Li
Amy Wang
Patricia Strachan
Julie Anne Seguin
Sami Lachgar
Karyn Schroeder
Renee Wong
Extended Abstracts of the 2024 CHI Conference on Human Factors in Computing Systems, Association for Computing Machinery, pp. 10
Preview abstract
Although skin concerns are common, access to specialist care is limited. Artificial intelligence (AI)-assisted tools to support medical decisions may provide patients with feedback on their concerns while also helping ensure the most urgent cases are routed to dermatologists. Although AI-based conversational agents have been explored recently, how they are perceived by patients and clinicians is not well understood. We conducted a Wizard-of-Oz study involving 18 participants with real skin concerns. Participants were randomly assigned to interact with either a clinician agent (portrayed by a dermatologist) or an LLM agent (supervised by a dermatologist) via synchronous multimodal chat. In both conditions, participants found the conversation to be helpful in understanding their medical situation and alleviate their concerns. Through qualitative coding of the conversation transcripts, we provide insight on the importance of empathy and effective information-seeking. We conclude with design considerations for future AI-based conversational agents in healthcare settings.
View details
Creating an Empirical Dermatology Dataset Through Crowdsourcing With Web Search Advertisements
Abbi Ward
Jimmy Li
Julie Wang
Sriram Lakshminarasimhan
Ashley Carrick
Jay Hartford
Pradeep Kumar S
Sunny Virmani
Renee Wong
Margaret Ann Smith
Dawn Siegel
Steven Lin
Justin Ko
JAMA Network Open (2024)
Preview abstract
Importance: Health datasets from clinical sources do not reflect the breadth and diversity of disease, impacting research, medical education, and artificial intelligence tool development. Assessments of novel crowdsourcing methods to create health datasets are needed.
Objective: To evaluate if web search advertisements (ads) are effective at creating a diverse and representative dermatology image dataset.
Design, Setting, and Participants: This prospective observational survey study, conducted from March to November 2023, used Google Search ads to invite internet users in the US to contribute images of dermatology conditions with demographic and symptom information to the Skin Condition Image Network (SCIN) open access dataset. Ads were displayed against dermatology-related search queries on mobile devices, inviting contributions from adults after a digital informed consent process. Contributions were filtered for image safety and measures were taken to protect privacy. Data analysis occurred January to February 2024.
Exposure: Dermatologist condition labels as well as estimated Fitzpatrick Skin Type (eFST) and estimated Monk Skin Tone (eMST) labels.
Main Outcomes and Measures: The primary metrics of interest were the number, quality, demographic diversity, and distribution of clinical conditions in the crowdsourced contributions. Spearman rank order correlation was used for all correlation analyses, and the χ2 test was used to analyze differences between SCIN contributor demographics and the US census.
Results: In total, 5749 submissions were received, with a median of 22 (14-30) per day. Of these, 5631 (97.9%) were genuine images of dermatological conditions. Among contributors with self-reported demographic information, female contributors (1732 of 2596 contributors [66.7%]) and younger contributors (1329 of 2556 contributors [52.0%] aged <40 years) had a higher representation in the dataset compared with the US population. Of 2614 contributors who reported race and ethnicity, 852 (32.6%) reported a racial or ethnic identity other than White. Dermatologist confidence in assigning a differential diagnosis increased with the number of self-reported demographic and skin-condition–related variables (Spearman R = 0.1537; P < .001). Of 4019 contributions reporting duration since onset, 2170 (54.0%) reported onset within less than 7 days of submission. Of the 2835 contributions that could be assigned a dermatological differential diagnosis, 2523 (89.0%) were allergic, infectious, or inflammatory conditions. eFST and eMST distributions reflected the geographical origin of the dataset.
Conclusions and Relevance: The findings of this survey study suggest that search ads are effective at crowdsourcing dermatology images and could therefore be a useful method to create health datasets. The SCIN dataset bridges important gaps in the availability of images of common, short-duration skin conditions.
View details