
Jonathan Krause
Authored Publications
Performance of a Deep Learning Diabetic Retinopathy Algorithm in India
Arthur Brant
Xiang Yin
Lu Yang
Jay Nayar
Divleen Jeji
Sunny Virmani
Anchintha Meenu
Naresh Babu Kannan
Florence Thng
Lily Peng
Ramasamy Kim
JAMA Network Open (2025)
Importance: While prospective studies have investigated the accuracy of artificial intelligence (AI) for detection of diabetic retinopathy (DR) and diabetic macular edema (DME), to date, little published data exist on the clinical performance of these algorithms.
Objective: To evaluate the clinical performance of an automated retinal disease assessment (ARDA) algorithm in the postdeployment setting at Aravind Eye Hospital in India.
Design, Setting, and Participants: This cross-sectional analysis involved an approximately 1% sample of fundus photographs from patients screened using ARDA at 45 sites in Southern India. Images were graded via adjudication by US ophthalmologists for DR and DME, and ARDA’s output was compared against the adjudicated grades. Patients were randomly selected between January 1, 2019, and July 31, 2023.
Main Outcomes and Measures: Primary analyses were the sensitivity and specificity of ARDA for severe nonproliferative DR (NPDR) or proliferative DR (PDR). Secondary analyses focused on sensitivity and specificity for sight-threatening DR (STDR) (DME or severe NPDR or PDR).
Results: Among the 4537 patients with 4537 images with adjudicated grades, mean (SD) age was 55.2 (11.9) years and 2272 (50.1%) were male. Among the 3941 patients with gradable photographs, 683 (17.3%) had any DR, 146 (3.7%) had severe NPDR or PDR, 109 (2.8%) had PDR, and 398 (10.1%) had STDR. ARDA’s sensitivity and specificity for severe NPDR or PDR were 97.0% (95% CI, 92.6%-99.2%) and 96.4% (95% CI, 95.7%-97.0%), respectively. Positive predictive value (PPV) was 50.7% and negative predictive value (NPV) was 99.9%. The clinically important miss rate for severe NPDR or PDR was 0% (ie, the patients with severe NPDR or PDR whom ARDA did not grade as such were still interpreted as having moderate DR and referred to clinic). ARDA’s sensitivity for STDR was 95.9% (95% CI, 93.0%-97.4%) and specificity was 94.9% (95% CI, 94.1%-95.7%); PPV and NPV were 67.9% and 99.5%, respectively.
Conclusions and Relevance: In this cross-sectional study investigating the clinical performance of ARDA, sensitivity and specificity for severe NPDR and PDR exceeded 96%, and ARDA caught 100% of patients with severe NPDR and PDR for ophthalmology referral. This preliminary large-scale postmarketing report of the performance of ARDA after screening 600 000 patients in India underscores the importance of monitoring and publishing an algorithm's clinical performance, consistent with recommendations by regulatory bodies.
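The predictive values reported in the Results can be related to sensitivity, specificity, and prevalence through Bayes' rule. The following is a minimal sketch, not the study's analysis code: it uses the rounded point estimates quoted in the abstract, so small differences from the published PPV and NPV are expected, and `predictive_values` is a hypothetical helper name introduced here for illustration.

```python
def predictive_values(sensitivity, specificity, prevalence):
    """Return (PPV, NPV) implied by sensitivity, specificity, and prevalence."""
    true_pos = sensitivity * prevalence
    false_neg = (1.0 - sensitivity) * prevalence
    true_neg = specificity * (1.0 - prevalence)
    false_pos = (1.0 - specificity) * (1.0 - prevalence)
    ppv = true_pos / (true_pos + false_pos)
    npv = true_neg / (true_neg + false_neg)
    return ppv, npv


# Rounded point estimates from the abstract; prevalence is taken among the
# 3941 patients with gradable photographs (an assumption about the denominator).
severe_ppv, severe_npv = predictive_values(0.970, 0.964, 146 / 3941)
stdr_ppv, stdr_npv = predictive_values(0.959, 0.949, 398 / 3941)

print(f"severe NPDR or PDR: PPV={severe_ppv:.1%}, NPV={severe_npv:.1%}")
print(f"STDR:               PPV={stdr_ppv:.1%}, NPV={stdr_npv:.1%}")
```

With these rounded inputs the computed values land within about one percentage point of the reported figures. The low prevalence of severe NPDR or PDR is what keeps PPV near 50% even though specificity exceeds 96%.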
Generative AI for medical education: Insights from a case study with medical students and an AI tutor for clinical reasoning
Amy Wang
Roma Ruparel
Paul Jhun
Julie Anne Seguin
Patricia Strachan
Renee Wong
2025
Generative Artificial Intelligence (AI), particularly Large Language Models (LLMs), has demonstrated significant potential in clinical reasoning skills such as history-taking and differential diagnosis generation, which are critical aspects of medical education. This work explores how LLMs can augment medical curricula through interactive learning. We conducted a participatory design process with medical students, residents, and medical education experts to co-create an AI-powered tutor prototype for clinical reasoning. As part of the co-design process, we conducted a qualitative user study, investigating learning needs and practices via interviews and evaluating concepts through interactions with the prototype. Findings highlight the challenges learners face in transitioning from theoretical knowledge to practical application, and how an AI tutor can provide personalized practice and feedback. We conclude with design considerations, emphasizing the importance of context-specific knowledge and emulating positive preceptor traits, to guide the development of AI tools for medical education.
Validation of a Deep Learning Model for Diabetic Retinopathy on Patients with Young-Onset Diabetes
Tony Tan-Torres
Pradeep Praveen
Divleen Jeji
Arthur Brant
Xiang Yin
Lu Yang
Tayyeba Ali
Ilana Traynis
Dushyantsinh Jadeja
Rajroshan Sawhney
Sunny Virmani
Pradeep Venkatesh
Nikhil Tandon
Ophthalmology and Therapy (2025)
Introduction
While many deep learning systems (DLSs) for diabetic retinopathy (DR) have been developed and validated on cohorts with an average age in the 50s or older, fewer studies have examined younger individuals. This study aimed to understand DLS performance for younger individuals, who tend to display anatomic differences, such as prominent retinal sheen. This sheen can be mistaken for exudates or cotton wool spots and can potentially confound DLSs.
Methods
This was a prospective cross-sectional cohort study in a “Diabetes of young” clinic in India, enrolling 321 individuals between ages 18 and 45 (98.8% with type 1 diabetes). Participants had fundus photographs taken and the photos were adjudicated by experienced graders to obtain reference DR grades. We defined a younger cohort (age 18–25) and an older cohort (age 26–45) and examined differences in DLS performance between the two cohorts. The main outcome measures were sensitivity and specificity for DR.
Results
Eye-level sensitivity for moderate-or-worse DR was 97.6% [95% confidence interval (CI) 91.2, 98.2] for the younger cohort and 94.0% [88.8, 98.1] for the older cohort (p = 0.418 for difference). The specificity for moderate-or-worse DR significantly differed between the younger and older cohorts, 97.9% [95.9, 99.3] and 92.1% [87.6, 96.0], respectively (p = 0.008). Similar trends were observed for diabetic macular edema (DME); sensitivity was 79.0% [57.9, 93.6] for the younger cohort and 77.5% [60.8, 90.6] for the older cohort (p = 0.893), whereas specificity was 97.0% [94.5, 99.0] and 92.0% [88.2, 95.5] (p = 0.018). Retinal sheen presence (94% of images) was associated with DME presence (p < 0.0001). Image review suggested that sheen presence confounded reference DME status, increasing noise in the labels and depressing measured sensitivity. The gradability rate for both DR and DME was near-perfect (99% for both).
Conclusion
DLS-based DR screening performed well in younger individuals aged 18–25, with comparable sensitivity and higher specificity compared to individuals aged 26–45. Sheen presence in this cohort made identification of DME difficult for graders and depressed measured DLS sensitivity; additional studies incorporating optical coherence tomography may improve accuracy of measuring DLS DME sensitivity.
Safety principles for medical summarization using generative AI
Dillon Obika
Christopher Kelly
Nicola Ding
Chris Farrance
Praney Mittal
Donny Cheung
Heather Cole-Lewis
Madeleine Elish
Nature Medicine (2024)
The introduction of generative AI, particularly large language models (LLMs), presents exciting opportunities for healthcare. However, their novel capabilities also have the potential to introduce novel risks and hazards. This paper explores the unique safety challenges associated with LLMs in healthcare, using medical text summarization as a motivating example. Using MedLM as a case example, we propose leveraging existing standards and guidance while developing novel approaches tailored to the specific characteristics of LLMs.
Lessons learned from translating AI from development to deployment in healthcare
Sunny Virmani
Jay Nayar
Elin Rønby Pedersen
Divleen Jeji
Lily Peng
Nature Medicine (2023)
The application of an artificial intelligence (AI)-based screening tool for retinal disease in India and Thailand highlighted the myths and reality of introducing medical AI, which may form a framework for subsequent tools.
Domain-specific optimization and diverse evaluation of self-supervised models for histopathology
Jeremy Lai
Supriya Vijay
Jessica Loo
Saurabh Vyawahare
Saloni Agarwal
Fayaz Jamil
Cameron Chen
arXiv (2023)
Task-specific deep learning models in histopathology offer promising opportunities for improving diagnosis, clinical research, and precision medicine. However, development of such models is often limited by availability of high-quality data. Foundation models in histopathology that learn general representations across a wide range of tissue types, diagnoses, and magnifications offer the potential to reduce the data, compute, and technical expertise necessary to develop task-specific deep learning models with the required level of model performance. In this work, we describe the development and evaluation of foundation models for histopathology via self-supervised learning (SSL). We first establish a diverse set of benchmark tasks involving 17 unique tissue types and 12 unique cancer types and spanning different optimal magnifications and task types. Next, we use this benchmark to explore and evaluate histopathology-specific SSL methods, followed by further evaluation on held-out patch-level and weakly supervised tasks. We found that standard SSL methods thoughtfully applied to histopathology images are performant across our benchmark tasks and that domain-specific methodological improvements can further increase performance. Our findings reinforce the value of using domain-specific SSL methods in pathology and establish a set of high-quality foundation models to enable further research across diverse applications.
A deep learning model for novel systemic biomarkers in photographs of the external eye: a retrospective study
Ilana Traynis
Christina Chen
Akib Uddin
Jorge Cuadros
Lauren P. Daskivich
April Y. Maa
Ramasamy Kim
Eugene Yu-Chuan Kang
Lily Peng
Avinash Varadarajan
The Lancet Digital Health (2023)
Background
Photographs of the external eye were recently shown to reveal signs of diabetic retinal disease and elevated glycated haemoglobin. This study aimed to test the hypothesis that external eye photographs contain information about additional systemic medical conditions.
Methods
We developed a deep learning system (DLS) that takes external eye photographs as input and predicts systemic parameters, such as those related to the liver (albumin, aspartate aminotransferase [AST]); kidney (estimated glomerular filtration rate [eGFR], urine albumin-to-creatinine ratio [ACR]); bone or mineral (calcium); thyroid (thyroid stimulating hormone); and blood (haemoglobin, white blood cells [WBC], platelets). This DLS was trained using 123 130 images from 38 398 patients with diabetes undergoing diabetic eye screening in 11 sites across Los Angeles county, CA, USA. Evaluation focused on nine prespecified systemic parameters and leveraged three validation sets (A, B, C) spanning 25 510 patients with and without diabetes undergoing eye screening in three independent sites in Los Angeles county, CA, and the greater Atlanta area, GA, USA. We compared performance against baseline models incorporating available clinicodemographic variables (eg, age, sex, race and ethnicity, years with diabetes).
Findings
Relative to the baseline, the DLS achieved statistically significant superior performance at detecting AST >36.0 U/L, calcium <8.6 mg/dL, eGFR <60.0 mL/min/1.73 m², haemoglobin <11.0 g/dL, platelets <150.0 × 10³/μL, ACR ≥300 mg/g, and WBC <4.0 × 10³/μL on validation set A (a population resembling the development datasets), with the area under the receiver operating characteristic curve (AUC) of the DLS exceeding that of the baseline by 5.3–19.9% (absolute differences in AUC). On validation sets B and C, with substantial patient population differences compared with the development datasets, the DLS outperformed the baseline for ACR ≥300.0 mg/g and haemoglobin <11.0 g/dL by 7.3–13.2%.
Interpretation
We found further evidence that external eye photographs contain biomarkers spanning multiple organ systems. Such biomarkers could enable accessible and non-invasive screening of disease. Further work is needed to understand the translational implications.
Performance of a Diabetic Retinopathy Artificial Intelligence Algorithm for Ultra-widefield Imaging
Tunde Peto
Lloyd Aiello
Srinivas R Sadda
Drew Lewis
Anne Marie Cairns
Dana Keane
Sunny Virmani
Jerry Cavallerano
Barba Hamill
Lily Peng
Sara Ellen Godek
Lu Yang
Naho Kitade
Kira Whitehouse
ARVO (2022)
Purpose: To evaluate the performance of a deep learning model for diabetic retinopathy (DR) and diabetic macular edema screening when using ultra-widefield (UWF) imaging.
Methods: For model development, 67,200 UWF images were collected from DR programs and ophthalmology clinics worldwide. 30,836 images were double graded and adjudicated at 8 grading centres by 125 certified graders using ETDRS extension of the Modified Airlie House Classification of Diabetic Retinopathy following the JVN Clinical Trial Ultrawide Field Grading Manual v1.0. The grading system used traditional ETDRS 7-SF field definition as well as extended fields 3-7 to evaluate the retinal periphery. A further 36,364 UWF images were graded using a grading protocol based on the ICDR classification. The dataset was split into training, tuning and testing. The final DR model is an ensemble of 10 EfficientNet-b0 neural networks, independently trained with standard image augmentation techniques. For model validation, two independent sets of images were collected. Model performance was evaluated by comparing its predictions to the adjudicated ground truth for both sets of images.
Results: Prior to clinical validation, the model performance was internally evaluated on an independent set of 1967 images, of which 1050 were graded via adjudication as negative for more than mild diabetic retinopathy (mtmDR negative), and 917 as having referable diabetic retinopathy (mtmDR positive). The overall performance (Table 1) was weighted by target DR distribution. Clinical validation evaluated an independent data set of 420 images selected to achieve a target distribution that enabled appropriate confidence intervals for mtmDR sensitivity and specificity. A panel of three graders adjudicated these 420 images and assessed 241 as mtmDR negative, 179 as mtmDR positive, and 135 as vtDR positive. The model’s performance on the clinical validation set is shown in Table 2.
Conclusions: The deep learning model was developed with high quality graded UWF images and performed at a level that highly suggests usefulness in a clinical screening setting. A large, prospective multi-center clinical trial is currently evaluating the performance of a similar model in a real-world clinical setting.
This abstract was presented at the 2022 ARVO Annual Meeting, held in Denver, CO, May 1-4, 2022, and virtually.
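The Methods describe the final DR model as an ensemble of 10 independently trained networks but do not say how the members' outputs are combined. The sketch below assumes simple averaging of per-class probabilities, which is one common choice and not necessarily what the authors used; the function name, the three-member example, and the two-class layout are all hypothetical.

```python
def ensemble_predict(member_probabilities):
    """Average per-class probabilities across ensemble members.

    member_probabilities: list of probability vectors, one per model,
    all with the same number of classes.
    Returns the averaged vector and the index of the most probable class.
    """
    n_members = len(member_probabilities)
    n_classes = len(member_probabilities[0])
    averaged = [
        sum(member[c] for member in member_probabilities) / n_members
        for c in range(n_classes)
    ]
    predicted_class = max(range(n_classes), key=lambda c: averaged[c])
    return averaged, predicted_class


# Made-up outputs of 3 members over 2 classes (mtmDR negative, mtmDR positive).
members = [[0.7, 0.3], [0.4, 0.6], [0.2, 0.8]]
averaged, predicted = ensemble_predict(members)
print(averaged, predicted)
```

Averaging tends to smooth out the idiosyncratic errors of individual members, which is the usual motivation for training several networks independently.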
Longitudinal Screening for Diabetic Retinopathy in a Nationwide Screening Program: Comparing Deep Learning and Human Graders
Jirawut Limwattanayingyong
Variya Nganthavee
Kasem Seresirikachorn
Tassapol Singalavanija
Ngamphol Soonthornworasiri
Varis Ruamviboonsuk
Chetan Rao
Rajiv Raman
Andrzej Grzybowski
Lily Hao Yi Peng
Fred Hersch
Richa Tiwari
Paisan Raumviboonsuk
Journal of Diabetes Research (2020)
Objective.
To evaluate diabetic retinopathy (DR) screening via deep learning (DL) and trained human graders (HG) in a longitudinal cohort, as case spectrum shifts based on treatment referral and new-onset DR.
Methods.
We randomly selected patients with diabetes screened twice, two years apart within a nationwide screening program. The reference standard was established via adjudication by retina specialists. Each patient’s color fundus photographs were graded, and a patient was considered as having sight-threatening DR (STDR) if the worse eye had severe nonproliferative DR, proliferative DR, or diabetic macular edema. We compared DR screening via two modalities: DL and HG. For each modality, we simulated treatment referral by excluding patients with detected STDR from the second screening using that modality.
Results.
There were 5,738 patients (12.3% STDR) in the first screening. DL and HG captured different numbers of STDR cases, and after simulated referral and excluding ungradable cases, 4,148 and 4,263 patients remained in the second screening, respectively. The STDR prevalence at the second screening was 5.1% and 6.8% for DL- and HG-based screening, respectively. Along with the prevalence decrease, the sensitivity for both modalities decreased from the first to the second screening (DL: from 95% to 90%, p=0.008; HG: from 74% to 57%, p<0.001). At both the first and second screenings, the rate of false negatives for the DL was a fifth that of HG (0.5-0.6% vs. 2.9-3.2%).
Conclusion.
On 2-year longitudinal follow-up of a DR screening cohort, STDR prevalence decreased for both DL- and HG-based screening. Follow-up screenings in longitudinal DR screening can be more difficult and induce lower sensitivity for both DL and HG, though the false negative rate was substantially lower for DL. Our data may be useful for health-economics analyses of longitudinal screening settings.
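The false-negative rates quoted in the Results can be approximately recovered from the reported prevalence and sensitivity, since the fraction of all screened patients who are missed is prevalence × (1 − sensitivity). This is an illustrative sketch using the rounded point estimates from the abstract, not the study's own analysis; `false_negative_rate` is a hypothetical helper name.

```python
def false_negative_rate(prevalence, sensitivity):
    """Fraction of all screened patients who have STDR but are missed."""
    return prevalence * (1.0 - sensitivity)


# (prevalence, sensitivity) point estimates quoted in the Results.
screenings = {
    ("DL", "first"): (0.123, 0.95),
    ("DL", "second"): (0.051, 0.90),
    ("HG", "first"): (0.123, 0.74),
    ("HG", "second"): (0.068, 0.57),
}

rates = {key: false_negative_rate(*value) for key, value in screenings.items()}
for (modality, visit), rate in rates.items():
    print(f"{modality} {visit} screening: {rate:.1%} false negatives")
```

The results fall close to the reported ranges of 0.5-0.6% for DL and 2.9-3.2% for HG, which shows how a drop in sensitivity at the second screening can coincide with a roughly stable false-negative rate when prevalence also falls.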
Deep Learning and Glaucoma Specialists: The Relative Importance of Optic Disc Features to Predict Glaucoma Referral in Fundus Photographs
Sonia Phene
Carter Dunn
Naho Kitade
Derek Wu
Ashish Bora
Anita Misra
Abigail Huang
Arielle Spitze
Felipe Medeiros
April Maa
Monica Gandhi
Lily Peng
Ophthalmology (2019)
Purpose
To develop and validate a deep learning (DL) algorithm that predicts referable glaucomatous optic neuropathy (GON) and optic nerve head (ONH) features from color fundus images, to determine the relative importance of these features in referral decisions by glaucoma specialists (GSs) and the algorithm, and to compare the performance of the algorithm with eye care providers.
Design
Development and validation of an algorithm.
Participants
Fundus images from screening programs, studies, and a glaucoma clinic.
Methods
A DL algorithm was trained using a retrospective dataset of 86 618 images, assessed for glaucomatous ONH features and referable GON (defined as ONH appearance worrisome enough to justify referral for comprehensive examination) by 43 graders. The algorithm was validated using 3 datasets: dataset A (1205 images, 1 image/patient; 18.1% referable), images adjudicated by panels of GSs; dataset B (9642 images, 1 image/patient; 9.2% referable), images from a diabetic teleretinal screening program; and dataset C (346 images, 1 image/patient; 81.7% referable), images from a glaucoma clinic.
Main Outcome Measures
The algorithm was evaluated using the area under the receiver operating characteristic curve (AUC), sensitivity, and specificity for referable GON and glaucomatous ONH features.
Results
The algorithm’s AUC for referable GON was 0.945 (95% confidence interval [CI], 0.929–0.960) in dataset A, 0.855 (95% CI, 0.841–0.870) in dataset B, and 0.881 (95% CI, 0.838–0.918) in dataset C. Algorithm AUCs ranged between 0.661 and 0.973 for glaucomatous ONH features. The algorithm showed significantly higher sensitivity than 7 of 10 graders not involved in determining the reference standard, including 2 of 3 GSs, and showed higher specificity than 3 graders (including 1 GS), while remaining comparable to others. For both GSs and the algorithm, the most crucial features related to referable GON were: presence of vertical cup-to-disc ratio of 0.7 or more, neuroretinal rim notching, retinal nerve fiber layer defect, and bared circumlinear vessels.
Conclusions
A DL algorithm trained on fundus images alone can detect referable GON with higher sensitivity than and comparable specificity to eye care providers. The algorithm maintained good performance on an independent dataset with diagnoses based on a full glaucoma workup.
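The AUC used as a main outcome measure can be read as the probability that a randomly chosen referable image receives a higher score than a randomly chosen non-referable one. The sketch below computes it by direct pairwise comparison on made-up toy data; it illustrates the metric only and is not the evaluation code from the study, and `roc_auc` is a hypothetical helper name.

```python
def roc_auc(labels, scores):
    """AUC as the probability that a positive outranks a negative (ties count 0.5)."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Toy data (fabricated for illustration): 1 = referable, 0 = not referable.
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.3, 0.2]
auc = roc_auc(labels, scores)
print(f"AUC = {auc:.3f}")
```

The pairwise form is quadratic in the number of images, so for datasets of the size described here a rank-based implementation from an established statistics library would be the practical choice.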