Jonas Kemp

Jonas is a research engineer at Google Health. He joined Google as an AI resident in 2017, investigating deep learning methods for modeling and understanding multimodal data in electronic health records. His research interests center on improving the quality, actionability, and reliability of clinical risk predictions, with a particular focus on natural language processing and representation learning methods. Jonas earned his BA in Human Biology and his MS in Computer Science from Stanford University.
Authored Publications
    AfriMed-QA: A Pan-African Multi-Specialty Medical Question-Answering Benchmark Dataset
    Tobi Olatunji
    Abraham Toluwase Owodunni
    Charles Nimo
    Jennifer Orisakwe
    Henok Biadglign Ademtew
    Chris Fourie
    Foutse Yuehgoh
    Stephen Moore
    Mardhiyah Sanni
    Emmanuel Ayodele
    Timothy Faniran
    Bonaventure F. P. Dossou
    Fola Omofoye
    Wendy Kinara
    Tassallah Abdullahi
    Michael Best
    2025
    Recent advancements in large language model (LLM) performance on medical multiple-choice question (MCQ) benchmarks have stimulated significant interest from patients and healthcare providers globally. Particularly in low- and middle-income countries (LMICs) facing acute physician shortages and a lack of specialists, LLMs offer a potentially scalable pathway to enhance healthcare access and reduce costs. However, LLM training data is sourced predominantly from Western text, and existing benchmarks are predominantly Western-centric, limited to MCQs, and focused on a narrow range of clinical specialties, raising concerns about their applicability in the Global South, particularly across Africa, where localized medical knowledge and linguistic diversity are often underrepresented. In this work, we introduce AfriMed-QA, the first large-scale multi-specialty Pan-African medical question-answering (QA) dataset designed to evaluate and develop equitable and effective LLMs for African healthcare. It contains 3,000 multiple-choice professional medical exam questions with answers and rationales, 1,500 short-answer questions (SAQ) with long-form answers, and 5,500 consumer queries, sourced from over 60 medical schools across 15 countries and covering 32 medical specialties. We further rigorously evaluate multiple open, closed, general, and biomedical LLMs across multiple axes including accuracy, consistency, factuality, bias, potential for harm, local geographic relevance, medical reasoning, and recall. We believe this dataset provides a valuable resource for the practical application of large language models in African healthcare and enhances the geographical diversity of health-LLM benchmark datasets.
    Deciphering clinical abbreviations with a privacy protecting machine learning system
    Alvin Rajkomar
    Eric Loreaux
    Yuchen Liu
    Benny Li
    Ming-Jun Chen
    Yi Zhang
    Afroz Mohiuddin
    Juraj Gottweis
    Nature Communications (2022)
    Physicians write clinical notes with abbreviations and shorthand that are difficult to decipher. Abbreviations can be clinical jargon (writing “HIT” for “heparin induced thrombocytopenia”), ambiguous terms that require expertise to disambiguate (using “MS” for “multiple sclerosis” or “mental status”), or domain-specific vernacular (“cb” for “complicated by”). Here we train machine learning models on public web data to decode such text by replacing abbreviations with their meanings. We report a single translation model that simultaneously detects and expands thousands of abbreviations in real clinical notes with accuracies ranging from 92.1%-97.1% on multiple external test datasets. The model equals or exceeds the performance of board-certified physicians (97.6% vs 88.7% total accuracy). Our results demonstrate a general method to contextually decipher abbreviations and shorthand that is built without any privacy-compromising data.
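    One way such training pairs could be built from public text, without touching private clinical notes, is to substitute known expansions with their abbreviations and train a model to translate the shorthand back. The toy Python sketch below illustrates that idea; the abbreviation dictionary and example sentence are hypothetical, and this is an illustration rather than the paper's actual data pipeline.

```python
# Hypothetical abbreviation dictionary: expansion -> clinical shorthand.
# A real system would use a much larger dictionary and web-scale text.
ABBREVIATIONS = {
    "heparin induced thrombocytopenia": "HIT",
    "multiple sclerosis": "MS",
    "complicated by": "cb",
}

def reverse_substitute(sentence: str) -> str:
    """Replace known expansions with abbreviations, producing a 'shorthand'
    input whose training target is the original, fully expanded sentence."""
    shorthand = sentence
    for expansion, abbrev in ABBREVIATIONS.items():
        shorthand = shorthand.replace(expansion, abbrev)
    return shorthand

# Each (shorthand, original) pair is one example for a sequence-to-sequence
# model that learns to expand abbreviations in context.
original = "hospital course complicated by heparin induced thrombocytopenia"
print(reverse_substitute(original), "->", original)
```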
    Instability in clinical risk prediction models using deep learning
    Daniel Lopez-Martinez
    Alex Yakubovich
    Martin Seneviratne
    Akshit Tyagi
    Ethan Steinberg
    N. Lance Downing
    Ron C. Li
    Keith E. Morse
    Nigam H. Shah
    Ming-Jun Chen
    Proceedings of the 2nd Machine Learning for Health symposium, PMLR (2022), pp. 552-565
    While it has been well known in the ML community that deep learning models suffer from instability, the consequences for healthcare deployments are under-characterised. We study the stability of different model architectures trained on electronic health records, using a set of outpatient prediction tasks as a case study. We show that repeated training runs of the same deep learning model on the same training data can result in significantly different outcomes at a patient level, even though global performance metrics remain stable. We propose two stability metrics for measuring the effect of randomness in model training, as well as mitigation strategies for improving model stability.
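    To make the patient-level instability concrete, the sketch below simulates predictions from repeated training runs and contrasts a stable global AUROC with the spread of each patient's predicted risk across runs. The simulated data and the spread measure are illustrative assumptions, not the paper's metrics or results.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)

# Simulated labels and predictions from 5 repeated training runs of the
# "same" model: each run sees the true signal plus run-specific noise.
n_patients, n_runs = 1000, 5
risk = rng.uniform(0, 1, n_patients)
labels = (rng.uniform(0, 1, n_patients) < risk).astype(int)
preds = np.clip(risk[None, :] + rng.normal(0, 0.15, (n_runs, n_patients)), 0, 1)

# The global metric looks stable across runs...
aucs = [roc_auc_score(labels, p) for p in preds]
print("AUROC per run:", np.round(aucs, 3))

# ...while individual patients' predicted risks can swing substantially.
per_patient_spread = preds.max(axis=0) - preds.min(axis=0)
print("median per-patient spread:", round(float(np.median(per_patient_spread)), 3))
print("patients with spread > 0.2:", int((per_patient_spread > 0.2).sum()))
```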
    User-centred design for machine learning in health care: a case study from care management
    Birju Patel
    Daniel Lopez-Martinez
    Doris Wong
    Eric Loreaux
    Janjri Desai
    Jonathan Chen
    Lance Downing
    Lutz Thomas Finger
    Martin Gamunu Seneviratne
    Ming-Jun Chen
    Nigam Shah
    Ron Li
    BMJ Health & Care Informatics (2022)
    Objectives: Few machine learning (ML) models are successfully deployed in clinical practice. One of the common pitfalls across the field is inappropriate problem formulation: designing ML to fit the data rather than to address a real-world clinical pain point. Methods: We introduce a practical toolkit for user-centred design consisting of four questions covering: (1) solvable pain points, (2) the unique value of ML (eg, automation and augmentation), (3) the actionability pathway and (4) the model’s reward function. This toolkit was implemented in a series of six participatory design workshops with care managers in an academic medical centre. Results: Pain points amenable to ML solutions included outpatient risk stratification and risk factor identification. The endpoint definitions, triggering frequency and evaluation metrics of the proposed risk scoring model were directly influenced by care manager workflows and real-world constraints. Conclusions: Integrating user-centred design early in the ML life cycle is key for configuring models in a clinically actionable way. This toolkit can guide problem selection and influence choices about the technical setup of the ML problem.
    Analyzing the Role of Model Uncertainty for Electronic Health Records
    Edward Choi
    Jeremy Nixon
    Ghassen Jerfel
    ACM Conference on Health, Inference, and Learning (ACM CHIL) (2020)
    In medicine, both ethical and monetary costs of incorrect predictions can be significant, and the complexity of the problems often necessitates increasingly complex models. Recent work has shown that changing just the random seed is enough for otherwise well-tuned deep neural networks to vary in their individual predicted probabilities. In light of this, we investigate the role of model uncertainty methods in the medical domain. Using RNN ensembles and various Bayesian RNNs, we show that population-level metrics, such as AUC-PR, AUC-ROC, log-likelihood, and calibration error, do not capture model uncertainty. Meanwhile, the presence of significant variability in patient-specific predictions and optimal decisions motivates the need for capturing model uncertainty. Understanding the uncertainty for individual patients is an area with clear clinical impact, such as determining when a model decision is likely to be brittle. We further show that RNNs with only Bayesian embeddings can be a more efficient way to capture model uncertainty compared to ensembles, and we analyze how model uncertainty is impacted across individual input features and patient subgroups.
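    As an illustration of why a population-level metric such as calibration error need not reflect model uncertainty, the sketch below computes a binned expected calibration error on an ensemble's mean prediction alongside the per-patient disagreement across ensemble members. The simulated data, binning choices, and thresholds are assumptions for illustration only, not the paper's experimental setup.

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulated predicted probabilities from a 10-member ensemble.
n_patients, n_members = 2000, 10
base = rng.uniform(0, 1, n_patients)
labels = (rng.uniform(0, 1, n_patients) < base).astype(int)
ensemble = np.clip(base[None, :] + rng.normal(0, 0.1, (n_members, n_patients)), 0, 1)
mean_pred = ensemble.mean(axis=0)

def expected_calibration_error(probs, labels, n_bins=10):
    """Binned ECE: gap between predicted and observed rates, weighted by bin size."""
    bins = np.clip((probs * n_bins).astype(int), 0, n_bins - 1)
    ece = 0.0
    for b in range(n_bins):
        mask = bins == b
        if mask.any():
            ece += mask.mean() * abs(probs[mask].mean() - labels[mask].mean())
    return ece

# A population-level metric on the averaged prediction...
print("ECE of ensemble mean:", round(expected_calibration_error(mean_pred, labels), 3))

# ...says nothing about how much the members disagree for a given patient.
per_patient_std = ensemble.std(axis=0)
print("patients with std > 0.1:", int((per_patient_std > 0.1).sum()))
```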
    Clinical notes in electronic health records contain highly heterogeneous writing styles, including non-standard terminology or abbreviations. Using these notes in predictive modeling has traditionally required preprocessing (e.g. taking frequent terms or topic modeling) that removes much of the richness of the source data. We propose a pretrained hierarchical recurrent neural network model that parses minimally processed clinical notes in an intuitive fashion, and show that it improves performance for discharge diagnosis classification tasks on the Medical Information Mart for Intensive Care III (MIMIC-III) dataset, compared to models that treat the notes as an unordered collection of terms or that conduct no pretraining. We also apply an attribution technique to examples to identify the words that the model uses to make its prediction, and show the importance of the words' nearby context.
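    A minimal PyTorch sketch of the hierarchical idea described above, a word-level RNN that encodes each sentence and a note-level RNN that encodes the resulting sequence of sentence vectors before classification, is shown below. The GRU choice, layer sizes, and class count are illustrative assumptions, not the paper's architecture or pretraining setup.

```python
import torch
import torch.nn as nn

class HierarchicalNoteEncoder(nn.Module):
    """Word-level GRU encodes each sentence; sentence-level GRU encodes the note."""
    def __init__(self, vocab_size, embed_dim=64, hidden_dim=128, n_classes=10):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.word_rnn = nn.GRU(embed_dim, hidden_dim, batch_first=True)
        self.sent_rnn = nn.GRU(hidden_dim, hidden_dim, batch_first=True)
        self.classifier = nn.Linear(hidden_dim, n_classes)

    def forward(self, tokens):
        # tokens: (batch, n_sentences, n_words) of word ids, 0 = padding.
        b, s, w = tokens.shape
        emb = self.embed(tokens.view(b * s, w))       # (b*s, w, embed_dim)
        _, sent_h = self.word_rnn(emb)                # (1, b*s, hidden_dim)
        sent_vecs = sent_h.squeeze(0).view(b, s, -1)  # (b, s, hidden_dim)
        _, note_h = self.sent_rnn(sent_vecs)          # (1, b, hidden_dim)
        return self.classifier(note_h.squeeze(0))     # (b, n_classes) logits

# Toy forward pass: 2 notes, 3 sentences each, 5 word ids per sentence.
model = HierarchicalNoteEncoder(vocab_size=100)
logits = model(torch.randint(0, 100, (2, 3, 5)))
print(logits.shape)  # torch.Size([2, 10])
```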