Active Deep Learning to Detect Demographic Traits in Free-Form Clinical Notes

Danny Vainstein
Roni Rosenfeld
Tzvika Hartman
Journal of Biomedical Informatics (2020)

Abstract

The free-form portions of clinical notes are a significant source of information for research, but
before they can be used, they must be de-identified to protect patients' privacy. De-identification
efforts have focused on known identifier types (names, ages, dates, addresses, IDs, etc.).
However, a note can contain residual "Demographic Traits" (DTs), unique enough to re-identify
the patient when combined with other such facts. Here we examine whether any residual risks
remain after removing these identifiers. After manually annotating over 140,000 words' worth of
medical notes, we found no remaining directly identifying information, and a low prevalence of
demographic traits, such as marital status or housing type. We developed an annotation guide
for the discovered DTs and used it to label MIMIC-III and i2b2-2006 clinical
notes as test sets. We then designed a "bootstrapped" active learning iterative process for
identifying DTs: we tentatively labeled as positive all sentences in the DT-rich note sections,
used these to train a binary classifier, manually corrected acute errors, and retrained the
classifier. This train-and-correct process may be iterated. Our active learning process
significantly improved the classifier's accuracy. Moreover, our BERT-based model outperformed
non-neural models when trained on both tentatively labeled data and manually relabeled
examples. To facilitate future research and benchmarking, we also produced and made publicly
available our human-annotated, DT-tagged datasets. We conclude that directly identifying
information is virtually non-existent in the multiple medical note types we investigated.
Demographic traits are present in medical notes, but can be detected with high accuracy using
a cost-effective human-in-the-loop active learning process, and redacted if desired.
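
The bootstrapped train-and-correct loop described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: a tiny bag-of-words Naive Bayes model stands in for the BERT-based classifier, "acute errors" are interpreted as the sentences where the model disagrees most strongly with the current label, and the function names (`train`, `prob_positive`, `bootstrap_active_learning`), the `oracle` callback, the `budget` parameter, and the toy sentences are all hypothetical.

```python
import math
from collections import Counter


def train(sentences, labels):
    """Fit a tiny bag-of-words Naive Bayes model (a stand-in classifier)."""
    counts = {0: Counter(), 1: Counter()}
    n_docs = Counter(labels)
    for text, y in zip(sentences, labels):
        counts[y].update(text.lower().split())
    vocab = set(counts[0]) | set(counts[1])
    return counts, n_docs, vocab


def prob_positive(model, text):
    """Return the model's estimated probability that a sentence contains a DT."""
    counts, n_docs, vocab = model
    total = n_docs[0] + n_docs[1]
    # Smoothed class prior, expressed as log-odds.
    log_odds = math.log((n_docs[1] + 1) / (total + 2)) - math.log((n_docs[0] + 1) / (total + 2))
    size = {c: sum(counts[c].values()) + len(vocab) for c in (0, 1)}
    for word in text.lower().split():
        if word in vocab:
            # Laplace-smoothed per-class word likelihoods.
            log_odds += math.log((counts[1][word] + 1) / size[1])
            log_odds -= math.log((counts[0][word] + 1) / size[0])
    return 1.0 / (1.0 + math.exp(-log_odds))


def bootstrap_active_learning(sentences, in_dt_section, oracle, rounds=2, budget=2):
    """Train on tentative labels, then repeatedly correct the worst disagreements."""
    # Step 1: tentatively label every sentence in a DT-rich section as positive.
    labels = [int(flag) for flag in in_dt_section]
    corrected = set()
    for _ in range(rounds):
        # Step 2: train a binary classifier on the current labels.
        model = train(sentences, labels)
        # Step 3: rank uncorrected sentences by disagreement between model and label.
        scored = [
            (abs(prob_positive(model, s) - y), i)
            for i, (s, y) in enumerate(zip(sentences, labels))
            if i not in corrected
        ]
        # Step 4: ask the human (oracle) to relabel the most acute cases.
        for _, i in sorted(scored, reverse=True)[:budget]:
            labels[i] = oracle(i)
            corrected.add(i)
    # Step 5: retrain on the corrected labels.
    return train(sentences, labels), labels


# Invented toy data for illustration only; not drawn from any clinical dataset.
sentences = [
    "patient is married and lives in an apartment",
    "follow up in two weeks",
    "lives alone in a nursing home",
    "blood pressure stable on current dose",
    "continue current medication",
]
in_dt_section = [True, True, True, False, False]  # e.g. a social-history section
truth = [1, 0, 1, 0, 0]                           # what a human annotator would say

model, labels = bootstrap_active_learning(
    sentences, in_dt_section, oracle=lambda i: truth[i], rounds=2, budget=1
)
```

In this sketch the human effort is bounded by `rounds * budget` corrections, which is the sense in which such a loop can be cost-effective: the annotator reviews only the sentences the classifier flags as most suspicious rather than the whole corpus.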