Building a Best-in-Class De-identification Tool for Electronic Medical Records Through Ensemble Learning

March 26, 2021

Dec. 23, 2020

Abstract: The natural language portions of an electronic health record (EHR) communicate critical information about disease and treatment progression. However, the presence of personally identifying information in this data constrains its broad reuse. In the United States, the Health Insurance Portability and Accountability Act of 1996 (HIPAA) provides a de-identification standard for the removal of protected health information (PHI). Despite continuous improvements in methods for the automated detection of PHI, residual identifiers in clinical notes continue to pose significant challenges, often requiring manual validation and correction that does not scale to the volume of data needed for modern machine learning tools. In this paper, we describe an automated de-identification system that employs an ensemble architecture, incorporating attention-based deep learning models and rule-based methods, supported by heuristics, for detecting PHI in EHR data. Upon detection of PHI, the system transforms the detected identifiers into plausible, though fictional, surrogates to further obfuscate any leaked identifier. We evaluated the system on a publicly available dataset of 515 notes from the I2B2 2014 de-identification challenge and a dataset of 10,000 notes from the Mayo Clinic, and compared our approach with existing tools considered best-in-class. The system achieved a recall of 0.992 and 0.994 and a precision of 0.979 and 0.967 on the I2B2 and Mayo Clinic data, respectively.

Karthik Murugadoss1, Ajit Rajasekharan1, Bradley Malin PhD2, Vineet Agarwal1, Sairam Bade1, Jeff R. Anderson PhD3, Jason L. Ross1, William A. Faubion Jr.3, John D. Halamka MD3, Venky Soundararajan1*, Sankar Ardhanari1*
1 nference, Cambridge, MA 02142
2 Vanderbilt University Medical Center, Nashville, TN
3 Mayo Clinic, Rochester, MN 55905
Correspondence: Venky Soundararajan (
Correspondence: Andrew D Badley (
The copyright holder for this preprint is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY-NC-ND 4.0 International license.