Scalable Natural Language De-identification based on Machine Learning Approaches
Electronic medical record (EMR) systems have been progressively adopted in numerous aspects of clinical care and healthcare endeavors. As the quantity and diversity of such data have grown, so too has its repurposing to support secondary use in a number of settings (e.g., public health and biomedical research). However, the dissemination of such data has been largely limited to structured data, as documents that contain natural language (e.g., communications between clinicians) have raised concerns over the extent to which the privacy of the corresponding patients can be upheld. To mitigate this concern, various federal and state laws, and the agencies that oversee them, recommend minimizing the amount of information disclosed and adhering to de-identification principles, such as those specified in the Privacy Rule of the Health Insurance Portability and Accountability Act of 1996. De-identification aims to remove protected health information (PHI), including explicit identifiers (e.g., patient names) and quasi-identifiers (e.g., dates of birth). While structured data is relatively straightforward to de-identify, unstructured natural language is more challenging because it is not always evident when a potential identifier is communicated. As a consequence, it is difficult in practice, whether manually or automatically, to detect and redact every potential identifier in a scalable manner without affecting non-identifying information. This dissertation seeks to address the scalability challenge in de-identification systems based on machine learning by accomplishing three tasks in the context of natural language de-identification. Starting with a collection of unannotated natural language clinical data, which may be exploited by malicious attackers once shared, the ultimate aim of the system is to recognize and, subsequently, protect the PHI.
The first task of this dissertation introduces a framework, based on game theory, to model the costs and benefits for a healthcare organization (HCO) that shares EMR data and for a recipient (who is a potential adversary) that may exploit the residual identifiers. Building on this model, we introduce a strategy to discover an optimized solution for the HCO that minimizes the training resources that must be allocated to achieve natural language de-identification at a sufficient level of performance. The second task addresses how to better utilize a given set of training data for de-identification. We propose and develop a feature extraction and clustering strategy to partition clinical documents into inferred types over which de-identification models are trained, tested, and ultimately applied. For the third task, we incorporate active learning into the de-identification workflow and conduct studies to show that, if the machine learning de-identification system can actively request information from outside the system (e.g., from a knowledgeable human assistant) to help create a better model, then less training data is needed to maintain (or even improve) the performance of the trained models. Simulations on a real-world clinical trials dataset and a publicly available i2b2 dataset demonstrate the effectiveness of active learning compared to passive learning in de-identification.
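The contrast between active and passive learning described above can be illustrated with a minimal sketch. It is a toy example and not the system developed in this dissertation: each item is a single integer feature, the model is a one-dimensional threshold, and the names `fit_threshold`, `run_learner`, and `oracle` are hypothetical. The `oracle` function stands in for the knowledgeable human assistant, and uncertainty sampling is assumed as the query strategy.

```python
import random

LOW, HIGH = 0, 1000    # assumed feature range for the toy pool
TRUE_BOUNDARY = 337    # hidden decision boundary, known only to the oracle


def oracle(x):
    """Stand-in for a human annotator: 1 if the item is 'PHI', else 0."""
    return int(x >= TRUE_BOUNDARY)


def fit_threshold(labeled):
    """Fit a 1-D threshold: midpoint between the largest labeled negative
    and the smallest labeled positive seen so far."""
    negatives = [x for x, y in labeled if y == 0]
    positives = [x for x, y in labeled if y == 1]
    lo = max(negatives) if negatives else LOW
    hi = min(positives) if positives else HIGH
    return (lo + hi) / 2.0


def accuracy(threshold, data):
    """Fraction of items whose predicted label matches the true label."""
    return sum((x >= threshold) == bool(y) for x, y in data) / len(data)


def run_learner(pool, budget, active, seed=0):
    """Label `budget` items from `pool` and return the fitted threshold.

    active=True  -> uncertainty sampling (query the item nearest the
                    current decision boundary).
    active=False -> passive learning (query items uniformly at random).
    """
    rng = random.Random(seed)
    unlabeled = list(pool)
    labeled = []
    for _ in range(budget):
        threshold = fit_threshold(labeled)
        if active:
            x = min(unlabeled, key=lambda v: abs(v - threshold))
        else:
            x = rng.choice(unlabeled)
        unlabeled.remove(x)
        labeled.append((x, oracle(x)))
    return fit_threshold(labeled)


pool = list(range(LOW, HIGH))
evaluation = [(x, oracle(x)) for x in pool]

active_acc = accuracy(run_learner(pool, budget=12, active=True), evaluation)
passive_acc = accuracy(run_learner(pool, budget=12, active=False), evaluation)
print("active:", active_acc, "passive:", passive_acc)
```

In this toy setting the active learner narrows in on the boundary in a binary-search-like fashion, so a small labeling budget goes much further than the same number of randomly chosen labels. Real de-identification models operate on token sequences with far richer features, so the sketch conveys only the querying idea, not expected performance.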