    Scalable Natural Language De-identification based on Machine Learning Approaches

    Li, Muqun
    URI: https://etd.library.vanderbilt.edu/etd-03262018-113355
    http://hdl.handle.net/1803/11460
    Date: 2018-03-27

    Abstract

    Electronic medical record (EMR) systems have been progressively adopted in numerous aspects of clinical care and healthcare endeavors. As the quantity and diversity of such data grow, so too does its repurposing to support secondary use in a number of settings (e.g., public health and biomedical research). However, the dissemination of such data has been largely limited to structured data, because documents that contain natural language (e.g., communications between clinicians) have raised concerns over the extent to which the privacy of the corresponding patients can be upheld. To mitigate this concern, various federal and state laws, and the agencies that oversee them, recommend minimizing the amount of information disclosed and adhering to de-identification principles, such as those specified in the Privacy Rule of the Health Insurance Portability and Accountability Act of 1996. De-identification aims to remove protected health information (PHI), including explicit identifiers (e.g., patient names) and quasi-identifiers (e.g., dates of birth). While structured data is relatively straightforward to de-identify, unstructured natural language is more challenging because it is not always evident when a potential identifier is communicated. As a consequence, whether the work is done manually or automatically, it is improbable in practice to detect and amend every potential identifier, without affecting non-identifying information, in a scalable manner. This dissertation seeks to address the scalability challenge in machine-learning-based de-identification systems by accomplishing three tasks in the context of natural language de-identification. Starting with a collection of unannotated natural language clinical data, which could be exploited by malicious attackers once shared, the ultimate aim of the system is to recognize and, subsequently, protect the PHI.
The first task of this dissertation introduces a framework, based on game theory, to model the costs and benefits for a healthcare organization (HCO) that shares EMR data and a recipient (who is a potential adversary) that may exploit the residual identifiers. Building on this model, we introduce a strategy to discover an optimized solution for the HCO that minimizes the amount of training resources that must be allocated to achieve natural language de-identification at a sufficient level of performance. The second aspect of the scalability challenge this dissertation focuses on is how to better utilize a given set of training data for de-identification. We propose and develop a feature extraction and clustering strategy to partition clinical documents into inferred types over which de-identification models are trained, tested, and ultimately applied. For the last part of the problem, we incorporate active learning into the de-identification workflow and conduct studies showing that, if the machine learning de-identification system can actively request information from outside of the system (e.g., from a knowledgeable human assistant) to help create a better model, then less training data will be needed to maintain (or even improve) the performance of trained models. Simulations on a real-world clinical trials dataset and a publicly available i2b2 dataset demonstrate the effectiveness of active learning compared to passive learning in de-identification.
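
To make the contrast between active and passive learning concrete, the following is a minimal toy sketch and not the dissertation's system. It assumes, purely for illustration, that each token is summarized by a single integer "PHI score", that tokens at or above a hidden cutoff are PHI, and that a human annotator can be queried for labels. The names `fit_threshold`, `simulate`, `oracle`, and the cutoff value 600 are all hypothetical. The active learner queries the unlabeled item closest to its current decision threshold (uncertainty sampling), while the passive learner queries items at random.

```python
import random


def fit_threshold(labeled, low, high):
    """Place the threshold midway between the highest-scoring known non-PHI
    item and the lowest-scoring known PHI item (hypothetical toy model)."""
    negatives = [x for x, y in labeled if y == 0]
    positives = [x for x, y in labeled if y == 1]
    lo = max(negatives) if negatives else low
    hi = min(positives) if positives else high
    return (lo + hi) / 2


def accuracy(threshold, pool, oracle):
    """Fraction of the pool classified correctly by 'score > threshold'."""
    hits = sum((x > threshold) == (oracle(x) == 1) for x in pool)
    return hits / len(pool)


def simulate(pool, oracle, budget, active, seed=0):
    """Spend `budget` label requests and record accuracy after each one."""
    rng = random.Random(seed)
    low, high = min(pool) - 1, max(pool) + 1
    unlabeled, labeled, curve = list(pool), [], []
    for _ in range(budget):
        threshold = fit_threshold(labeled, low, high)
        if active:
            # Uncertainty sampling: ask about the item nearest the boundary.
            pick = min(unlabeled, key=lambda x: abs(x - threshold))
        else:
            # Passive learning: ask about a randomly chosen item.
            pick = rng.choice(unlabeled)
        unlabeled.remove(pick)
        labeled.append((pick, oracle(pick)))  # the annotator supplies the label
        curve.append(accuracy(fit_threshold(labeled, low, high), pool, oracle))
    return curve


def oracle(score):
    """Stand-in for a human annotator; the cutoff 600 is an assumption."""
    return 1 if score >= 600 else 0


pool = list(range(1000))  # hypothetical integer PHI scores
BUDGET = 12
active_curve = simulate(pool, oracle, BUDGET, active=True)
passive_curve = simulate(pool, oracle, BUDGET, active=False)
print("active :", active_curve[-1])
print("passive:", passive_curve[-1])
```

In this one-dimensional setting the active learner behaves like a binary search for the cutoff, so the number of labels it needs grows roughly with the logarithm of the pool size. Real de-identification models operate on far richer features, and the size of any saving there is an empirical question.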

    Files in this item

    Name: Li.pdf
    Size: 5.591Mb
    Format: PDF

    This item appears in the following collection(s):

    • Electronic Theses and Dissertations
