Active Learning for Named Entity Recognition in Clinical Text

Chen, Yukun

dc.creator	Chen, Yukun
dc.date.accessioned	2020-08-22T17:04:59Z
dc.date.available	2016-06-25
dc.date.issued	2015-06-25
dc.identifier.uri	https://etd.library.vanderbilt.edu/etd-06122015-162419
dc.identifier.uri	http://hdl.handle.net/1803/12541
dc.description.abstract	Named entity recognition (NER) is one of the fundamental tasks for building clinical natural language processing (NLP) systems. Machine learning (ML) based approaches can achieve good performance. However, they often require large numbers of annotated samples, which are expensive to build because annotation requires domain experts. Active learning (AL), a sample selection approach that can be integrated with supervised ML, has shown promising potential to minimize annotation cost while maximizing the performance of ML-based models in various NLP tasks. However, very few studies have investigated AL for clinical NER in a real-life setting. In this dissertation research, I systematically studied AL in a clinical NER task to identify medical problems, treatments, and lab tests in clinical notes. Novel AL algorithms were developed to query the most informative and least costly sentences based on three properties: uncertainty, representativeness, and annotation time. I also developed the first AL-enabled annotation system for clinical NER. Using this system, I further conducted user studies to assess the performance of AL in real-world annotation processes for building clinical NER systems. The initial user study shows that conventional AL methods that do not consider annotation time did not always perform better than random sampling across users. However, our newly developed AL algorithms with cost models for estimating annotation time were more promising in practice. To achieve an NER model with an F-measure of 0.70, simulated results show that the new AL method saved ~33.3% of estimated annotation time compared to random sampling. In the user study, the new AL algorithm achieved better performance than random sampling and saved up to ~26.5% of real annotation time for one of the users. To the best of our knowledge, this is the first study examining practical AL systems for clinical NER. Our study demonstrates that AL has the potential to save annotation time and improve model quality for building ML-based NER systems when novel querying algorithms are implemented. Our future work includes developing better querying algorithms and evaluating the system with a larger number of users.
dc.format.mimetype	application/pdf
dc.subject	natural language processing
dc.subject	named entity recognition
dc.subject	machine learning
dc.subject	Active learning
dc.subject	clinical NLP
dc.title	Active Learning for Named Entity Recognition in Clinical Text
dc.type	dissertation
dc.contributor.committeeMember	Thomas A. Lasko
dc.contributor.committeeMember	Qiaozhu Mei
dc.contributor.committeeMember	Qingxia Chen
dc.type.material	text
thesis.degree.name	PHD
thesis.degree.level	dissertation
thesis.degree.discipline	Biomedical Informatics
thesis.degree.grantor	Vanderbilt University
local.embargo.terms	2016-06-25
local.embargo.lift	2016-06-25
dc.contributor.committeeChair	Joshua C. Denny
dc.contributor.committeeChair	Hua Xu

Files in this item

Name: Chen.pdf
Size: 3.406Mb
Format: PDF


This item appears in the following Collection(s)

Electronic Theses and Dissertations
Electronic theses and dissertations of masters and doctoral students submitted to the Graduate School.
