Applying Active Learning to Biomedical Text Processing

Chen, Yukun

dc.creator	Chen, Yukun
dc.date.accessioned	2020-08-22T17:22:48Z
dc.date.available	2013-07-29
dc.date.issued	2013-07-29
dc.identifier.uri	https://etd.library.vanderbilt.edu/etd-07122013-162658
dc.identifier.uri	http://hdl.handle.net/1803/12949
dc.description.abstract	Objective: Supervised machine learning methods have shown good performance in text classification tasks in the biomedical domain, but they often require large annotated corpora, which are costly to develop. Our goal is to assess whether active learning strategies can be integrated with supervised machine learning methods to reduce annotation cost while maintaining or improving the quality of classification models for biomedical text. Methods: We applied active learning to two biomedical natural language processing (NLP) tasks: 1) the assertion classification task in the 2010 i2b2/VA Clinical NLP Challenge, which was to determine the assertion status of clinical concepts; and 2) a supervised word sense disambiguation (WSD) task, which was to disambiguate 197 ambiguous words and abbreviations in MEDLINE abstracts. We developed Support Vector Machine (SVM)-based classifiers for both tasks. We then implemented several existing and newly developed active learning algorithms, integrated them with the SVM classifiers, and evaluated their performance on both tasks. Results: In the assertion classification task, active learning strategies required far fewer samples than random sampling to achieve the same classification performance. For example, to achieve an AUC of 0.79, the random sampling method used 32 samples, while our best active learning algorithm required only 12 samples, a reduction of 62.5% in manual annotation effort. In the WSD task, active learners significantly outperformed the passive learner, showing better performance for 177 out of 197 (89.8%) ambiguous terms. Further analysis showed that to achieve an average accuracy of 90%, the passive learner needed 38 annotated samples, while the active learners needed only 24, a 37% reduction in annotation effort.
We also analyzed cases where active learning algorithms did not achieve superior performance and identified three causes: (1) a poor model in the early learning stage; (2) easy WSD cases; and (3) difficult WSD cases. These findings provide useful insight for future improvements. Conclusion: Both studies demonstrated that integrating active learning strategies with supervised learning methods could effectively reduce annotation cost and improve classification models in biomedical text processing.
dc.format.mimetype	application/pdf
dc.subject	Active Learning
dc.subject	Natural Language Processing
dc.subject	Biomedical Text Processing
dc.subject	Machine Learning
dc.title	Applying Active Learning to Biomedical Text Processing
dc.type	thesis
dc.contributor.committeeMember	Thomas Lasko
dc.contributor.committeeMember	Qiaozhu Mei
dc.type.material	text
thesis.degree.name	MS
thesis.degree.level	thesis
thesis.degree.discipline	Biomedical Informatics
thesis.degree.grantor	Vanderbilt University
local.embargo.terms	2013-07-29
local.embargo.lift	2013-07-29
dc.contributor.committeeChair	Hua Xu
dc.contributor.committeeChair	Joshua C. Denny
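
The abstract describes querying informative samples for annotation and reports the resulting savings in annotation effort. The snippet below is a minimal sketch of those two ideas, not the thesis's implementation: it assumes margin-based uncertainty sampling (one common query strategy; the record does not say which strategies the thesis used) over a linear decision function. The pool, the weights, and the helper names `decision_value`, `most_uncertain`, and `annotation_reduction` are hypothetical. Only the sample counts (32 vs. 12, 38 vs. 24) come from the abstract.

```python
from typing import List, Sequence


def decision_value(w: Sequence[float], b: float, x: Sequence[float]) -> float:
    """Signed score of a linear classifier: w . x + b."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def most_uncertain(pool: List[Sequence[float]], w: Sequence[float], b: float) -> int:
    """Index of the unlabeled sample closest to the decision boundary."""
    return min(range(len(pool)), key=lambda i: abs(decision_value(w, b, pool[i])))


def annotation_reduction(passive_n: int, active_n: int) -> float:
    """Percentage of labels saved by the active learner at equal performance."""
    return 100.0 * (passive_n - active_n) / passive_n


# Hypothetical 2-D unlabeled pool and hypothetical linear model (assumptions).
pool = [(2.0, 1.0), (0.1, -0.2), (-1.5, 0.5)]
w, b = (1.0, 1.0), 0.0
query_index = most_uncertain(pool, w, b)  # sample to send to the annotator next

# Sample counts quoted in the abstract.
assertion_saving = annotation_reduction(32, 12)  # assertion task, AUC 0.79
wsd_saving = annotation_reduction(38, 24)        # WSD task, 90% average accuracy
```

In a full loop, the queried sample would be labeled, added to the training set, and the classifier retrained before the next query; the savings figures are simply the relative difference in labeled samples needed to reach the same performance.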

Files in this item

Name: Chen.pdf
Size: 1.300Mb
Format: PDF

This item appears in the following Collection(s)

Electronic Theses and Dissertations
Electronic theses and dissertations of masters and doctoral students submitted to the Graduate School.