Active Learning for Named Entity Recognition in Clinical Text
Named entity recognition (NER) is one of the fundamental tasks for building clinical natural language processing (NLP) systems. Machine learning (ML) based approaches can achieve good performance. However, they often require large numbers of annotated samples, which are expensive to build because annotation requires domain experts. Active learning (AL), a sample selection approach that can be integrated with supervised ML, has shown promising potential to minimize annotation cost while maximizing the performance of ML-based models in various NLP tasks. However, very few studies have investigated AL for clinical NER in a real-life setting. In this dissertation research, I systematically studied AL in a clinical NER task to identify medical problems, treatments, and lab tests in clinical notes. Novel AL algorithms were developed to query the most informative and least costly sentences based on three properties: uncertainty, representativeness, and annotation time. I also developed the first AL-enabled annotation system for clinical NER. Using this system, I further conducted user studies to assess the performance of AL in real-world annotation processes for building clinical NER systems. The initial user study shows that conventional AL methods, which do not consider annotation time, did not consistently outperform random sampling across users. However, the newly developed AL algorithms, which use cost models to estimate annotation time, were more promising in practice. In simulation, the new AL method reduced the estimated annotation time needed to reach an NER model with an F-measure of 0.70 by ~33.3% compared to random sampling. In the user study, the new AL algorithm achieved better performance than random sampling and saved up to ~26.5% of real annotation time for one of the users. To the best of our knowledge, this is the first study to examine practical AL systems for clinical NER.
Our study demonstrates that AL has the potential to save annotation time and improve model quality for building ML-based NER systems when novel querying algorithms are implemented. Our future work includes developing better querying algorithms and evaluating the system with a larger number of users.
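The querying idea summarized in the abstract, ranking unlabeled sentences by uncertainty, representativeness, and annotation time, might be sketched as follows. This is only an illustration under stated assumptions, not the dissertation's algorithm: the `Candidate` structure, the entropy-based uncertainty, the linear time model with its `base` and `per_token` constants, and the multiplicative score are all hypothetical choices made for this example.

```python
import math
from dataclasses import dataclass
from typing import List


@dataclass
class Candidate:
    """One unlabeled sentence from the pool (hypothetical structure)."""
    sentence_id: int
    # Per-token label distributions from the current NER model,
    # for example token-level marginals from a CRF.
    token_marginals: List[List[float]]
    # Assumed similarity of the sentence to the rest of the pool, in [0, 1].
    representativeness: float


def uncertainty(c: Candidate) -> float:
    """Sum of token-level entropies; higher means the model is less sure."""
    return sum(
        -sum(p * math.log(p) for p in dist if p > 0.0)
        for dist in c.token_marginals
    )


def estimated_seconds(c: Candidate, base: float = 2.0, per_token: float = 0.5) -> float:
    """Assumed linear cost model: fixed overhead plus time per token."""
    return base + per_token * len(c.token_marginals)


def query_score(c: Candidate) -> float:
    """Informativeness per estimated second of annotation (assumed form)."""
    return uncertainty(c) * c.representativeness / estimated_seconds(c)


def select_batch(pool: List[Candidate], k: int) -> List[Candidate]:
    """Return the k candidates with the highest query score."""
    return sorted(pool, key=query_score, reverse=True)[:k]


if __name__ == "__main__":
    sure, unsure = [1.0, 0.0], [0.5, 0.5]
    pool = [
        Candidate(1, [sure, sure], 0.9),                 # model already confident
        Candidate(2, [unsure, unsure], 0.8),             # uncertain and short
        Candidate(3, [unsure, unsure] + [sure] * 18, 0.8),  # uncertain but long
    ]
    print([c.sentence_id for c in select_batch(pool, 2)])
```

Dividing by the estimated time means that, of two sentences the model finds equally uncertain, the shorter one is queried first; a selection rule that ignores annotation time would treat them as equivalent.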