A Machine Learning-Based Information Retrieval Framework for Molecular Medicine Predictive Models
Wehbe, Firas Hazem
Molecular medicine encompasses the application of molecular biology techniques and knowledge to the prevention, diagnosis, and treatment of diseases and disorders. Statistical and computational models can predict clinical outcomes, such as prognosis or response to treatment, from the results of molecular assays. For advances in molecular medicine to translate into clinical results, clinicians and translational researchers need up-to-date access to high-quality predictive models. The number of such models reported in the literature is growing at a pace that overwhelms the human ability to assimilate this information manually. Retrieving and organizing the vast amount of published information within this domain is therefore an important problem, and the inherent complexity of the domain and the fast pace of scientific discovery make it particularly challenging. This dissertation describes a framework for the retrieval and organization of clinical bioinformatics predictive models. A semantic analysis of the domain informed the design of the framework. Specifically, it allowed the development of a specialized annotation scheme for published articles that can be used for meaningful organization, indexing, and efficient retrieval. The annotation scheme was codified in an annotation form and an accompanying guidelines document, which multiple human experts used to annotate over 1000 articles. The resulting annotated datasets were then used to train and test support vector machine (SVM) classifiers. The classifiers were designed to provide a scalable mechanism to replicate human experts’ ability (1) to retrieve relevant MEDLINE articles and (2) to annotate these articles using the specialized annotation scheme.
The classifiers showed very good predictive ability, which generalized to different disease domains and to datasets annotated by independent experts. The experiments highlighted the need for unambiguous operational definitions of the complex concepts used for semantic annotation. The impact of these semantic definitions on the quality of manual annotations and on the performance of the machine learning classifiers is discussed.
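
As an illustration of the retrieval step described above, the following is a minimal sketch of a linear SVM text classifier that labels article titles as relevant or not relevant and ranks them by score. It is not the dissertation's implementation: the abstract does not specify the features, kernel, or software used, so the bag-of-words features, the Pegasos-style subgradient solver, the absence of a bias term, and all names (`DOCS`, `LABELS`, `train_linear_svm`, `predict`, `rank`) are assumptions made for illustration. The titles are invented toy examples, not data from the study.

```python
import math
import re
from collections import Counter

# Hypothetical toy corpus: invented titles, NOT data from the dissertation.
DOCS = [
    "Gene expression signature predicts survival in breast cancer",
    "Microarray classifier for chemotherapy response in leukemia",
    "Proteomic biomarker model predicts prognosis of lung cancer",
    "Hospital staffing levels and nurse satisfaction survey",
    "History of surgical instruments in the nineteenth century",
    "Dietary guidelines for school lunch programs",
]
LABELS = [1, 1, 1, -1, -1, -1]  # +1 = relevant, -1 = not relevant

STOPWORDS = {"a", "an", "and", "for", "in", "of", "the", "to"}


def tokenize(text):
    """Lowercase, keep alphabetic tokens, and drop a few stopwords."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOPWORDS]


def featurize(text):
    """Return an L2-normalized bag-of-words vector as a sparse dict."""
    counts = Counter(tokenize(text))
    norm = math.sqrt(sum(c * c for c in counts.values()))
    return {tok: c / norm for tok, c in counts.items()} if norm else {}


def dot(w, x):
    """Sparse dot product between a weight dict and a feature dict."""
    return sum(w.get(tok, 0.0) * v for tok, v in x.items())


def train_linear_svm(docs, labels, lam=0.01, epochs=50):
    """Pegasos-style subgradient descent on the L2-regularized hinge loss.

    Examples are visited in a fixed order so the result is deterministic.
    """
    vectors = [featurize(d) for d in docs]
    w, t = {}, 0
    for _ in range(epochs):
        for x, y in zip(vectors, labels):
            t += 1
            eta = 1.0 / (lam * t)          # decaying step size
            margin = y * dot(w, x)         # computed with the current weights
            for tok in w:                  # regularization shrink:
                w[tok] *= 1.0 - 1.0 / t    # 1 - eta * lam == 1 - 1/t
            if margin < 1.0:               # hinge loss is active for this example
                for tok, v in x.items():
                    w[tok] = w.get(tok, 0.0) + eta * y * v
    return w


def score(w, text):
    """Signed decision value; larger means more likely relevant."""
    return dot(w, featurize(text))


def predict(w, text):
    """Return +1 (relevant) if the score is positive, else -1."""
    return 1 if score(w, text) > 0 else -1


def rank(w, texts):
    """Order candidate texts from most to least relevant by SVM score."""
    return sorted(texts, key=lambda s: score(w, s), reverse=True)


if __name__ == "__main__":
    weights = train_linear_svm(DOCS, LABELS)
    for title in rank(weights, ["Nurse staffing survey in school",
                                "Gene signature predicts chemotherapy response"]):
        print(predict(weights, title), title)
```

In practice the same pattern would be applied to full MEDLINE titles and abstracts with a mature SVM library, and a separate classifier could be trained for each field of the annotation scheme. Evaluating on articles from a disease domain or annotator not seen in training is one way to check the kind of generalization the abstract reports.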