Computational Phenotyping and Phenome-wide Association Studies: Leveraging Machine Learning and Natural Language Processing to Understand Electronic Health Record Data
Teixeira, Pedro Luis, Jr.
The aims of this project are 1) to evaluate various data sources and algorithms for identifying hypertensive individuals within the electronic health record, and 2) to develop and evaluate a novel method for identifying associations between genotypes and natural language processing-based phenotypes extracted from the electronic health record. The author evaluated data sources and hypertension phenotyping algorithms using a set of 631 individuals whose hypertension status was determined by manual review of their electronic health record data. Combinations of data sources outperformed methods that used any single category of data. Random forest models trained with billing codes, medications, vital signs, and hypertension concept counts achieved a median AUC of 0.976. The best algorithms performed similarly at a second site. The author also developed a novel method for phenome-wide association studies using natural language processing-based phenotypes (NLP-PheWAS). Using 29,722 individuals with exome data, the author extracted 11,553 unique concepts from narrative text after filtering by negation, note section, and semantic type. The method replicated 43.7% of known, statistically powered associations from the National Human Genome Research Institute’s genome-wide association study (GWAS) catalog. NLP-PheWAS also identified two potentially novel associations among the SNPs studied: one between optic disc neovascularization and rs1497546, and one between Langerhans cell histiocytosis and rs7193343. NLP-PheWAS is a promising method for enabling rapid discovery, interpretation of novel associations, and increased understanding of genetic influences within the rapidly expanding narrative text of electronic health records.
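The general shape of a phenome-wide scan over text-derived concepts can be illustrated with a minimal sketch. It is not the author's implementation: the abstract does not state which statistical test or genetic model was used, so the Fisher exact test, the carrier versus non-carrier coding, and the Bonferroni correction below are all assumptions chosen for simplicity, and the names `fisher_exact_two_sided`, `phewas_scan`, `carriers`, and `concepts_by_person` are hypothetical.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of seeing x in the top-left cell.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Sum every table that is no more likely than the observed one.
    total = sum(p for p in (prob(x) for x in range(lo, hi + 1))
                if p <= p_obs * (1 + 1e-9))
    return min(1.0, total)


def phewas_scan(carriers, concepts_by_person, alpha=0.05):
    """Test one variant against every concept seen in the notes.

    carriers: dict of person id -> bool (assumed carrier coding).
    concepts_by_person: dict of person id -> set of concept names,
        assumed to be already filtered for negation and similar.
    Returns (concept, p_value, significant) tuples sorted by p-value.
    """
    all_concepts = sorted(set().union(*concepts_by_person.values()))
    if not all_concepts:
        return []
    threshold = alpha / len(all_concepts)  # Bonferroni (an assumption)
    results = []
    for concept in all_concepts:
        a = b = c = d = 0
        for pid, is_carrier in carriers.items():
            has = concept in concepts_by_person.get(pid, set())
            if is_carrier and has:
                a += 1
            elif is_carrier:
                b += 1
            elif has:
                c += 1
            else:
                d += 1
        p = fisher_exact_two_sided(a, b, c, d)
        results.append((concept, p, p < threshold))
    return sorted(results, key=lambda r: r[1])


# Toy, made-up data purely to exercise the functions.
toy_carriers = {"p1": True, "p2": True, "p3": True,
                "p4": False, "p5": False, "p6": False}
toy_concepts = {"p1": {"concept_a"}, "p2": {"concept_a"}, "p3": {"concept_a"},
                "p4": {"concept_b"}, "p5": {"concept_b"}, "p6": {"concept_b"}}
toy_results = phewas_scan(toy_carriers, toy_concepts)
```

A real analysis at the scale described (29,722 individuals and 11,553 concepts) would more likely use regression models that adjust for covariates such as age and sex; the exact test here only keeps the sketch dependency-free.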