Learning Clinical Data Representations for Machine Learning
Sulieman, Lina Mahmoud
The use of machine learning in healthcare has increased in recent years. Representing clinical data is the crux of machine learning. Learning informative features can improve the performance of trained models. This dissertation describes methods to learn representations for temporal and text data to improve machine learning results. Three data representations are discussed across three aims to tackle three biomedical informatics problems: 1) identifying patients at high risk of suffering from a negative outcome (readmission or death) to allocate intervention resources efficiently; 2) triaging patients’ messages and identifying their needs, which requires human and time resources; and 3) locating information about a phenotype in clinical documents, which requires human resources and increases information overload on healthcare providers. In the first aim, a representation leveraged post-discharge data to predict patients’ outcomes over one year after discharge. Training the outcome prediction model on both post-discharge and before-discharge data improved performance significantly compared with the model trained on before-discharge clinical data only. In the second aim, the dissertation describes methods to learn representations that incorporate the semantics and the context of words. These representations outperformed traditional features in identifying patients’ needs in portal messages sent to healthcare providers. The results demonstrate that machine learning models trained on these learned representations perform better than models trained on representations that lack semantic and contextual information. In the third aim, a deep learning model leveraged the contents of clinical documents and the billing codes to learn representations for sentences. The model used the representations to extract sentences that include phenotype information (i.e., relevant sentences) without using an annotated dataset.
The extraction model achieved higher performance than a comparable keyword-based extraction approach and KnowledgeMap, a clinical concept extraction tool. The representations described in this dissertation are extensible to other electronic medical records. The proposed models can learn new representations that improve clinical machine learning performance and can be applied to other medical informatics problems.