Synthetic Data Simulation for Privacy-Preserving Medical Data Sharing
The past decade has witnessed a dramatic rise in the adoption of electronic medical record (EMR) systems. EMR data assists in the development and evaluation of clinical information systems, supports novel biomedical research, and enables the development and refinement of clinical decision support technologies within and beyond the healthcare organization (HCO) that collected the information. As such, HCOs are incentivized, and sometimes required, to make such data more broadly available. However, concerns over patient privacy often limit EMR data sharing. Simulating synthetic EMR data offers an opportunity to resolve the tension between data sharing and patient privacy. When designed appropriately, synthetic datasets are expected to induce minimal privacy disclosure risk while maintaining utility similar to that of the original data upon which they are based, so that they can support hypothesis formulation and testing for precision medicine.

This dissertation introduces a data-driven pipeline for large-scale synthetic EMR simulation, which includes: 1) generative models to support synthetic data generation, 2) an evaluation system to assess the utility of synthetic data, and 3) a detailed investigation into the privacy implications of sharing synthetic data. Specifically, this dissertation introduces deep learning-based frameworks for generating static patient profiles and longitudinal medical records in both full-synthesis and partial-synthesis settings. The frameworks are evaluated with EMR data from Vanderbilt University Medical Center and the All of Us Research Program, a large, diverse cohort of participants from across the United States. The analysis in this dissertation shows that the resulting synthetic datasets exhibit statistics and predictive capabilities similar to those of the real datasets and are, for all intents and purposes, computationally indistinguishable from the real data.
Both fully synthetic patient profiles and fully synthetic longitudinal records are resistant to two forms of privacy attack: membership inference, whereby an adversary infers whether the data of target individuals were relied upon by the synthetic data generation process, and attribute inference, whereby an adversary infers sensitive information about the target individuals involved in generation. By contrast, partially synthetic longitudinal records are vulnerable to membership inference using state-of-the-art machine learning methods when the adversary has complete knowledge of the target individuals' records.
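To make the membership inference threat model concrete, the following is a minimal, hypothetical sketch of one common distance-based variant of the attack; it is an illustration under assumed names and thresholds, not the dissertation's actual method or results. The idea is that a target record is flagged as a suspected training member when some synthetic record lies unusually close to it.

```python
# Illustrative distance-based membership inference against a synthetic
# dataset. All function names, data, and the distance threshold are
# assumptions made for this sketch.
import numpy as np

def membership_scores(synthetic, targets):
    """For each target record, return the Euclidean distance to its
    nearest synthetic record; smaller distances suggest membership."""
    # Pairwise distances with shape (n_targets, n_synthetic)
    diffs = targets[:, None, :] - synthetic[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=2))
    return dists.min(axis=1)

def infer_membership(synthetic, targets, threshold):
    """Flag a target as a suspected training member when its nearest
    synthetic record lies within the chosen distance threshold."""
    return membership_scores(synthetic, targets) < threshold

# Toy example: synthetic records cluster around the training distribution.
rng = np.random.default_rng(0)
synthetic = rng.normal(0.0, 1.0, size=(500, 2))    # stand-in synthetic data
members = rng.normal(0.0, 1.0, size=(5, 2))        # drawn from the same distribution
non_members = rng.normal(10.0, 1.0, size=(5, 2))   # far from the training distribution

print(infer_membership(synthetic, members, threshold=0.5))
print(infer_membership(synthetic, non_members, threshold=0.5))
```

A synthesis pipeline resists this style of attack when nearest-synthetic-record distances for true training members are statistically indistinguishable from those of non-members drawn from the same population.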