Privacy Leaks and Efficient Countermeasures for Human Genetics and Machine Learning
Modern scientific investigations increasingly rely on the large-scale collection and analysis of data ("big data"). In the genetics community, there is evidence that greater statistical power can be achieved when genomic and phenotype data are shared beyond their initial points of collection and combined with other resources. In recognition of this opportunity, numerous initiatives, such as the database of Genotypes and Phenotypes (dbGaP), have been established to disseminate such data to a wide array of potential users. Meanwhile, the sensitive nature of genomic and phenotype data has raised serious privacy concerns, owing to risks such as the disclosure of personal identity and sensitive disease information. Heated debate over genetic privacy has led the community to act conservatively, restricting access to shared data.

This dissertation begins by introducing novel methods and findings for breaching the privacy of the individuals to whom genomic data correspond. In particular, it focuses on statistical inference methods that detect whether an individual participated in a genomic study and subsequently reveal that individual's exact phenotype (disease status or quantitative traits) using publicly accessible information. Recognizing that novel technical solutions could help thwart such attacks, this dissertation then introduces a collection of cryptographic methods that protect patient privacy while supporting statistical and machine learning models widely used in genetics, such as meta-analysis and logistic regression. Cryptographic solutions, however, are computationally intensive and significantly slower than their non-secure counterparts, which limits their prospects for real-world adoption.
Thus, as a final contribution, this dissertation proposes novel algorithms to accelerate cryptography-based machine learning. Specifically, it develops several distributed optimization methods that substantially accelerate privacy-preserving distributed machine learning and validates their efficiency and accuracy extensively on large-scale datasets. This work bridges the gap between distributed machine learning, optimization, and cryptography, and could serve as a drop-in replacement for many privacy-preserving methods proposed in genetics and machine learning research.