Clustering Rare Event Features to Increase Statistical Power
Sivley, Robert Michael
Rare genetic variation has been put forward as a major contributor to the development of disease; however, it is inherently difficult to associate rare variants with disease, because the low number of observations greatly reduces statistical power. Binning is a method that groups several variants together and merges them into a single feature, sacrificing resolution to increase statistical power. Binning strategies are applicable to rare variant analysis in any field, though their effectiveness depends on the method used to group variants. This thesis presents a flexible workflow for rare variant analysis, composed of five sequential steps: identification of rare variants, annotation of those variants, clustering of the variants, collapsing of the resulting clusters, and statistical analysis. There are no restrictions on which clustering algorithms are applied, so a review of the core clustering paradigms is provided as an introduction for readers unfamiliar with the field. Also presented is RVCLUST, an R package that facilitates all stages of the described workflow and provides a collection of interfaces to common clustering algorithms and statistical tests. The utility of RVCLUST is demonstrated in a genetic analysis of rare variants in gene regulatory regions and their effect on gene expression. The results of this analysis suggest that informed clustering is an effective alternative to existing strategies, discovering the same associations while avoiding the statistical complications introduced by other binning methods.
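
To make the collapsing step concrete, the following is a minimal sketch in Python of how clustered rare variants might be merged into a single feature per cluster. It is an illustration only and does not reproduce the RVCLUST interface: the function name `collapse_clusters`, the carrier-indicator collapsing rule, and the toy data are all assumptions introduced here.

```python
from collections import defaultdict


def collapse_clusters(genotypes, cluster_of):
    """Collapse per-variant genotypes into one carrier indicator per cluster.

    genotypes:  dict mapping variant id -> list of minor-allele counts,
                one entry per sample (hypothetical input layout).
    cluster_of: dict mapping variant id -> cluster label, as produced by
                any clustering step.
    Returns a dict mapping cluster label -> list of 0/1 flags, one per
    sample, where 1 means the sample carries at least one rare allele
    among the variants in that cluster.
    """
    # Group variant ids by their assigned cluster label.
    members = defaultdict(list)
    for variant, label in cluster_of.items():
        members[label].append(variant)

    collapsed = {}
    for label, variants in members.items():
        n_samples = len(genotypes[variants[0]])
        collapsed[label] = [
            int(any(genotypes[v][i] > 0 for v in variants))
            for i in range(n_samples)
        ]
    return collapsed


# Hypothetical toy data: three rare variants observed in five samples.
genotypes = {
    "v1": [0, 1, 0, 0, 0],
    "v2": [0, 0, 0, 1, 0],
    "v3": [0, 0, 0, 0, 1],
}
# Hypothetical cluster labels, e.g. from clustering on genomic position.
cluster_of = {"v1": "A", "v2": "A", "v3": "B"}

collapsed = collapse_clusters(genotypes, cluster_of)
```

In this toy example each variant has a single carrier, while the collapsed feature for cluster A has two. This is the trade described above: the individual variants are no longer distinguishable, but the merged feature has more observations available for a downstream statistical test.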