On Optimal Prediction Rules With Prospective Missingness and Bagged Empirical Null Inference in Large-Scale Data
Mercaldo, Sarah Fletcher
This dissertation consists of three papers related to missing data, prediction, and large-scale inference. The first paper defines the problem of obtaining predictions from an existing clinical risk prediction model when covariates are missing. We introduce the Pattern Mixture Kernel Submodel (PMKS), a set of submodels fit within each missing data pattern, which minimizes prediction error in the presence of missingness. PMKS is explored in simulations and a case study, where it outperforms standard simple and multiple imputation techniques. The second paper introduces the Bagged Empirical Null p-value, a new algorithm that combines the existing methodologies of bagging and empirical null techniques to identify important effects in massive high-dimensional data. We illustrate the approach using a famous leukemia gene example, where we uncover new findings that are supported by previously published benchwork, and we evaluate the algorithm’s performance in novel pseudo-simulations. The third paper gives recommendations for including the outcome in the imputation model during construction, validation, and application. We suggest including the outcome for imputation of missing covariate values only during model construction, to obtain unbiased parameter estimates. When the outcome is used in the imputation algorithm during the validation step, we show through simulation that the model prediction metrics are optimistically inflated and that the actual pragmatic model performance would be inferior to the validated results. While the three papers presented here provide foundations for missing data and large-scale inferential techniques, these ideas are applicable to a wide range of biomedical settings.
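The idea of fitting a separate submodel within each missing data pattern can be illustrated with a minimal sketch. This is not the dissertation's implementation: it assumes ordinary least-squares submodels, missing covariates encoded as `NaN`, and the hypothetical function names `fit_pattern_submodels` and `predict_pattern_submodels`. Each submodel uses only the covariates observed in its pattern, so no imputation is needed at prediction time.

```python
import numpy as np


def fit_pattern_submodels(X, y):
    """Fit one least-squares submodel per missing-data pattern.

    X is a 2-D float array in which np.nan marks a missing covariate.
    Returns a dict mapping each pattern (tuple of bools, True = observed)
    to a coefficient vector: intercept first, then the observed covariates.
    Assumption: linear submodels, chosen here only for illustration.
    """
    observed = ~np.isnan(X)
    models = {}
    for pattern in set(map(tuple, observed.tolist())):
        mask = np.array(pattern, dtype=bool)
        # Rows whose missingness pattern matches this one exactly.
        rows = (observed == mask).all(axis=1)
        # Design matrix: intercept plus the covariates observed in this pattern.
        design = np.column_stack([np.ones(rows.sum()), X[rows][:, mask]])
        coef, *_ = np.linalg.lstsq(design, y[rows], rcond=None)
        models[pattern] = coef
    return models


def predict_pattern_submodels(models, X_new):
    """Predict each row with the submodel matching its missingness pattern.

    Rows whose pattern was never seen during fitting get np.nan.
    """
    observed = ~np.isnan(X_new)
    preds = np.full(X_new.shape[0], np.nan)
    for i, row in enumerate(X_new):
        coef = models.get(tuple(observed[i].tolist()))
        if coef is None:
            continue  # no submodel was fit for this pattern
        preds[i] = coef[0] + row[observed[i]] @ coef[1:]
    return preds
```

In this sketch, a new observation missing a covariate is routed to the submodel trained on observations with the same pattern; how to handle patterns with too few training rows is left open here.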