Using observational data in healthcare research: New methods to design, conduct, and analyze efficient two-phase designs
Lotspeich, Sarah Camilla
Clinically meaningful variables are increasingly available in observational databases. However, these data can be error-prone, leading to misleading statistical inference. Data validation can help maintain data quality, but validating entire databases is often unrealistic. A cost-effective solution is the two-phase design: error-prone variables are observed for all patients during Phase I, and that information is used to select patients for validation (i.e., data auditing) during Phase II. In this dissertation, we propose methods to promote the practical and statistical efficiency of two-phase designs to ensure the integrity of observational cohort data.
First, given the resource constraints imposed on data audits, targeting the most informative patients is paramount for efficient statistical inference. Using the asymptotic variance of the maximum likelihood estimator, we compute the most efficient design under complex outcome and exposure misclassification. Since the optimal design depends on unknown parameters, we propose a multi-wave design to approximate it in practice. We demonstrate the superior efficiency of the optimal designs through extensive simulations and illustrate their implementation in observational HIV studies.
Second, sending trained auditors to sites (“travel-audits”) can be costly, particularly in a multi-national cohort, so we investigate the efficacy of training sites to conduct “self-audits.” In 2017, eight research groups audited a subset of their patient records, comparing abstracted research data to the original clinical source documents. Additionally, three sites were randomly selected for travel-audits. We found similar error rates between self- and travel-audits, suggesting that self-audits could be a lower-cost alternative for maintaining data quality.
Finally, to estimate odds ratios efficiently from partially audited, error-prone data, we propose a semiparametric analysis approach that uses all available information and accommodates many error mechanisms. The outcome and covariates can be error-prone, with correlated errors, and the selection of Phase II records can depend on Phase I data in an arbitrary manner. We devise an EM algorithm to obtain estimators that are consistent, asymptotically normal, and asymptotically efficient. We demonstrate the advantages of the proposed methods through extensive simulations and provide applications to a multi-national HIV cohort.
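The two-phase structure described in this abstract can be illustrated with a minimal sketch. It is not the optimal design derived in the dissertation: it assumes a simple balanced allocation across strata defined by the error-prone Phase I outcome and exposure, and the function name `select_phase_two` and its parameters are hypothetical, introduced only for illustration.

```python
import random


def select_phase_two(phase_one, n_audit, seed=0):
    """Choose Phase II (audit) records using only Phase I information.

    phase_one: dict mapping patient id -> (error-prone outcome, error-prone exposure)
    n_audit:   total number of records that can be validated
    """
    rng = random.Random(seed)

    # Group patients into strata by their error-prone Phase I values.
    strata = {}
    for pid, cell in phase_one.items():
        strata.setdefault(cell, []).append(pid)

    # Balanced allocation (an assumed stand-in, NOT the optimal design):
    # visit the smallest strata first so that any unused budget is passed
    # on to the larger strata.
    cells = sorted(strata, key=lambda c: len(strata[c]))
    selected = []
    remaining = n_audit
    for i, cell in enumerate(cells):
        quota = remaining // (len(cells) - i)
        take = min(quota, len(strata[cell]))
        selected.extend(rng.sample(strata[cell], take))
        remaining -= take
    return sorted(selected)
```

For example, with 100 Phase I records spread evenly over four strata, `select_phase_two(phase_one, 20)` would pick five records per stratum for validation. A design that targets the most informative patients would replace the balanced quota with stratum sizes chosen to minimize the variance of the estimator of interest.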