Improving Representation in Biomedical Datasets

dc.contributor.advisor: Malin, Bradley A
dc.creator: Borza, Victor Alexander
dc.date.accessioned: 2024-08-15T15:32:43Z
dc.date.created: 2024-08
dc.date.issued: 2024-06-27
dc.date.submitted: August 2024
dc.identifier.uri: http://hdl.handle.net/1803/19129
dc.description.abstract: Biomedical studies have long struggled to adequately include all groups in the populations they intend to study. Studies have tended to underrepresent groups who have been minoritized and marginalized, perpetuating and compounding existing health disparities. In addition to these downstream effects, a lack of representation can harm the validity of scientific studies and trust in the research enterprise. In this thesis, we introduce optimizable and computable measures of representativeness and diversity, which measure the similarity of the demographics of a research cohort to those of a target population and to a uniform distribution, respectively. We then use these measures to develop a method for representative subsampling from an existing database. We evaluated its efficacy by subsampling de-identified electronic health records from the Vanderbilt University Medical Center (VUMC) to mirror the demographics of the VUMC catchment area. Compared to cohorts drawn by random record selection, the cohorts selected by our method were 5.8 times more representative. Because starting with a more representative dataset increases statistical power and may improve the accuracy and fairness of resulting analyses compared to subsampling from an unrepresentative dataset, we extended our methodology to prospective participant recruitment across multiple sites. In simulated recruitment from the nine medical centers in the Stakeholders Technology and Research Clinical Research Network, our methods for adaptively allocating recruitment resources yielded a more representative study cohort than existing baseline methods. Moreover, recruitment from a combination of multiple sites yielded a more representative cohort than any single site could. After proving our methodology in these smaller-scale experiments, we extended the experiments to optimize recruitment resource allocation for both diversity and representation in the nationwide All of Us Research Program. Again, our simulation results show that both diversity and representation could be improved with adaptive resource allocation among recruitment sites, whether those sites represent geographic locations (three-digit ZIP codes) or hospital networks.
dc.format.mimetype: application/pdf
dc.language.iso: en
dc.subject: representation
dc.subject: diversity
dc.subject: electronic health records
dc.subject: recruitment
dc.subject: participant
dc.title: Improving Representation in Biomedical Datasets
dc.type: Thesis
dc.date.updated: 2024-08-15T15:32:44Z
dc.contributor.committeeMember: Clayton, Ellen W
dc.contributor.committeeMember: Sulieman, Lina
dc.contributor.committeeMember: Vorobeychik, Yevgeniy
dc.type.material: text
thesis.degree.name: MS
thesis.degree.level: Masters
thesis.degree.discipline: Biomedical Informatics
thesis.degree.grantor: Vanderbilt University Graduate School
local.embargo.terms: 2025-08-01
local.embargo.lift: 2025-08-01
dc.creator.orcid: 0000-0002-5807-3996

