Method for Haplotype Phasing with Natural Language Processing

Datar, Parth Abhijit

dc.contributor.advisor	Zhou, Xin (Maizie)
dc.contributor.advisor	Huo, Yuankai
dc.creator	Datar, Parth Abhijit
dc.date.accessioned	2022-05-19T18:26:42Z
dc.date.available	2022-05-19T18:26:42Z
dc.date.created	2022-05
dc.date.issued	2022-05-19
dc.date.submitted	May 2022
dc.identifier.uri	http://hdl.handle.net/1803/17478
dc.description.abstract	Structural variant detection is an important problem in computational genomics. Structural variants (SVs) are long DNA disruptions (e.g., insertions or deletions of more than 50 base pairs), and they are difficult to detect precisely with short-read sequencing technology. In this thesis, we propose a diploid assembly-based SV detection tool for HiFi long reads. A crucial step is partitioning the long reads into the two parental haplotypes for assembly, which is known as haplotype phasing. However, a substantial proportion of reads do not cover any heterozygous single nucleotide polymorphisms (SNPs), which are essential for haplotype phasing, so these reads cannot be assigned to either haplotype. Such unphased reads can cause problems, including imperfect assembly and decreased sensitivity in SV detection. To resolve this issue, we apply natural language processing (NLP) techniques to assign these reads to their most likely haplotype. To do so, we construct an analogous language for DNA reads from k-mers, windows of k consecutive base pairs: the k-mers form the words of the language, and the reads form the sentences. Using hashing algorithms such as locality-sensitive hashing, we are able to generate all k-mers for a read without greatly increasing vocabulary size or storage size. With the fastText library, we represent the long reads as lower-dimensional read embeddings through a skip-gram model with negative sampling. These read embeddings are then further reduced in dimension with t-SNE and clustered to assign their haplotypes. Our evaluation results show that assigning unphased reads through NLP techniques improves SV detection in HiFi long reads. We also observe a positional separation of DNA reads when visualizing the t-SNE data.
Our analysis includes parametrization of k-mer length and embedding dimension, alongside other parameters relevant to training. Our tool is modular, meaning that any of its subcomponents may be replaced with better models.
dc.format.mimetype	application/pdf
dc.language.iso	en
dc.subject	Natural Language Processing, Computational Genomics, Structural Variant Detection, Locality Sensitive Hashing, Word Embedding
dc.title	Method for Haplotype Phasing with Natural Language Processing
dc.type	Thesis
dc.date.updated	2022-05-19T18:26:42Z
dc.type.material	text
thesis.degree.name	MS
thesis.degree.level	Masters
thesis.degree.discipline	Computer Science
thesis.degree.grantor	Vanderbilt University Graduate School
dc.creator.orcid	0000-0002-5870-9237
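
The read-as-sentence idea described in the abstract can be illustrated with a minimal sketch. This is not the thesis code: the function names (`kmers`, `bucket`, `read_to_sentence`) and the `num_buckets` parameter are hypothetical, and a plain modular hash of an MD5 digest stands in for the locality-sensitive hashing the abstract mentions, only to show how hashing can bound vocabulary size.

```python
import hashlib


def kmers(read: str, k: int) -> list[str]:
    """Return all overlapping k-mers (the 'words') of a read (the 'sentence')."""
    if k <= 0:
        raise ValueError("k must be positive")
    # A read shorter than k yields an empty list.
    return [read[i:i + k] for i in range(len(read) - k + 1)]


def bucket(kmer: str, num_buckets: int) -> int:
    """Map a k-mer to one of num_buckets ids.

    Assumption: a simple modular hash, NOT locality-sensitive hashing.
    It only illustrates capping the vocabulary at num_buckets tokens.
    """
    digest = hashlib.md5(kmer.encode("ascii")).digest()
    return int.from_bytes(digest[:8], "big") % num_buckets


def read_to_sentence(read: str, k: int, num_buckets: int) -> str:
    """Turn a read into a space-separated sentence of hashed k-mer tokens."""
    return " ".join(f"b{bucket(w, num_buckets)}" for w in kmers(read, k))


if __name__ == "__main__":
    print(kmers("ACGTAC", 3))
    print(read_to_sentence("ACGTAC", k=3, num_buckets=16))
```

Sentences in this form, one read per line, are the kind of whitespace-tokenized text that a word-embedding library such as fastText takes as training input; the embedding, t-SNE, and clustering steps are not shown here.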

Files in this item

Name: DATAR-THESIS-2022.pdf
Size: 721.7Kb
Format: PDF

This item appears in the following Collection(s)

Electronic Theses and Dissertations
Electronic theses and dissertations of masters and doctoral students submitted to the Graduate School.