Method for Haplotype Phasing with Natural Language Processing

dc.contributor.advisor: Zhou, Xin (Maizie)
dc.contributor.advisor: Huo, Yuankai
dc.creator: Datar, Parth Abhijit 2022
dc.description.abstract: Structural variant detection is an important problem in computational genomics. Structural variants (SVs) involve long DNA disruptions (e.g., insertions or deletions of more than 50 base pairs) and are difficult to uncover precisely with short-read sequencing technology. In this thesis, we propose a diploid assembly-based SV detection tool for HiFi long reads. A crucial step is partitioning the long reads into the two parental haplotypes for assembly, also called haplotype phasing. However, a substantial proportion of reads do not cover any heterozygous single nucleotide polymorphisms (SNPs), which are important for haplotype phasing, so these reads cannot be assigned to either haplotype. Such unassigned reads are called unphased reads, and they can cause problems including imperfect assembly and decreased sensitivity in SV detection. To resolve this issue, we apply natural language processing (NLP) techniques to assign those reads to their most likely haplotype. To do so, we create an analogous language for DNA reads from constituent windows of k base pairs, called k-mers. The k-mers form the words of our language, and the reads form the sentences. Through hashing algorithms such as locality-sensitive hashing, we are able to generate all k-mers for a read without greatly increasing vocabulary size or storage size. Using the fastText library, we represent these long reads as lower-dimensional read embeddings through a skip-gram model with negative sampling. The read embeddings are then reduced in dimension with t-SNE and clustered in order to assign their haplotype. Our evaluation results show that assigning unphased reads through NLP techniques improves SV detection in HiFi long reads. We also observe a positional separation of DNA reads when visualizing the t-SNE data.
Our analysis includes parametrization of k-mer length and embedding dimension, alongside other parameters relevant to training. Our method is modular, meaning that any of its subcomponents may be replaced with better models.
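As an illustration of the "k-mers as words, reads as sentences" idea described in the abstract, the following is a minimal sketch and not the thesis implementation. The function names (kmers, bucket, read_to_sentence) and the default values k=5 and num_buckets=65536 are hypothetical choices made for this example. It also swaps in plain modular hashing of an MD5 digest to bound the vocabulary size; this is a simpler stand-in for the locality-sensitive hashing named in the abstract.

```python
import hashlib


def kmers(read: str, k: int) -> list[str]:
    """Return all overlapping k-mers (the 'words') of a read (the 'sentence')."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]


def bucket(kmer: str, num_buckets: int) -> int:
    """Map a k-mer to one of num_buckets vocabulary slots.

    Assumption: plain modular hashing of a stable digest, used here only to
    cap vocabulary size. It is NOT locality-sensitive hashing.
    """
    digest = hashlib.md5(kmer.encode("ascii")).digest()
    return int.from_bytes(digest[:8], "big") % num_buckets


def read_to_sentence(read: str, k: int = 5, num_buckets: int = 1 << 16) -> str:
    """Turn a DNA read into a space-separated sentence of hashed k-mer tokens."""
    return " ".join(f"b{bucket(w, num_buckets)}" for w in kmers(read, k))


if __name__ == "__main__":
    # A toy read; real HiFi reads are thousands of bases long.
    print(kmers("ACGTAC", 3))
    print(read_to_sentence("ACGTACGT", k=5))
```

Sentences produced this way could be written one per line to a text file and passed to a skip-gram trainer such as fastText; the resulting word vectors would then need to be aggregated per read before dimensionality reduction and clustering.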
dc.subject: Natural Language Processing, Computational Genomics, Structural Variant Detection, Locality Sensitive Hashing, Word Embedding
dc.title: Method for Haplotype Phasing with Natural Language Processing
dc.type.material: text Science University Graduate School
