Inferring Genomic Sequences

02/15/2011 12:30 pm
02/15/2011 2:00 pm
Category: 
Ph.D. Dissertation Defense
Advisor: 
Dr. Alexander Zelikovsky

In this dissertation, we address two genomic inference problems: (1) assembling viral quasispecies sequences and estimating their frequencies from ultra-deep sequencing data, and (2) inferring the allelic values of single nucleotide polymorphisms (SNPs) from a chosen set of informative (tag) SNPs.

Ultra-deep sequencing, as provided by the 454 Life Sciences system, is a promising technology for Hepatitis C virus (HCV) quasispecies analysis. Since the 454 Life Sciences system was originally designed for DNA assembly, a new software tool is needed to assemble sequenced reads into multiple sequences. We develop efficient algorithmic techniques for assembling viral quasispecies sequences from 454 Life Sciences reads and estimating their frequencies. The proposed Viral Spectrum Assembler (ViSpA) includes (1) handling of contaminated reads and of overlaps with partial agreement between reads, (2) maximum bandwidth path selection and mutation-based clustering for haplotype assembly, and (3) frequency estimation via the expectation-maximization method. ViSpA has been compared with the state-of-the-art tool ShoRAH on simulated reads and on real 454 pyrosequencing shotgun reads from HCV and HIV. Experimental results show that ViSpA outperforms ShoRAH in assembly. On an HCV dataset, ViSpA reconstructs the 10 most frequent sequences, each of which represents a viable protein, and the most frequent sequence is within 1% of the actual ORF obtained by cloning the quasispecies; ShoRAH was able to reconstruct only one sequence that represents a viable protein. On an HIV dataset, ViSpA correctly reconstructs 3 quasispecies without any mismatches, whereas ShoRAH reconstructs 2 quasispecies with at most 4 mismatches.
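
As an illustration of step (3), the following is a minimal sketch of expectation-maximization frequency estimation. It is not ViSpA's implementation: the function name em_frequencies and the compat matrix (compat[r][q] is an assumed weight for read r originating from candidate sequence q, here simply 1 if the read is consistent with the candidate and 0 otherwise) are hypothetical choices made for this example.

```python
def em_frequencies(compat, n_iter=200, tol=1e-9):
    """Estimate candidate frequencies from a read/candidate weight matrix.

    compat[r][q] is an assumed weight that read r came from candidate q
    (hypothetical input format chosen for this sketch).
    """
    n_cand = len(compat[0])
    freqs = [1.0 / n_cand] * n_cand  # start from uniform frequencies
    for _ in range(n_iter):
        counts = [0.0] * n_cand
        # E-step: split each read among candidates in proportion to
        # (current frequency) x (compatibility weight).
        for row in compat:
            total = sum(f * w for f, w in zip(freqs, row))
            if total == 0.0:
                continue  # read matches no candidate; ignore it
            for q in range(n_cand):
                counts[q] += freqs[q] * row[q] / total
        assigned = sum(counts)
        if assigned == 0.0:
            return freqs  # no read could be assigned to any candidate
        # M-step: new frequency = expected share of the assigned reads.
        new_freqs = [c / assigned for c in counts]
        delta = max(abs(a - b) for a, b in zip(new_freqs, freqs))
        freqs = new_freqs
        if delta < tol:
            break
    return freqs


# Three reads unique to candidate 0, one unique to candidate 1,
# and one ambiguous read consistent with both.
reads = [[1, 0], [1, 0], [1, 0], [0, 1], [1, 1]]
estimated = em_frequencies(reads)
```

In this toy input the ambiguous read is shared between the two candidates according to their current estimated frequencies, so the estimate settles where the unique reads alone would place it.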

The main obstacles in genome-wide disease association studies are the high cost of genotyping and the computational complexity of the analysis. Tagging reduces cost, since only tag SNPs are genotyped, or alternatively reduces the complexity of the analysis, since the data set is smaller. We explore the trade-off between the number of tags and overfitting, and propose an efficient heuristic for finding a minimum number of tags when the variation of each untagged SNP can be represented by at most two tags.
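
To make the two-tag setting concrete, here is a small greedy sketch. It is not the heuristic proposed in the dissertation: it assumes, for illustration only, that an untagged SNP counts as represented when its allele is uniquely determined by one or two tag SNPs across the sample haplotypes, and it ignores overfitting. The names determines, covered, and greedy_tags are hypothetical.

```python
from itertools import combinations


def determines(haps, tags, target):
    """True if the alleles at `tags` uniquely fix the allele at `target`."""
    seen = {}
    for h in haps:
        key = tuple(h[t] for t in tags)
        if seen.setdefault(key, h[target]) != h[target]:
            return False
    return True


def covered(haps, tags, target):
    """A SNP is covered if it is a tag or is determined by at most two tags."""
    if target in tags:
        return True
    subsets = [(t,) for t in tags] + list(combinations(tags, 2))
    return any(determines(haps, s, target) for s in subsets)


def greedy_tags(haps):
    """Repeatedly add the SNP that covers the most SNPs until all are covered."""
    n_snps = len(haps[0])
    tags = []

    def n_covered(candidate_tags):
        return sum(covered(haps, candidate_tags, s) for s in range(n_snps))

    while n_covered(tags) < n_snps:
        untagged = [s for s in range(n_snps) if s not in tags]
        tags.append(max(untagged, key=lambda s: n_covered(tags + [s])))
    return tags


# Toy haplotypes: SNP 2 copies SNP 0, and SNP 3 is the XOR of SNPs 0 and 1,
# so SNP 3 can only be represented by a pair of tags.
haplotypes = [(0, 0, 0, 0), (0, 1, 0, 1), (1, 0, 1, 1), (1, 1, 1, 0)]
chosen = greedy_tags(haplotypes)
```

The loop always terminates because every added tag covers at least itself. Requiring exact determination on the sample is the simplest possible criterion; a statistical predictor with held-out validation would be needed to study the overfitting trade-off mentioned above.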

Committee
Dr. Alexander Zelikovsky (chair)
Dr. Yi Pan
Dr. Robert Harrison
Dr. Yury Khudyakov

Department Conference Room