BIOINFORMATICS SOFTWARE: here

RESEARCH PROJECTS (ongoing and future plans)

Haplotype Inference

In general, it is costly to examine the two copies of a chromosome separately, so genotype rather than haplotype data is what is available, even though haplotype data is of greatest use. This creates a great need for computational methods that extract haplotype information from genotype information. Inferring haplotypes from genotypes is called the Phasing Problem.

We propose a new linear algebra-based method which drastically reduces the number of sites in the original data. After solving the reduced instance, linear decoding allows for the recovery of the original haplotypes. Experiments show that our method significantly speeds up popular haplotype inference tools while finding almost the same solution in nearly every case, thus not compromising the quality of the known haplotype inference methods.

We also suggest a new approach to the Phasing Problem which is based on bicoloring an associated weighted graph. Comparison with existing haplotype inference tools such as HAPLOTYPER, PHASE, and GERBIL shows that our method is the fastest and achieves similar results.

Haplotype/Genotype Tagging

Constructing a complete human haplotype map is helpful when associating complex diseases with their related SNPs. Unfortunately, the number of SNPs is very large and it is costly to sequence many individuals. Therefore, it is desirable to reduce the number of SNPs that should be sequenced to a small number of informative representatives called tag SNPs.

We propose a new linear algebra-based method for selecting and using tag SNPs. Our method is purely combinatorial and can be combined with linkage disequilibrium (LD) and block-based methods. We measure the quality of our tag SNP selection algorithm by comparing actual SNPs with SNPs predicted from the selected linearly independent tag SNPs. For example, our experiments show that for long haplotypes (>25,000 SNPs), knowing only 0.4% of all SNPs, our method predicts an unknown haplotype with 98% accuracy, with the prediction based on 10% of the sample population. Comparison with existing predictive tagging methods shows that our method achieves better accuracy using fewer tag SNPs.

Disease Susceptibility

Recent improvements in the accessibility of high-throughput genotyping have brought a great deal of attention to disease association and susceptibility studies. Current statistical methods are not powerful enough to predict susceptibility to complex diseases with high confidence. We explore the possibility of applying combinatorial methods to disease susceptibility prediction.

The proposed combinatorial methods, as well as standard statistical methods, are applied to publicly available genotype data on Crohn's disease and autoimmune disorders to predict susceptibility to these diseases. The quality of a susceptibility prediction algorithm is assessed using leave-one-out and leave-many-out tests: the disease status of one or several individuals is hidden from the algorithm, predicted, and then compared to the actual disease status. The best prediction rate achieved by the proposed algorithms is 77.78% for Crohn's disease and 64.99% for autoimmune disorders.

RELATED COURSES

· CSC 8980 Algorithms and Data Mining in Bioinformatics, Fall 2005, grade: A
· BIO 6640 Fundamentals of Bioinformatics, Spring 2005, grade: A
· MATH 8150 Graph Theory, Spring 2005, grade: A
· CSC 8550 Advanced Algorithms, Spring 2004, grade: A
· CSC 6350 Software Engineering, Fall 2002, grade: A
· CSC 6520 Design and Analysis of Algorithms, Spring 2001, grade: A
· SYE 8200 Introduction to Statistics, Spring 1999, grade: A
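
As an illustration only, the leave-one-out test mentioned under Disease Susceptibility can be sketched as follows. The nearest-neighbor predictor by Hamming distance is a stand-in assumption, not the combinatorial method proposed above. The function names and the 0/1/2 genotype encoding are hypothetical as well.

```python
from typing import Callable, List, Sequence

Genotype = Sequence[int]  # assumed encoding: one value (0, 1, or 2) per SNP site


def hamming(a: Genotype, b: Genotype) -> int:
    """Number of SNP sites at which two genotypes differ."""
    return sum(x != y for x, y in zip(a, b))


def nearest_neighbor_predict(
    train_x: List[Genotype], train_y: List[int], query: Genotype
) -> int:
    """Stand-in predictor (an assumption): copy the disease status of the
    training individual whose genotype is closest to the query."""
    best = min(range(len(train_x)), key=lambda i: hamming(train_x[i], query))
    return train_y[best]


def leave_one_out_accuracy(
    genotypes: Sequence[Genotype],
    statuses: Sequence[int],
    predict: Callable[[List[Genotype], List[int], Genotype], int] = nearest_neighbor_predict,
) -> float:
    """Hide each individual's status in turn, predict it from the remaining
    individuals, and return the fraction of correct predictions."""
    xs, ys = list(genotypes), list(statuses)
    correct = 0
    for i in range(len(xs)):
        train_x = xs[:i] + xs[i + 1:]
        train_y = ys[:i] + ys[i + 1:]
        if predict(train_x, train_y, xs[i]) == ys[i]:
            correct += 1
    return correct / len(xs)
```

Any other predictor with the same signature can be passed as `predict`. A leave-many-out test follows the same pattern, hiding several individuals per round instead of one.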