Research Interests:

  • Algorithms and Data Mining in Bioinformatics: Developing tools for faster and more accurate SNP and haplotype analysis. These tools are used in genetic epidemiology studies.
  • Distributed Computing and Databases: Middleware for wireless personal communication devices (iPAQs), distributed and mobile computing systems.
  • Machine Learning

BIOINFORMATICS SOFTWARE: here

RESEARCH PROJECTS (ongoing and planned)

Haplotype Inference

Examining the two copies of a chromosome separately is generally costly, so genotype data rather than haplotype data is usually what is available, even though haplotype data is of greatest use. This creates a strong need for computational methods that extract haplotype information from genotype information. Inferring haplotypes from genotypes is called the Phasing Problem. We propose a new linear algebra-based method that drastically reduces the number of sites in the original data. After a reduced instance is solved, linear decoding recovers the original haplotypes. Experiments show that our method significantly speeds up popular haplotype inference tools while finding almost the same solution in nearly every case, and so does not compromise the quality of known haplotype inference methods. We also suggest a new approach to the Phasing Problem based on bicoloring an associated weighted graph. Comparison with existing haplotype inference tools such as HAPLOTYPER, PHASE, and GERBIL shows that our method is the fastest and achieves similar results.
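
To make the Phasing Problem concrete, the following minimal Python sketch enumerates every haplotype pair that can explain a single genotype. It is only an illustration of why phasing is ambiguous, not the linear algebra or graph bicoloring method described above. The encoding (0 and 1 for homozygous sites, 2 for a heterozygous site) and the function name `compatible_haplotype_pairs` are assumptions made for this example.

```python
from itertools import product


def compatible_haplotype_pairs(genotype):
    """Enumerate the unordered haplotype pairs that explain one genotype.

    Assumed encoding for this sketch: 0 or 1 marks a homozygous site
    (both haplotypes carry that allele), 2 marks a heterozygous site
    (the two haplotypes differ there).
    """
    het_sites = [i for i, g in enumerate(genotype) if g == 2]
    if not het_sites:
        # No heterozygous site: the genotype has exactly one explanation.
        h = tuple(genotype)
        return [(h, h)]

    # Fix the first heterozygous site to 0 in the first haplotype so that
    # (h1, h2) and (h2, h1) are not both reported.
    first, rest = het_sites[0], het_sites[1:]
    pairs = []
    for bits in product((0, 1), repeat=len(rest)):
        h1, h2 = list(genotype), list(genotype)
        h1[first], h2[first] = 0, 1
        for site, b in zip(rest, bits):
            h1[site], h2[site] = b, 1 - b
        pairs.append((tuple(h1), tuple(h2)))
    return pairs


if __name__ == "__main__":
    for pair in compatible_haplotype_pairs([0, 2, 1, 2]):
        print(pair)
```

A genotype with k heterozygous sites has 2^(k-1) such pairs, so the genotype in the usage example (two heterozygous sites) has two explanations. This growth is why inference methods need population-level information to choose among candidates.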

Haplotype/Genotype Tagging

Constructing a complete human haplotype map is helpful when associating complex diseases with their related SNPs. Unfortunately, the number of SNPs is very large and it is costly to sequence many individuals. Therefore, it is desirable to reduce the number of SNPs that must be sequenced to a small number of informative representatives called tag SNPs. We propose a new linear algebra-based method for selecting and using tag SNPs. Our method is purely combinatorial and can be combined with linkage disequilibrium (LD) and block-based methods. We measure the quality of our tag SNP selection algorithm by comparing actual SNPs with SNPs predicted from the selected linearly independent tag SNPs. For example, our experiments show that for long haplotypes (>25000 SNPs), knowing only 0.4% of all SNPs, our method predicts an unknown haplotype with 98% accuracy, with the prediction based on 10% of the sample population. Comparison with existing predictive tagging methods shows that our method achieves better accuracy using fewer tag SNPs.
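
The idea of predicting SNPs from linearly independent tag SNPs can be illustrated with a small Python sketch. This is not the selection algorithm used in the work above. It is a simplified stand-in that assumes a greedy left-to-right choice of independent columns, exact rational arithmetic, and rounding of each predicted value to the nearest allele. The names `select_tags` and `predict` are hypothetical.

```python
from fractions import Fraction


def select_tags(columns):
    """Greedily pick linearly independent SNP columns (assumed strategy).

    columns: list of SNP columns, each a list of 0/1 alleles over the
    sample haplotypes. Returns (tags, coeffs), where tags lists the chosen
    column indices and coeffs[c] maps tag index -> weight for each
    non-tag column c, so that column c equals the weighted sum of tags.
    """
    basis = []  # (reduced vector, pivot position, expression over tags)
    tags, coeffs = [], {}
    for c, col in enumerate(columns):
        r = [Fraction(x) for x in col]
        expr = {}
        for vec, pivot, vec_expr in basis:
            if r[pivot] != 0:
                f = r[pivot] / vec[pivot]
                r = [a - f * b for a, b in zip(r, vec)]
                for t, w in vec_expr.items():
                    expr[t] = expr.get(t, Fraction(0)) + f * w
        pivot = next((i for i, x in enumerate(r) if x != 0), None)
        if pivot is None:
            coeffs[c] = expr  # column c depends on earlier tag columns
        else:
            # r = column c minus a combination of earlier tags.
            new_expr = {t: -w for t, w in expr.items()}
            new_expr[c] = Fraction(1)
            basis.append((r, pivot, new_expr))
            tags.append(c)
    return tags, coeffs


def predict(tag_values, coeffs, n_snps):
    """Fill in a haplotype from its tag alleles (rounding is an assumption)."""
    hap = [None] * n_snps
    for t, v in tag_values.items():
        hap[t] = v
    for c, expr in coeffs.items():
        value = sum(w * tag_values[t] for t, w in expr.items())
        hap[c] = 1 if value >= Fraction(1, 2) else 0
    return hap


if __name__ == "__main__":
    # Five SNP columns over four sample haplotypes (toy data).
    cols = [[1, 1, 0, 0], [0, 1, 1, 0], [1, 1, 1, 1],
            [0, 0, 1, 1], [1, 0, 0, 1]]
    tags, coeffs = select_tags(cols)
    print(tags)
    print(predict({0: 1, 1: 0, 2: 1}, coeffs, len(cols)))
```

In the toy data the last two columns are linear combinations of the first three, so only three SNPs need to be typed to recover all five. Real data is noisy, which is why prediction accuracy rather than exact recovery is the natural quality measure.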

Disease Susceptibility

Recent improvements in the accessibility of high-throughput genotyping have brought a great deal of attention to disease association and susceptibility studies. Current statistical methods are not powerful enough to predict susceptibility to complex diseases with high confidence. We explore the possibility of applying combinatorial methods to disease susceptibility prediction. The proposed combinatorial methods, as well as standard statistical methods, are applied to publicly available genotype data on Crohn's disease and autoimmune disorders to predict susceptibility to these diseases. The quality of each susceptibility prediction algorithm is assessed using leave-one-out and leave-many-out tests: the disease status of one or several individuals is hidden from the algorithm, predicted, and then compared with their actual disease status. The best prediction rates achieved by the proposed algorithms are 77.78% for Crohn's disease and 64.99% for autoimmune disorders.
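
The leave-one-out test described above can be sketched in a few lines of Python. The predictor used here, a nearest-neighbor rule under Hamming distance, is only a placeholder chosen for brevity. It is not one of the proposed combinatorial methods, and the function names and toy genotypes are assumptions for illustration.

```python
def hamming(a, b):
    """Number of sites at which two genotypes differ."""
    return sum(x != y for x, y in zip(a, b))


def nearest_neighbor_predict(train, query):
    """Placeholder predictor: status of the closest training genotype."""
    _, status = min(train, key=lambda item: hamming(item[0], query))
    return status


def leave_one_out_accuracy(samples, predict=nearest_neighbor_predict):
    """Hide each individual's status in turn and try to predict it.

    samples: list of (genotype, status) pairs, status 1 = case, 0 = control.
    Returns the fraction of individuals whose status is predicted correctly.
    """
    correct = 0
    for i, (genotype, status) in enumerate(samples):
        train = samples[:i] + samples[i + 1:]  # everyone except individual i
        if predict(train, genotype) == status:
            correct += 1
    return correct / len(samples)


if __name__ == "__main__":
    toy = [((0, 0, 1, 2), 1), ((0, 0, 1, 1), 1),
           ((2, 2, 0, 0), 0), ((2, 1, 0, 0), 0)]
    print(leave_one_out_accuracy(toy))
```

A leave-many-out test follows the same pattern, except that a group of individuals is removed from the training set and predicted together in each round.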

RELATED COURSES

       GEORGIA STATE UNIVERSITY, Atlanta
       Ph.D./M.S. Computer Science, 2006/2002     GPA: 4.00/4.00

·       CSC 8980 Algorithms and Data Mining in Bioinformatics, Fall 2005, grade: A

·       BIO 6640 Fundamentals of Bioinformatics, Spring 2005, grade: A

·       MATH 8150 Graph Theory, Spring 2005, grade: A

·       CSC 8550 Advanced Algorithms, Spring 2004, grade: A

·       CSC 6350 Software Engineering, Fall 2002, grade: A

·       CSC 6520 Design and Analysis of Algorithms, Spring 2001, grade: A

·       SYE 8200 Introduction to Statistics, Spring 1999, grade: A

 

     


Author: Jim Jingwu He