By applying data mining techniques and machine learning, we have identified correlations between sequence patterns and local 3D structural patterns, called I-sites. These correlations can be used to predict a protein's three-dimensional structure, either in short fragments or globally using simulations. Molecular dynamics simulations have been used to explain the stability of some short sequences.

We have developed a hidden Markov model called HMMSTR to describe the grammatical structure of conserved sequence patterns within proteins in general. The model can be used to predict local three-dimensional structure and secondary structure, to identify protein-coding ORFs, or to design a sequence to fit a structure. The two bioinformatic models described above (I-sites and HMMSTR) constitute the first two steps (initiation and propagation) in a five-part hierarchical statistical model for protein folding pathways. The remaining steps (condensation, molten globule, and side-chain packing) are the subjects of ongoing efforts.
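The idea of reading local structure off a sequence with a hidden Markov model can be illustrated with a minimal Viterbi decoding sketch. This is not HMMSTR itself: the two states ("H" for helix-like, "L" for loop-like), the three-letter alphabet, and every probability below are hypothetical values chosen only to make the example run.

```python
import math

def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return the most probable state path for the observations (log-space)."""
    # best[t][s]: log-probability of the best path that ends in state s at position t
    best = [{s: math.log(start_p[s]) + math.log(emit_p[s][observations[0]])
             for s in states}]
    back = [{}]  # back[t][s]: predecessor of s on that best path
    for t in range(1, len(observations)):
        best.append({})
        back.append({})
        for s in states:
            prev, score = max(
                ((p, best[t - 1][p] + math.log(trans_p[p][s])) for p in states),
                key=lambda pair: pair[1],
            )
            best[t][s] = score + math.log(emit_p[s][observations[t]])
            back[t][s] = prev
    # Trace back from the best final state
    path = [max(states, key=lambda s: best[-1][s])]
    for t in range(len(observations) - 1, 0, -1):
        path.append(back[t][path[-1]])
    path.reverse()
    return path

# Hypothetical toy model; none of these numbers come from HMMSTR.
STATES = ("H", "L")
START = {"H": 0.5, "L": 0.5}
TRANS = {"H": {"H": 0.9, "L": 0.1}, "L": {"H": 0.2, "L": 0.8}}
EMIT = {"H": {"A": 0.80, "G": 0.15, "P": 0.05},
        "L": {"A": 0.20, "G": 0.40, "P": 0.40}}

path = viterbi("AAAGPG", STATES, START, TRANS, EMIT)
print("".join(path))
```

A model of realistic size would have many more states and a 20-letter amino acid alphabet, but the decoding step keeps the same shape.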

To model the third step, condensation, we have developed a knowledge-based potential that predicts inter-residue contacts in proteins. This research direction led to studies into the feasibility of protein structure prediction in two dimensions, the contact map approach. Investigations into knowledge-based potentials continue with the development of the first sequence-dependent backbone angle and hydrogen-bonding potentials for use in simplified molecular dynamics simulations.
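For readers unfamiliar with contact maps, the two-dimensional representation itself can be sketched as follows. This computes a map from known coordinates; it does not predict one, and it is not the knowledge-based potential described above. The 8 angstrom cutoff, the minimum sequence separation of 3, and the toy coordinates are assumptions made for illustration.

```python
import math

def contact_map(coords, cutoff=8.0, min_separation=3):
    """Return a symmetric boolean matrix of residue contacts.

    Entry [i][j] is True when residues i and j are at least
    min_separation apart in sequence and their coordinates lie
    within cutoff angstroms of each other.
    """
    n = len(coords)
    cmap = [[False] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + min_separation, n):
            if math.dist(coords[i], coords[j]) <= cutoff:
                cmap[i][j] = cmap[j][i] = True
    return cmap

# Hypothetical C-alpha coordinates (angstroms) for a 5-residue toy chain
# that folds back on itself; not taken from any real structure.
toy_coords = [
    (0.0, 0.0, 0.0),
    (3.8, 0.0, 0.0),
    (7.6, 0.0, 0.0),
    (7.6, 3.8, 0.0),
    (3.8, 3.8, 0.0),
]

cmap = contact_map(toy_coords)
for row in cmap:
    print("".join("#" if contact else "." for contact in row))
```

A prediction method works in the opposite direction: it estimates this matrix from sequence alone, and the three-dimensional structure is then inferred from the predicted contacts.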

To model the fourth step, we have developed a state-of-the-art method for non-sequential structural alignment of proteins, which has led to SCALI, a database of protein "cores". In future work, each core in the SCALI database will be modeled as a hidden Markov model (HMM) for the prediction of protein core structures, addressing the "molten globule" state of folding.

To model the final stages of folding, we have developed a mechanistic model called GeoFold.

Contact map predictions can be viewed by browsing the HMMSTR-CM database, which is under construction. Predictions can be made for any protein sequence using the HMMSTR server.
