Bystroff Publications with abstracts
The use of cell-cell communication or "quorum sensing (QS)" elements from Gram-negative Proteobacteria has enabled synthetic biologists to begin engineering systems composed of multiple interacting organisms. However, additional tools are necessary if we are to progress towards synthetic microbial consortia that exhibit more complex, dynamic behaviors. EsaR from Pantoea stewartii subsp. stewartii is a QS regulator that binds to DNA as an apo-protein, and releases the DNA when it binds to its cognate signal molecule, 3-oxohexanoyl-homoserine lactone (3OC6HSL). In the absence of 3OC6HSL, EsaR binds to DNA and can act as either an activator or a repressor of transcription. Gene expression from PesaR, which is repressed by wild-type EsaR, requires 100- to 1000-fold higher concentrations of signal than commonly used QS activators, such as LuxR and LasR. Here we have identified EsaR variants with increased sensitivity to 3OC6HSL using directed evolution and a dual ON/OFF screening strategy. Although we targeted EsaR-dependent derepression of PesaR, our EsaR variants also showed increased 3OC6HSL-sensitivity at a second promoter, PesaS, which is activated by EsaR in the absence of 3OC6HSL. Here, the increase in AHL sensitivity led to gene expression being turned off at lower concentrations of 3OC6HSL. Overall, we have increased the signal sensitivity of EsaR more than 70-fold and generated a set of EsaR variants that recognize 3OC6HSL concentrations ranging over four orders of magnitude. QS-dependent transcriptional regulators that bind to DNA and are active in the absence of a QS signal represent a new set of tools for engineering cell-cell communication-dependent gene expression.
Nature possesses a secret formula for the energy as a function of the structure of a protein. In protein design, approximations are made to both the structural representation of the molecule and to the form of the energy equation, such that the existence of a general energy function for proteins is by no means guaranteed. Here we present new insights towards the application of machine learning to the problem of finding a general energy function for protein design. Machine learning requires the definition of an objective function, which carries with it the implied definition of success in protein design. We explored four functions, consisting of two functional forms, each with two criteria for success. Optimization was carried out by a Monte Carlo search through the space of all variable parameters. Cross-validation of the optimized energy function against a test set gave significantly different results depending on the choice of objective function, pointing to relative correctness of the built-in assumptions. Novel energy cross-terms correct for the observed non-additivity of energy terms and an imbalance in the distribution of predicted amino acids.
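The Monte Carlo search over variable parameters described above can be illustrated with a minimal sketch. The abstract does not give the objective functions, the move set, or the acceptance rule, so everything here is an assumption made for illustration: a Metropolis acceptance criterion, a hypothetical quadratic `toy_objective`, and the names `monte_carlo_optimize`, `step_size` and `temperature`.

```python
import math
import random


def monte_carlo_optimize(objective, start, steps=2000, step_size=0.1,
                         temperature=0.05, seed=0):
    """Metropolis Monte Carlo search that minimizes objective(parameters).

    All settings are illustrative assumptions, not values from the paper.
    """
    rng = random.Random(seed)
    current = list(start)
    current_score = objective(current)
    best, best_score = list(current), current_score
    for _ in range(steps):
        # Perturb one randomly chosen parameter.
        trial = list(current)
        i = rng.randrange(len(trial))
        trial[i] += rng.uniform(-step_size, step_size)
        trial_score = objective(trial)
        delta = trial_score - current_score
        # Always accept improvements; accept worse moves with Boltzmann probability.
        if delta <= 0 or rng.random() < math.exp(-delta / temperature):
            current, current_score = trial, trial_score
            if current_score < best_score:
                best, best_score = list(current), current_score
    return best, best_score


def toy_objective(weights):
    """Hypothetical stand-in objective: squared distance to a fixed weight vector."""
    target = [1.0, -0.5, 2.0]
    return sum((w - t) ** 2 for w, t in zip(weights, target))


best_weights, best_score = monte_carlo_optimize(toy_objective, [0.0, 0.0, 0.0])
print(best_weights, best_score)
```

In a real setting the objective would score how well a parameterized energy function recovers native sequences on a training set, and the result would be cross-validated on a held-out test set as the abstract describes.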
Green fluorescent protein (GFP) has journeyed far from its role in nature as a cofactor in the jellyfish light organ, having been bioengineered and re-engineered over three decades to take on a wide variety of service roles in molecular imaging and sensing. In this chapter, we explore the ways GFP has been used as a biomarker and biosensor, its capabilities, its strengths and weaknesses, and its potential for future applications. To begin, we will review what is known about the GFP structure, its extreme kinetic stability and its very slow and multiphasic folding kinetics. Biophysical characteristics will be covered, including the chemical and structural requirements for the autocatalyzed maturation of the integral fluorescent chromophore, its excitation/emission spectra, and its variety of engineered emission wavelengths. Efforts in protein engineering have produced GFP variants with faster folding, faster chromophore maturation and increased solubility. Circular and non-circular permutations of the GFP polypeptide chain are found to be well tolerated, as are many ways of splitting the chain into two parts, leading to biosensors based on circularly permuted and split GFP.
We review several GFP-based biomarkers and biosensors, with emphasis on their construction, their detection targets and their applications. Among the detection targets are pH, ions, reactive oxygen species, proteins, peptides and enzyme activity. Biosensors are created from GFP by making mutations that change its sensitivity, or by fusing it to functional domains, or by splicing functional domains and loops into exposed loops of GFP, or by splitting GFP. Förster resonance energy transfer (FRET) is used in many cases as a powerful and sensitive means of detecting interacting components of a system. Finally, there is considerable promise for the future of GFP-biosensors created by computational protein design, in which the site of one of the eleven beta strands is replaced by a binding site for a desired target peptide. Proofs of concept are presented here.
Summary: Protein unfolding is modeled as an ensemble of pathways, where each step in each pathway is the addition of one topologically possible conformational degree of freedom. Starting with a known protein structure, GeoFold hierarchically partitions (cuts) the native structure into substructures using revolute joints and translations. The energy of each cut and its activation barrier are calculated using buried solvent accessible surface area, side chain entropy, hydrogen bonding, buried cavities, and backbone degrees of freedom. A directed acyclic graph is constructed from the cuts, representing a network of simultaneous equilibria. Finite difference simulations on this graph simulate native unfolding pathways. Experimentally observed changes in the unfolding rates for disulfide mutants of barnase, T4 lysozyme, dihydrofolate reductase, and factor for inversion stimulation were qualitatively reproduced in these simulations. Detailed unfolding pathways for each case explain the effects of changes in the chain topology on the folding energy landscape. GeoFold is a useful tool for the inference of the effects of topology and mutation on the energy landscape of protein unfolding.
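To make the idea of a finite difference simulation on a network of simultaneous equilibria concrete, here is a minimal sketch. It is not GeoFold's implementation: it assumes simple first-order interconversions between hypothetical states `N`, `I` and `U` with made-up rate constants, and integrates them with a forward Euler step. The function name `simulate_equilibria` is likewise an assumption.

```python
def simulate_equilibria(concentrations, reactions, dt=0.001, steps=20000):
    """Forward-Euler integration of coupled reversible first-order reactions.

    reactions is a list of (source, product, k_forward, k_reverse) tuples.
    States and rate constants are hypothetical, chosen only for illustration.
    """
    conc = dict(concentrations)
    for _ in range(steps):
        delta = {state: 0.0 for state in conc}
        for source, product, k_forward, k_reverse in reactions:
            # Net amount converted from source to product during this time step.
            flux = (k_forward * conc[source] - k_reverse * conc[product]) * dt
            delta[source] -= flux
            delta[product] += flux
        for state in conc:
            conc[state] += delta[state]
    return conc


# Native <-> intermediate <-> unfolded, starting from the fully native state.
final = simulate_equilibria(
    {"N": 1.0, "I": 0.0, "U": 0.0},
    [("N", "I", 1.0, 2.0), ("I", "U", 3.0, 1.0)],
)
print(final)
```

At long times each pair of connected states approaches the ratio k_forward / k_reverse, and the total concentration is conserved. In the abstract the rates come from the computed energies and activation barriers of each cut; here they are simply supplied by hand.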
Psychrophilic organisms have adapted to live at low temperatures by using a variety of mechanisms. Here, we examine twenty homologous enzyme pairs from psychrophiles and mesophiles to investigate flexibility as a key characteristic for cold adaptation. B-factors in protein X-ray structures are one way to measure flexibility. Comparing psychrophilic to mesophilic protein B-factors shows that psychrophilic enzymes are more flexible in 3-turn and strand secondary structures. Enzyme cavities, identified using CASTp at various probe sizes, indicate that psychrophilic enzymes have larger average cavity sizes at probe radii of 1.4-1.5 Å. Furthermore, amino acid side chains lining these cavities show an increased frequency of acidic groups in psychrophilic enzymes. These findings suggest that embedded water molecules may play a significant role in cavity flexibility, and therefore, overall protein flexibility. Thus, our results point to the important role enzyme flexibility plays in adaptation to cold environments.
ABSTRACT: Several versions of split green fluorescent protein (GFP) fold and reconstitute fluorescence, as do many circular permutants, but little is known about the dependence of reconstitution on circular permutation. Explored here is the capacity of GFP to fold and reconstitute fluorescence from various truncated circular permutants, herein called "leave-one-outs", using a quantitative in vivo solubility assay and in vivo reconstitution of fluorescence. Twelve leave-one-out permutants are discussed, one for each of the 12 secondary structure elements. The results expand the outlook for the use of permuted split GFPs as specific and self-reporting gene-encoded affinity reagents.
The pathway which proteins take to fold can be influenced from the earliest events of structure formation. In this light, it was both predicted and confirmed that increasing the stiffness of a beta hairpin turn decreased the size of the transition state ensemble (TSE) while increasing the folding rate. Thus there appears to be a relationship between conformationally restricting the TSE and increasing the folding rate, at least for beta hairpin turns. In this study, we hypothesize that the enormous sampling necessary to fold even two-state folding proteins in silico could be reduced if local structure constraints were used to restrict structural heterogeneity by polarizing folding pathways or forcing folding into preferred routes. Using a Go model we fold Chymotrypsin Inhibitor 2 (CI-2) and the SH3 domain after constraining local sequence windows to their native structure by rigid body dynamics. Trajectories were monitored for any changes to the folding pathway and differences in the kinetics compared to unconstrained simulations. For both proteins folding time is generally decreased after constraining any local sequence window. Structural polarization of the folding pathway appears to explain these rate increases and occurs regardless of whether the locally constrained structure exists in the native TSE or not. Folding rate enhancements are consistent with the goal to reduce sampling time necessary to reach native structures during folding simulations. Interestingly, not all constrained windows decreased folding time equally. We conclude by analyzing these differences and explain why rigid body dynamics may be the preferred way to constrain structure.
The sequential order of secondary structural elements in proteins affects the folding and activity to an unknown extent. To test the dependence on sequential connectivity, we reconnected secondary structural elements by their solvent-exposed ends, permuting their sequential order, called "rewiring". This new protein design strategy changes the topology of the backbone without changing the core side chain packing arrangement. While circular and noncircular permutations have been observed in protein structures that are not related by sequence homology, to date no one has attempted to rationally design and construct a protein with a sequence that is noncircularly permuted while conserving three-dimensional structure. Herein, we show that green fluorescent protein can be rewired, still functionally fold, and exhibit wild-type fluorescence excitation and emission spectra.
Background: Proteins have evolved subject to energetic selection pressure for stability and flexibility. Structural similarity between proteins that have gone through conformational changes can be captured effectively if flexibility is considered. Topologically unrelated proteins that preserve secondary structure packing interactions can be detected if both flexibility and sequential permutations are considered. We propose the FlexSnap algorithm for flexible non-topological protein structural alignment. Results: The effectiveness of FlexSnap is demonstrated by measuring the agreement of its alignments with manually curated non-sequential structural alignments. FlexSnap showed competitive results against state-of-the-art algorithms, like DALI, SARF2, MultiProt, FlexProt, and FATCAT. Moreover, on the DynDom dataset, FlexSnap reported longer alignments with smaller rmsd. Conclusions: We have introduced FlexSnap, a greedy chaining algorithm that reports both sequential and non-sequential alignments and allows twists (hinges). We assessed the quality of the FlexSnap alignments by measuring their agreement with manually curated non-sequential alignments. On the FlexProt dataset, FlexSnap was competitive with state-of-the-art flexible alignment methods. Moreover, we demonstrated the benefits of introducing hinges by showing significant improvements in the alignments reported by FlexSnap for the structure pairs for which rigid alignment methods reported alignments with either low coverage or large rmsd.
The remarkable predominance of right-handedness in beta-alpha-beta helical crossovers has been previously explained in terms of thermodynamic stability and kinetic accessibility, but a different kinetic trapping mechanism may also play a role. If the beta-sheet contacts are made before the crossover helix is fully formed, and if the backbone angles of the folding helix follow the energetic pathway of least resistance, then the helix would impart a torque on the ends of the two strands. Such a torque would tear apart a left-handed conformation but hold together a right-handed one. Right-handed helical crossovers predominate even in all-alpha proteins, where previous explanations based on the preferred twist of the beta sheet do not apply. Using simple molecular simulations, we can reproduce the right-handed preference in beta-alpha-beta units, without imposing specific beta strand geometry. The new kinetic trapping mechanism is dubbed the "phone cord effect" because it is reminiscent of the way a helical phone cord forms superhelices to relieve torsional stress. Kinetic trapping explains the presence of a right-handed superhelical preference in alpha helical crossovers, and provides a possible folding mechanism for knotted proteins. See Supplementary materials.
Structural similarity between proteins gives us insights into their evolutionary relationships when there is low sequence similarity. In this paper, we present a novel approach called SNAP for non-sequential pair-wise structural alignment. Starting from an initial alignment, our approach iterates over a two-step process consisting of a superposition step and an alignment step, until convergence. We propose a novel greedy algorithm to construct both sequential and non-sequential alignments. The quality of SNAP alignments was assessed by comparing against the manually curated reference alignments in the challenging SISY and RIPC datasets. Moreover, when applied to a dataset of 4410 protein pairs selected from the CATH database, SNAP produced longer alignments with lower rmsd than several state-of-the-art alignment methods. Classification of folds using SNAP alignments was both highly sensitive and highly selective. The SNAP software, along with the datasets, is available online at http://www.cs.rpi.edu/~zaki/software/SNAP.
Green fluorescent protein (GFP) has been used as a proof of concept for a novel "leave-one-out" biosensor design where a protein that has a segment omitted from the middle of the sequence by circularly permuting and truncating, binds the missing peptide and reconstitutes its function. Three variants of GFP have been synthesized that are each missing one of the eleven beta strands from its beta barrel structure, and in two of the variants adding the omitted peptide sequence in trans reconstitutes fluorescence. Detailed biochemical analysis indicates that GFP with beta-strand 7 "left out" (t7SPm) exists in a partially unfolded state. The apo-form t7SPm binds the free beta-strand 7 peptide with a dissociation constant of ~0.5 µM and folds into the native state of GFP, resulting in fluorescence recovery. Folding of t7SPm, both with and without the peptide ligand, is at least a three-state process and has a rate comparable to the full-length and unpermuted GFP. The conserved kinetic properties strongly suggest that the rate limiting steps in the folding pathway have not been altered by circular permutation and truncation in t7SPm. This study shows that structural and functional reconstitution of GFP can occur with a segment omitted from the middle of the chain, and that the unbound form is in a partially unfolded state.
Protein folding is a hierarchical process where structure forms locally first, then globally. Some short sequence segments initiate folding through strong structural preferences that are independent of their three-dimensional context in proteins. We have constructed a knowledge-based force field in which the energy functions are conditional on local sequence patterns, as expressed in the hidden Markov model HMMSTR. CALF (C-ALpha Force field) builds sequence specific statistical potentials based on database frequencies for alpha-carbon virtual bond opening and dihedral angles, pair-wise contacts and hydrogen bond donor-acceptor pairs, and simulates folding via Brownian dynamics. We introduce hydrogen bond donor and acceptor potentials as alpha-carbon probability fields that are conditional on the predicted local sequence. Constant temperature simulations were carried out using 27 peptides selected as putative folding initiation sites, each 12 residues in length, representing several different local structure motifs. Each 0.6 microsecond trajectory was clustered based on structure. Simulation convergence or representativeness was assessed by subdividing trajectories and comparing clusters. For 21 of the 27 sequences, the largest cluster made up more than half of the total trajectory. Of these 21 sequences, 14 had cluster centers that were at most 2.6 Å RMSD from their native structure in the corresponding full-length protein. To assess the adequacy of the energy function on non-local interactions, 11 full length native structures were relaxed using low-temperature Brownian dynamics. Equilibrated structures deviated from their native states but retained their overall topology and compactness. A simple potential that folds proteins locally and stabilizes proteins globally may enable a more realistic understanding of hierarchical folding pathways.
Amino acid sequence probability distributions, or profiles, have been used successfully to predict secondary structure and local structure in proteins. Profile models assume the statistical independence of each position in the sequence, but the energetics of protein folding is better captured in a scoring function that is based on pairwise interactions, like a force field. Results: I-sites motifs are short sequence/structure motifs that populate the protein structure database due to energy-driven convergent evolution. Here we show that a pairwise covariant sequence model does not predict alpha helix or beta strand significantly better overall than a profile-based model, but it does improve the prediction of certain loop motifs. The finding is best explained by considering secondary structure profiles as multivariant, all-or-none models, which subsume covariant models. Pairwise covariance is nonetheless present and energetically rational. Examples of negative design are present, where the covariances disfavor non-native structures. Conclusions: Measured pairwise covariances are shown to be statistically robust in cross-validation tests, as long as the amino acid alphabet is reduced to nine classes. Availability: An updated I-sites local structure motif library that provides sequence covariance information for all types of local structure in globular proteins and a web server for local structure prediction are available at www.bioinfo.rpi.edu/bystrc/hmmstr/server.php.
Summary: Most proteins are in equilibrium with partially and globally unfolded conformations. In contrast, kinetically stable proteins (KSPs) are trapped by an energy barrier in a specific state, unable to transiently sample other conformations. Among many potential roles, it appears that kinetic stability (KS) is a feature used by nature to allow proteins to maintain activity under harsh conditions and to preserve the structure of proteins that are prone to misfolding. The biological and pathological significance of KS remains very poorly understood due to the lack of simple experimental methods to identify this property, and its infrequent occurrence in proteins. Based on our previous correlation between KS and a protein's resistance to the denaturing detergent sodium dodecyl sulfate (SDS), we show here the application of a diagonal two-dimensional (D2D) SDS-polyacrylamide gel electrophoresis (PAGE) assay to identify KSPs in complex mixtures. We applied this method to the lysate of E. coli, and upon proteomics analysis have identified 50 non-redundant proteins that were SDS resistant (i.e. putatively kinetically stable), either individually or as part of a protein complex. Structural and functional analyses of a subset (44) of these proteins with known 3D structure revealed some potential structural and functional biases towards and against KS. This simple D2D SDS-PAGE assay will allow the widespread investigation of KS, including the proteomics-level identification of KSPs in different systems, potentially leading to a better understanding of the biological and pathological significance of this intriguing property of proteins.
Summary: We describe an efficient method for partial complementary shape matching for use in rigid protein-protein docking. The local shape features of a protein are represented using boolean data structures called context shapes. The relative orientation of the receptor and ligand surfaces is searched using pre-calculated lookup tables. Energetic quantities are derived from shape complementarity and buried surface area computations using efficient boolean operations. Preliminary results indicate that our context-shapes-based approach outperforms state-of-the-art geometric-shape-based rigid docking algorithms like ZDOCK(PSC) and PatchDock. Binary code of the implementation is available on request. The code will be available for downloading once the project website is set up.
Summary: Hidden Markov models (HMMs) are an extremely versatile statistical representation that can be used to model any set of one-dimensional discrete symbol data. HMMs can model protein sequences in many ways, depending on what features of the protein are represented by the Markov states. For protein structure prediction, states have been chosen to represent either homologous sequence positions, local or secondary structure types, or transmembrane locality. The resulting models can be used to predict common ancestry, secondary or local structure, or membrane topology by applying one of the two standard algorithms for comparing a sequence to a model. In this chapter we review those algorithms and discuss how HMMs have been constructed and refined for the purpose of protein structure prediction.
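The summary above does not name the two standard algorithms. The sketch below assumes they are the Viterbi algorithm (most probable state path) and the forward algorithm (total probability of the sequence under the model). The two-state model, its states `H` and `C`, the two-letter emission alphabet and all probabilities are hypothetical and exist only to make the example runnable.

```python
import math


def viterbi(seq, states, start, trans, emit):
    """Return the most probable state path and its log-probability."""
    scores = {s: math.log(start[s]) + math.log(emit[s][seq[0]]) for s in states}
    backpointers = []
    for symbol in seq[1:]:
        new_scores, pointers = {}, {}
        for s in states:
            # Best predecessor state for arriving in state s at this position.
            prev = max(states, key=lambda p: scores[p] + math.log(trans[p][s]))
            new_scores[s] = (scores[prev] + math.log(trans[prev][s])
                             + math.log(emit[s][symbol]))
            pointers[s] = prev
        scores = new_scores
        backpointers.append(pointers)
    last = max(states, key=lambda s: scores[s])
    path = [last]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    path.reverse()
    return path, scores[last]


def forward(seq, states, start, trans, emit):
    """Return the total probability of seq summed over all state paths."""
    f = {s: start[s] * emit[s][seq[0]] for s in states}
    for symbol in seq[1:]:
        f = {s: emit[s][symbol] * sum(f[p] * trans[p][s] for p in states)
             for s in states}
    return sum(f.values())


# Hypothetical toy model: state H prefers symbol A, state C prefers symbol G.
STATES = ["H", "C"]
START = {"H": 0.5, "C": 0.5}
TRANS = {"H": {"H": 0.9, "C": 0.1}, "C": {"H": 0.1, "C": 0.9}}
EMIT = {"H": {"A": 0.8, "G": 0.2}, "C": {"A": 0.2, "G": 0.8}}

path, log_prob = viterbi("AAAGGG", STATES, START, TRANS, EMIT)
print("".join(path))  # HHHCCC
print(forward("AAAGGG", STATES, START, TRANS, EMIT))
```

For long sequences the forward recursion would normally be carried out in log space or with scaling to avoid numerical underflow; that is omitted here for brevity.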
Proteins are linear chains that fold into characteristic shapes and features. To understand proteins and protein folding, we try to represent the protein molecule in such a way that its features are easy to see and manipulate. A simple representation facilitates algorithm design for structure prediction. The simplicity of the 3-state character string representation of secondary structure is part of the reason for secondary structure prediction receiving so much attention early in the era of computational biology. One-dimensional strings are easily understood, parsed, mined and manipulated. But secondary structure alone does not tell us enough about the overall shapes and features of a protein. We need a simple way to represent the overall tertiary structure of a protein.
Here we explore a two-dimensional Boolean matrix representation of protein structure, where each dimension is the residue number and each value is true if the residues are spatial neighbors and false otherwise -- called a contact map. A contact map is the simplest representation of a protein that can be faithfully projected back into three dimensions. As such it has received increased attention in recent years from bioinformaticists, who see this as a data structure that is readily amenable to data mining and machine learning.
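A contact map as defined above can be computed in a few lines. This is a minimal sketch under stated assumptions: one representative point per residue (for example the alpha-carbon) and a distance cutoff of 8 Å. The text does not specify either choice, and the function name `contact_map` is our own.

```python
import math


def contact_map(coords, cutoff=8.0):
    """Return a symmetric Boolean matrix: True where two residues are within cutoff.

    coords holds one (x, y, z) point per residue; the 8 Angstrom default is an
    assumed, commonly used threshold, not a value taken from the text.
    """
    n = len(coords)
    return [[math.dist(coords[i], coords[j]) <= cutoff for j in range(n)]
            for i in range(n)]


# Four residues on a straight line, spaced 3.8 Angstroms apart.
chain = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0), (11.4, 0.0, 0.0)]
for row in contact_map(chain):
    print("".join("#" if contact else "." for contact in row))
```

Residues one and two positions apart fall within the cutoff, while residues three positions apart (11.4 Å) do not. Note that the diagonal is trivially true in this sketch; applications often ignore contacts between residues that are close in sequence.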
Motivation: In recent years, advances have been made in the ability of computational methods to discriminate between homologous and non-homologous proteins in the "Twilight Zone" of sequence similarity, where the percent sequence identity is a poor indicator of homology. To make these predictions more valuable to the protein modeler, they must be accompanied by accurate alignments. Pairwise sequence alignments are inferences of orthologous relationships between sequence positions. Evolutionary distance is traditionally modeled using global amino acid substitution matrices. But real differences in the likelihood of substitutions may exist for different structural contexts within proteins, since structural context contributes to the selective pressure.
Results: HMMSUM (HMMSTR-based SUbstitution matrices) is a new model for structural context-based amino acid substitution probabilities consisting of a set of 281 matrices, each for a different sequence-structure context. HMMSUM does not require the structure of the protein to be known. Instead, predictions of local structure are made using HMMSTR, a hidden Markov model for local structure. Alignments using the HMMSUM matrices compare favorably to alignments carried out using the BLOSUM50 matrix when validated against curated remote homolog alignments from BAliBASE. HMMSUM has been implemented using local Dynamic Programming and with the Bayesian Adaptive alignment method.
Availability: Matrices and programs are available at http://www.bioinfo.rpi.edu/bystrc/downloads.html.
Contact: email@example.com, firstname.lastname@example.org
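The idea of context-dependent substitution scoring combined with local dynamic programming can be sketched as follows. This is not the HMMSUM code: it assumes a Smith-Waterman recursion with a linear gap penalty, assumes that each query position carries a single context label, and uses two hypothetical contexts (`H` and `E`) with made-up match and mismatch scores in place of the 281 matrices described above.

```python
def make_matrix(alphabet, match, mismatch):
    """Build a toy substitution matrix as a nested dict (illustrative values only)."""
    return {a: {b: (match if a == b else mismatch) for b in alphabet}
            for a in alphabet}


def local_alignment_score(query, contexts, target, matrices, gap=-2):
    """Smith-Waterman score in which the matrix depends on the query position's context."""
    rows, cols = len(query) + 1, len(target) + 1
    table = [[0] * cols for _ in range(rows)]
    best = 0
    for i in range(1, rows):
        # Pick the substitution matrix for the context of query position i-1.
        matrix = matrices[contexts[i - 1]]
        for j in range(1, cols):
            diagonal = table[i - 1][j - 1] + matrix[query[i - 1]][target[j - 1]]
            table[i][j] = max(0, diagonal,
                              table[i - 1][j] + gap,
                              table[i][j - 1] + gap)
            best = max(best, table[i][j])
    return best


ALPHABET = "ACGT"  # toy alphabet; a real application would use the 20 amino acids
MATRICES = {
    "H": make_matrix(ALPHABET, match=2, mismatch=-1),  # hypothetical context
    "E": make_matrix(ALPHABET, match=3, mismatch=-2),  # hypothetical context
}

print(local_alignment_score("ACGT", "HHEE", "ACGT", MATRICES))  # 10
```

In the method described above, the context labels would come from HMMSTR predictions of local structure rather than being supplied by hand.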
Summary: We present a method for constructing thousands of compact protein conformations from fragments and then connecting these structures to form a network of physically plausible folding pathways. This is the first attempt to merge the previous successes in fragment assembly methods with probabilistic roadmap (PRM) methods. Previous PRM methods have used the knowledge of the true structure to sample conformational space. Our method uses only the amino acid sequence to bias the conformational sampling. Conformational sampling is done using HMMSTR, a hidden Markov model for local sequence-structure correlations. We then build a PRM graph and find paths that have the lowest energy climb. We find that favored folding pathways exist, corresponding to deep valleys in the energy landscape. We describe the pathways for three small proteins with different secondary structure content in the context of a folding funnel model.
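One way to read "lowest energy climb" is as a minimax path problem: among all paths through the roadmap graph, choose the one whose highest-energy conformation is as low as possible. That reading is an assumption, as are the function name `lowest_climb_path`, the node labels and the energies in the sketch below, which uses a Dijkstra-style search with the path cost defined as the peak energy encountered.

```python
import heapq


def lowest_climb_path(energy, edges, start, goal):
    """Find the path from start to goal that minimizes the highest energy visited.

    Returns (path, climb), where climb is the peak energy minus the start energy.
    """
    neighbors = {node: [] for node in energy}
    for a, b in edges:
        neighbors[a].append(b)
        neighbors[b].append(a)
    best_peak = {start: energy[start]}
    parent = {start: None}
    heap = [(energy[start], start)]
    while heap:
        peak, node = heapq.heappop(heap)
        if peak > best_peak[node]:
            continue  # stale heap entry
        if node == goal:
            break
        for nxt in neighbors[node]:
            new_peak = max(peak, energy[nxt])
            if new_peak < best_peak.get(nxt, float("inf")):
                best_peak[nxt] = new_peak
                parent[nxt] = node
                heapq.heappush(heap, (new_peak, nxt))
    if goal not in best_peak:
        return None, None
    path, node = [], goal
    while node is not None:
        path.append(node)
        node = parent[node]
    path.reverse()
    return path, best_peak[goal] - energy[start]


# Hypothetical roadmap: unfolded state U, intermediates I1-I3, native state N.
ENERGY = {"U": 0.0, "I1": 5.0, "I2": 2.0, "I3": 1.0, "N": -3.0}
EDGES = [("U", "I1"), ("I1", "N"), ("U", "I2"), ("I2", "I3"), ("I3", "N")]

print(lowest_climb_path(ENERGY, EDGES, "U", "N"))  # (['U', 'I2', 'I3', 'N'], 2.0)
```

The direct route through I1 is shorter but crosses a barrier of 5.0, so the search prefers the longer route whose highest point is only 2.0 above the starting energy.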
ECOME is an interactive, graph-based model for simulating an evolving, closed consumption web. It demonstrates the fundamental behavior of a global ecosystem over evolutionary time using well-established ecological/evolutionary principles. Nodes in the graph send biomass along weighted, directed edges. New nodes evolve by speciation and disappear when biomass (i.e. population) shrinks to zero. Consumption rates, predator/prey relationships, and speciation rates are user-defined, following theoretic distributions. The output shows the biomass and biodiversity over time for up to five trophic levels. Using this simple system, we demonstrate that closed ecosystems are inherently unstable in the absence of evolution or in the presence of a single, hyperchanging species, but are dynamically stable and robust to perturbations when the evolution rates for all species follow a normal distribution. Our new application provides provocative lessons for biology students during a time of mass extinction.
Motivation: Proteins of the same class often share a secondary structure packing arrangement but differ in how the secondary structure units are ordered in the sequence. We find that proteins that share a common core also share local sequence-structure similarities, and these can be exploited to align structures with different topologies. In this study, segments from a library of local sequence-structure alignments were assembled hierarchically, enforcing the compactness and conserved inter-residue contacts but not sequential ordering. Previous structure-based alignment methods often ignore sequence similarity, local structural equivalence, and compactness.
Results: The new program, SCALI (Structural Core ALIgnment), can efficiently find conserved packing arrangements, even if they are non-sequentially ordered in space. SCALI alignments conserve remote sequence similarity and contain fewer alignment errors. Clustering of our pairwise non-sequential alignments shows that recurrent packing arrangements exist in topologically different structures. For example, the 3-layer sandwich domain architecture may be divided into four structural subclasses based on internal packing arrangements. These subclasses represent an intermediate level of structure classification, more general than topology but more specific than architecture as defined in CATH. A strategy is presented for developing a set of predictive hidden Markov models based on multiple SCALI alignments.
Availability: An online topology independent SCALI structure comparison server is available at http://www.bioinfo.rpi.edu/bystrc/scali.html.
Summary: A structured folding pathway, which is a time-ordered sequence of folding events, plays an important role in the protein folding process and hence in the conformational search. Pathway prediction thus gives more insight into the folding process and is a valuable guiding tool for searching the conformation space. In this paper, we propose a novel unfolding approach to predict the folding pathway. We apply graph-based methods on a weighted secondary structure graph of a protein to predict the sequence of unfolding events. When viewed in reverse, this yields the folding pathway. We demonstrate the success of our approach on several proteins whose pathway is partially known.
Remote homology detection refers to the detection of structural homology in proteins when there is little or no sequence similarity. In this article, we present a remote homolog detection method called SVM-HMMSTR that overcomes the reliance on detectable sequence similarity by transforming the sequences into strings of hidden Markov states that represent local folding motif patterns. These state strings are transformed into fixed dimension feature vectors for input to a support vector machine. Two sets of features are defined: an order-independent feature set that captures the amino acid and local structure composition; and an order-dependent feature set that captures the sequential ordering of the local structures. Tests using the Structural Classification of Proteins (SCOP) 1.53 data set show that the SVM-HMMSTR gives a significant improvement over several current methods. Proteins 2004;57:518-30.
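The step of turning a variable-length state string into a fixed-dimension feature vector can be illustrated with a short sketch. The abstract does not give the exact feature definitions, so the two encodings below are assumptions: state composition as an order-independent feature set, and adjacent-pair (bigram) frequencies as a simple order-dependent one. The three-letter state alphabet is hypothetical.

```python
from itertools import product


def composition_features(states, alphabet):
    """Order-independent features: fraction of positions occupied by each state."""
    if not states:
        raise ValueError("state string must not be empty")
    return [states.count(symbol) / len(states) for symbol in alphabet]


def bigram_features(states, alphabet):
    """Order-dependent features: frequency of each ordered pair of adjacent states."""
    index = {pair: i for i, pair in enumerate(product(alphabet, repeat=2))}
    counts = [0] * len(index)
    for a, b in zip(states, states[1:]):
        counts[index[(a, b)]] += 1
    total = max(len(states) - 1, 1)
    return [count / total for count in counts]


# Hypothetical state string over a three-state alphabet.
print(composition_features("AABC", "ABC"))  # [0.5, 0.25, 0.25]
print(bigram_features("AABC", "ABC"))
```

Both vectors have a length that depends only on the size of the state alphabet, not on the protein length, which is what allows them to be passed to a support vector machine.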
A review of recent work toward modeling the protein folding pathway using a bioinformatics approach is presented. Statistical models have been developed for sequence-structure correlations in proteins at five levels of structural complexity: (1) short motifs, (2) extended motifs, (3) non-local pairs of motifs, (4) three dimensional arrangements of multiple motifs, and (5) global structural homology. Here we review statistical models, including sequence profiles, hidden Markov models and interaction potentials, for the first four levels of structural detail. The I-sites Library (folding Initiation sites) models local structure motifs. HMMSTR (Hidden Markov Model for STRucture) is a hidden Markov model for extended motifs. HMMSTR-CM (Contact Maps) is a model for pairwise interactions between motifs. And SCALI-HMM (HMMs for Structural Core Alignments) is a set of hidden Markov models for spatial arrangements of motifs. Global sequence models have been extensively reviewed elsewhere and are not discussed here. The parallels between the statistical models and the theoretical models for folding pathways are discussed.
The data used and the algorithms presented in this paper are available at http://www.bioinfo.rpi.edu/bystrc/ or by request to email@example.com. HMMSTR predictions may be obtained from this web site: http://www.bioinfo.rpi.edu/bystrc/hmmstr/server.php
Knowledge-based potential functions for protein structure prediction assume that the frequency of occurrence of a given structure or a contact in the protein database is a measure of its free energy. Here, we put this assumption to the test by comparing the results obtained from sequence-structure cluster analysis with those obtained from long all-atom molecular dynamics simulations. Sixty-four eight-residue peptide sequences with varying degrees of similarity to the canonical sequence pattern for amphipathic helix were drawn from known protein structures, regardless of whether they were helical in the protein. Each was simulated using AMBER6.0 for at least 10 ns using explicit waters. The total simulation time was 1176 ns. The resulting trajectories were tested for reproducibility, and the helical content was measured. Natural peptides whose sequences matched the amphipathic helix motif with greater than 50% confidence were significantly more likely to form helix during the course of the simulation than peptides with lower confidence scores. The sequence pattern derived from the simulation data closely resembles the motif pattern derived from the database cluster analysis. The difficulties encountered in sampling conformational space and sequence space simultaneously are discussed. Proteins 2003;50:552-562.
The function of an unknown biological sequence can often be accurately inferred if we are able to map this unknown sequence to its corresponding homologous family. Currently, the discriminative approach, which combines support vector machines and sequence similarity, is recognized as the most accurate approach. SVM-Fisher and SVM-pairwise are two representatives of this approach, and SVM-pairwise is the most accurate method. However, these methods encode only sequence information into their feature vectors and ignore structure information. In addition, one of their major drawbacks is their computational inefficiency. Based on this observation, we present an alternative method for SVM-based protein classification. Our method, SVM-I-sites, uses structure similarity instead of sequence similarity for remote homology detection. Our studies show that SVM-I-sites is much more efficient than both SVM-Fisher and SVM-pairwise while achieving performance comparable to SVM-pairwise.
Results: We adopt SCOP 1.53 as our dataset. The results show that SVM-I-sites runs much faster, outperforms many state-of-the-art sequence-based methods such as PSI-BLAST, SAM and SVM-Fisher, and is comparable to SVM-pairwise.
Availability: I-sites server is accessible through the web at http://www.bioinfo.rpi.edu.
Programs are available upon request for academics. Licensing agreements are available for commercial interests. The framework of encoding local structure into feature vector is available upon request.
We present a novel method, HMMSTR-CM, for protein contact map predictions. Contact potentials were calculated using HMMSTR, a hidden Markov model for local sequence structure correlations. Targets were aligned against protein templates using a Bayesian method and contact maps were generated using these alignments. Contact potentials then were used to evaluate these templates. An ab initio method was developed based on the target contact potentials using a rule-based strategy to model the protein folding pathway. Fold recognition and ab initio methods were combined to produce accurate, protein-like contact maps. Pathways sometimes led to an unambiguous prediction of topology, even without using templates. The results on CASP5 targets are discussed. Also included is a brief update on the quality of fully automated ab initio predictions using the I-sites server.
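For readers unfamiliar with the representation, a contact map is a symmetric residue-by-residue matrix marking which pairs are close in space. The sketch below derives one from known coordinates. It does not reproduce the HMMSTR-CM prediction method; the 8 A cutoff and the minimum sequence separation are common conventions assumed here, not values taken from the abstract.

```python
import math


def contact_map(ca_coords, cutoff=8.0, min_separation=3):
    """Boolean contact map from alpha-carbon coordinates.

    ca_coords: list of (x, y, z) tuples, one per residue, in Angstroms.
    cutoff: distance threshold for a contact (assumed value).
    min_separation: pairs closer than this in sequence are ignored,
        since chain neighbors are trivially close (assumed value).
    """
    n = len(ca_coords)
    cmap = [[False] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + min_separation, n):
            if math.dist(ca_coords[i], ca_coords[j]) <= cutoff:
                # The map is symmetric by construction.
                cmap[i][j] = cmap[j][i] = True
    return cmap


if __name__ == "__main__":
    # A toy six-residue hairpin: two antiparallel strands 5 A apart.
    hairpin = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0),
               (7.6, 5.0, 0.0), (3.8, 5.0, 0.0), (0.0, 5.0, 0.0)]
    for row in contact_map(hairpin):
        print("".join("#" if contact else "." for contact in row))
```

In the toy hairpin, contacts line up along the anti-diagonal, which is the signature of antiparallel strand pairing in a contact map.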
Proteins fold through a series of intermediate states called a pathway. Protein folding pathways have been modeled using either simulations or a hierarchy of statistical models. Here we present a series of related statistical models that attempt to predict early, middle and late intermediates along the folding pathway. I-sites motifs are discrete models for folding initiation sites. HMMSTR is a model for local structure patterns composed of I-sites motifs. HMMSTR-CM is an approach toward assembling motifs and groups of motifs in a contact map representation, using heuristic rules to predict contact maps either with or without the use of templates. We also discuss the I-sites/ROSETTA server, which is a folding simulation algorithm that uses a fragment library as input. The results of blind structure prediction experiments are discussed. Pathway-based predictions sometimes lead to an unambiguous prediction of the fold topology, even without using templates.
Ab initio prediction is the challenging attempt to predict protein structures based only on sequence information and without using templates. It is often divided into two distinct sub-problems: (1) the scoring function that can distinguish native or native-like structures from non-native ones, and (2) the method of searching the conformational space. Currently there is no reliable scoring function that can always drive a search to the native fold, and there is no general search method that can guarantee a significant sampling of near-natives. Pathway models combine the scoring function and the search. In this short review, we explore some of the ways pathway models are used in folding, in published works since 2001, and present a new pathway model, HMMSTR-CM, that uses a fragment library and a set of nucleation/propagation-based rules. The new method was used for ab initio predictions as part of CASP5. This work was presented at the Winter School in Bioinformatics, Bologna, Italy, Feb 10-14, 2003.
A fast algorithm for computing the solvent accessible molecular surface area (SAS) using Boolean masks (Le Grand, S. M. & Merz, K. M. J. (1993). J. Comp. Chem. 14, 349-52.) has been modified to estimate the solvent excluded molecular surface area (SES), including contact, toroidal and reentrant surface components. Numerical estimates of arc lengths of intersecting atomic SAS are used to estimate the toroidal surface, and intersections between those arcs are used to estimate the reentrant surface area. The new method is compared to an exact analytical method. Boolean molecular surface areas are continuous and pairwise differentiable, and should be useful for molecular dynamics simulations, especially as the basis for an implicit solvent model.
Motivation: The Monte Carlo fragment insertion method for protein tertiary structure prediction (ROSETTA) of Baker and others has been merged with the I-SITES library of sequence structure motifs and the HMMSTR model for local structure in proteins, to form a new public server for the ab initio prediction of protein structure. The server performs several tasks in addition to tertiary structure prediction, including a database search, amino acid profile generation, fragment structure prediction, and backbone angle and secondary structure prediction. Meeting reasonable service goals required improvements in efficiency, in particular for the ROSETTA algorithm. Results: The new server was used for blind predictions of 40 protein sequences as part of the CASP4 blind structure prediction experiment. The results for 31 of those predictions are presented here. 61% of the residues overall were found in topologically correct predictions, which are defined as fragments of 30 residues or more with a root-mean-square deviation in superimposed alpha carbons of less than 6 A. HMMSTR 3-state secondary structure predictions were 73% correct overall. Tertiary structure predictions did not improve the accuracy of secondary structure prediction. Availability: The server is accessible through the web at www.bioinfo.rpi.edu/bystrc/hmmstr/server.php
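The abstract does not state how the 73% figure was computed. Assuming it is the conventional per-residue three-state accuracy (often called Q3), it can be sketched as below. The state letters H, E and L and the function name are illustrative assumptions.

```python
def three_state_accuracy(predicted, observed):
    """Percentage of residues whose predicted state matches the observed one.

    predicted, observed: equal-length strings over a three-letter alphabet,
    here assumed to be H (helix), E (strand) and L (loop/other).
    """
    if len(predicted) != len(observed):
        raise ValueError("predicted and observed must have the same length")
    if not observed:
        raise ValueError("sequences must not be empty")
    matches = sum(p == o for p, o in zip(predicted, observed))
    return 100.0 * matches / len(observed)


if __name__ == "__main__":
    # Hypothetical ten-residue example with one mispredicted residue.
    print(three_state_accuracy("HHHHEEELLL", "HHHLEEELLL"))
```

An "overall" figure is normally pooled over all residues of all targets rather than averaged per protein, though the abstract does not say which was used.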
Torsion space molecular dynamics may be more efficiently encoded if the global motions are separated from the internal motions. The equations of motion for single, non-cyclic chains are shown to be first order in the backbone angle parameters when the global frame of reference is ignored, and second order otherwise. Adding a simple heuristic substitute for the global motions enables the encoding of dynamics for mixed constrained/un-constrained model systems.
We describe a hidden Markov model, HMMSTR, for general protein sequence based on the I-sites library of sequence-structure motifs. Unlike the linear hidden Markov models used to model individual protein families, HMMSTR has a highly branched topology and captures recurrent local features of protein sequences and structures that transcend protein family boundaries. The model extends the I-sites library by describing the adjacencies of different sequence-structure motifs as observed in the protein database and, by representing overlapping motifs in a much more compact form, achieves a great reduction in parameters. The HMM attributes a considerably higher probability to coding sequence than does an equivalent dipeptide model, and predicts secondary structure with an accuracy of 74.3%, backbone torsion angles better than any previously reported method, and the structural context of beta strands and turns with an accuracy that should be useful for tertiary structure prediction.
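The probability that an HMM attributes to a sequence, as mentioned above, is computed with the forward algorithm. The sketch below shows that algorithm on a toy two-state model with a two-letter alphabet. The states, alphabet and all probabilities are invented for illustration and bear no relation to the actual HMMSTR topology or parameters.

```python
def forward_probability(sequence, start, transition, emission):
    """P(sequence | model) for a discrete HMM, by the forward algorithm.

    start[s]: probability of starting in state s.
    transition[s][t]: probability of moving from state s to state t.
    emission[s][symbol]: probability that state s emits symbol.
    """
    if not sequence:
        raise ValueError("sequence must not be empty")
    states = list(start)
    # alpha[s] = P(symbols so far, current state = s)
    alpha = {s: start[s] * emission[s].get(sequence[0], 0.0) for s in states}
    for symbol in sequence[1:]:
        alpha = {
            t: emission[t].get(symbol, 0.0)
            * sum(alpha[s] * transition[s].get(t, 0.0) for s in states)
            for t in states
        }
    return sum(alpha.values())


if __name__ == "__main__":
    # Hypothetical model: "h" = hydrophobic residue, "p" = polar residue.
    start = {"helix": 0.5, "loop": 0.5}
    transition = {"helix": {"helix": 0.9, "loop": 0.1},
                  "loop": {"helix": 0.2, "loop": 0.8}}
    emission = {"helix": {"h": 0.6, "p": 0.4},
                "loop": {"h": 0.3, "p": 0.7}}
    print(forward_probability("hphhp", start, transition, emission))
```

Because the recursion sums over all state paths, it works unchanged for branched topologies. The cost grows with the number of allowed transitions, not with the number of paths.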
Locating protein coding regions in genomic DNA is a critical step in accessing the information generated by large scale sequencing projects. Current methods for gene detection depend on statistical measures of content differences between coding and noncoding DNA in addition to the recognition of promoters, splice sites, and other regulatory sites. Here we explore the potential value of recurrent amino acid sequence patterns 3-19 amino acids in length as a content statistic for use in gene finding approaches. A finite mixture model incorporating these patterns can partially discriminate protein sequences which have no (detectable) known homologs from randomized versions of these sequences, and from short (<= 50 amino acids) non-coding segments extracted from the S. cerevisiae genome. The mixture model derived scores for a collection of human exons were not correlated with the GENSCAN scores, suggesting that the addition of our protein pattern recognition module to current gene recognition programs may improve their performance.
In this paper, we develop data mining techniques to predict 3D contact potentials among protein residues (or amino acids) based on the hierarchical nucleation-propagation model of protein folding. We apply a hybrid approach, using a hidden Markov model (HMM) to extract folding initiation sites, and then apply association mining to discover contact potentials. The new hybrid approach achieves accuracy results better than those reported previously.
We describe the development of a scoring function based on the decomposition P(structure | sequence) proportional to P(sequence | structure) * P(structure), which outperforms previous scoring functions in correctly identifying native-like protein structures in large ensembles of compact decoys. The first term captures sequence-dependent features of protein structures, such as the burial of hydrophobic residues in the core; the second term captures universal sequence-independent features, such as the assembly of beta-strands into beta-sheets. The efficacies of a wide variety of sequence-dependent and sequence-independent features of protein structures for recognizing native-like structures were systematically evaluated using ensembles of approximately 30,000 compact conformations with fixed secondary structure for each of 17 small protein domains. The best results were obtained using a core scoring function with P(sequence | structure) parameterized similarly to our previous work (Simons et al., J Mol Biol 1997;268:209-225) and P(structure) focused on secondary structure packing preferences; while several additional features had some discriminatory power on their own, they did not provide any additional discriminatory power when combined with the core scoring function. Our results, on both the training set and the independent decoy set of Park and Levitt (J Mol Biol 1996;258:367-392), suggest that this scoring function should contribute to the prediction of tertiary structure from knowledge of sequence and secondary structure.
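Because the sequence is fixed while decoys are compared, the Bayesian decomposition above reduces to adding two log-probability terms and ranking. The sketch below shows only that bookkeeping. The decoy names and numerical values are invented, and the actual parameterization of either term in the paper is not reproduced.

```python
def log_posterior_score(log_p_sequence_given_structure, log_p_structure):
    """log P(structure | sequence), up to an additive constant.

    By Bayes' rule the missing constant is -log P(sequence), which is the
    same for every decoy of one sequence and so does not affect ranking.
    """
    return log_p_sequence_given_structure + log_p_structure


def rank_decoys(decoys):
    """Sort decoys from most to least probable under the combined score.

    decoys: list of (name, log_p_sequence_given_structure, log_p_structure).
    """
    return sorted(decoys,
                  key=lambda d: log_posterior_score(d[1], d[2]),
                  reverse=True)


if __name__ == "__main__":
    # Hypothetical decoys with made-up log-probabilities.
    decoys = [
        ("decoy_a", -120.0, -35.0),  # good burial, poor strand pairing
        ("decoy_b", -118.0, -20.0),  # good on both terms
        ("decoy_c", -140.0, -18.0),  # sheet-like but poor burial
    ]
    for name, log_likelihood, log_prior in rank_decoys(decoys):
        print(name, log_posterior_score(log_likelihood, log_prior))
```

A decoy must do well on both terms to rank first, which is why a sequence-independent prior can help discriminate compact decoys that all bury hydrophobic residues comparably.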
We describe a new method for local protein structure prediction based on a library of short sequence patterns that correlate strongly with protein three-dimensional structural elements. The library was generated using an automated method for finding correlations between protein sequence and local structure, and contains most previously described local sequence-structure correlations as well as new relationships, including a diverging type-II beta-turn, a frayed helix, and a proline-terminated helix. The query sequence is scanned for segments 7 to 19 residues in length that strongly match one of the 82 patterns in the library. Matching segments are assigned the three-dimensional structure characteristic of the corresponding sequence pattern, and backbone torsion angles for the entire query sequence are then predicted by piecing together mutually compatible segment predictions. In predictions of local structure in a test set of 55 proteins, about 50% of all residues, and 76% of residues covered by high-confidence predictions, were found in eight-residue segments within 1.4 A of their true structures. The predictions are complementary to traditional secondary structure predictions because they are considerably more specific in turn regions, and may contribute to ab initio tertiary structure prediction and fold recognition.
Previous studies of the conformations of peptides spanning the length of the alpha-spectrin SH3 domain suggested that SH3 domains lack independently folding substructures. Using a local structure prediction method based on the I-sites library of sequence-structure motifs, we identified a seven residue peptide in the src SH3 domain predicted to adopt a native-like structure, a type II beta-turn bridging unpaired beta-strands, that was not contained intact in any of the SH3 domain peptides studied earlier. NMR characterization confirmed that the isolated peptide, FKKGERL, adopts a structure similar to that adopted in the native protein: the NOE and 3JNHalpha coupling constant patterns were indicative of a type II beta-turn, and NOEs between the Phe and the Leu side-chains suggest that they are juxtaposed as in the prediction and the native structure. These results support the idea that high-confidence I-sites predictions identify protein segments that are likely to form native-like structures early in folding. Copyright 1998 Academic Press.
Blind predictions of the local structure of nine CASP2 targets were made using the I-sites library of short sequence-structure motifs, revealing strengths and weaknesses in this new knowledge-based method. Many turns between secondary structural elements were accurately predicted. Estimates of the confidence of prediction correlated well with the accuracy over the whole set. Bias toward structures used to develop the library was minimal, probably because of the extensive use of cross-validation. However, helix positions were better predicted by the PHD program. The method is likely to be sensitive to the quality of the sequence alignment. A general measure for evaluating local structure predictions is suggested.
We have used cluster analysis to identify recurring sequence patterns that transcend protein family boundaries. A subset of these patterns occur predominantly in a single type of local structure in proteins. Here we characterize the three-dimensional structures and contexts in which these sequence patterns occur, with particular attention to the interactions responsible for their structural selectivity.
Considerable progress has been made in understanding the relationship between local amino acid sequence and local protein structure. Recent highlights include numerous studies of the structures adopted by short peptides, new approaches to correlating sequence patterns with structure patterns, and folding simulations using simple potentials.
The 2.4 A crystal structure (R = 0.180) of the serine protease inhibitor ecotin was determined in a complex with trypsin. Ecotin's dimer structure provides a second discrete and distal binding site for trypsin and, as shown by modelling experiments, other serine proteases. The second site is approximately 45 A from the reactive/active site of the complex and features 13 hydrogen bonds, including six that involve carbonyl oxygen atoms and four bridged by water molecules. Contacts ecotin makes with trypsin's active site are similar to, though more extensive than, those found between trypsin and basic pancreatic trypsin inhibitor. The side chain of ecotin Met84 is found in the substrate binding pocket of trypsin where it makes few contacts, but also does not disrupt the solvent structure or cause misalignment of the scissile bond. This first case of protein dimerization being used to augment binding energy and allow chelation of a target protein provides a new model for protein-protein interactions and for protease inhibition.
The authors describe the further development of phase refinement by iterative skeletonization (PRISM), a recently introduced phase-refinement strategy which makes use of the information that proteins consist of connected linear chains of atoms. An initial electron-density map is generated with inaccurate phases derived from a partial structure or from isomorphous replacement. A linear connected skeleton is then constructed from the map using a modified version of Greer's algorithm (1985) and a new map is created from the skeleton. This 'skeletonized' map is Fourier transformed to obtain new phases, which are combined with any starting-phase information and the experimental structure-factor amplitudes to produce a new map. The procedure is iterated until convergence is reached. Significant improvements to the method are described, as is a challenging molecular-replacement test case in which initial phases are calculated from a model containing only one third of the atoms of the intact protein.
A phase-refinement strategy for protein crystallography which exploits the information that proteins consist of connected linear chains of atoms is applied to a molecular-replacement problem, the structure of the protease inhibitor ecotin bound to trypsin, and a single isomorphous replacement problem, the structure of the N-terminal domain of apolipoprotein E. The starting phases for the ecotin-trypsin complex were based on a partial model (trypsin) containing 61% of the atoms in the complex. Iterative skeletonization gave better results than either solvent flattening or twofold non-crystallographic symmetry averaging as measured by the reduction in the free R factor. Protection of the trypsin density during the course of the refinement greatly improved the performance of both skeletonizing and solvent flattening. In the case of apolipoprotein E, the combination of iterative skeletonization and solvent flattening decreased the phase error with respect to the final refined structure significantly more than solvent flattening alone.
The crystal structure of subtilisin BL, an alkaline protease from Bacillus lentus with activity at pH 11, has been determined to 1.4 A resolution. The structure was solved by molecular replacement starting with the 2.1 A structure of subtilisin BPN' followed by molecular dynamics refinement using X-PLOR. A final crystallographic R-factor of 19% overall was obtained. The enzyme possesses stability at high pH, which is a result of the high pI of the protein. Almost all of the acidic side-chains are involved in some type of electrostatic interaction (ion pairs, calcium binding, etc.). Furthermore, three of seven tyrosine residues have potential partners for forming salt bridges. All of the potential partners are arginine with a pK around 12. Lysine would not function well in a salt bridge with tyrosine as it deprotonates at around the same pH as tyrosine ionizes. Stability at high pH is acquired in part from the pI of the protein, but also from the formation of salt bridges (which would affect the pI). The overall structure of the enzyme is very similar to other subtilisins and shows that the subtilisin fold is more highly conserved than would be expected from the differences in amino acid sequence. The amino acid side-chains in the hydrophobic core are not conserved, though the inter-residue interactions are. Finally, one third of the serine side-chains in the protein have multiple conformations. This presents an opportunity to correlate computer simulations with observed occupancies in the crystal structure.
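The argument that arginine, but not lysine, can pair with ionized tyrosine at high pH can be illustrated with the Henderson-Hasselbalch relation. In the sketch below the arginine pK of 12 comes from the abstract. The lysine and tyrosine values are textbook approximations assumed for illustration, and real side-chain pK values shift with the local protein environment.

```python
def fraction_protonated(ph, pka):
    """Henderson-Hasselbalch: fraction of a titratable group holding its proton."""
    return 1.0 / (1.0 + 10.0 ** (ph - pka))


# Arg value is from the abstract; Lys and Tyr values are assumed
# textbook approximations for free side chains.
ASSUMED_PKA = {"Arg": 12.0, "Lys": 10.5, "Tyr": 10.1}


def charged_fractions(ph):
    """Fraction of each side chain in its charged form at the given pH."""
    return {
        # Tyrosine is charged (anionic) once it has lost its proton.
        "Tyr(-)": 1.0 - fraction_protonated(ph, ASSUMED_PKA["Tyr"]),
        # Arginine and lysine are charged (cationic) while protonated.
        "Arg(+)": fraction_protonated(ph, ASSUMED_PKA["Arg"]),
        "Lys(+)": fraction_protonated(ph, ASSUMED_PKA["Lys"]),
    }


if __name__ == "__main__":
    for group, fraction in charged_fractions(11.0).items():
        print(f"{group}: {fraction:.2f}")
```

Under these assumed values, at pH 11 most tyrosine is ionized and most arginine is still protonated, while most lysine has already lost its charge. This is consistent with the abstract's reasoning about which partner can sustain a salt bridge.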
The crystal structure of unliganded dihydrofolate reductase (DHFR) from Escherichia coli has been solved and refined to an R factor of 19% at 2.3-A resolution in a crystal form that is nonisomorphous with each of the previously reported E. coli DHFR crystal structures [Bolin, J. T., Filman, D. J., Matthews, D. A., Hamlin, R. C., & Kraut, J. (1982) J. Biol. Chem. 257, 13650-13662; Bystroff, C., Oatley, S. J., & Kraut, J. (1990) Biochemistry 29, 3263-3277]. Significant conformational changes occur between the apoenzyme and each of the complexes: the NADP+ holoenzyme, the folate-NADP+ ternary complex, and the methotrexate (MTX) binary complex. The changes are small, with the largest about 3 A and most of them less than 1 A. For simplicity a two-domain description is adopted in which one domain contains the NADP+ 2'-phosphate binding site and the binding sites for the rest of the coenzyme and for the substrate lie between the two domains. Binding of either NADP+ or MTX induces a closing of the PABG-binding cleft and realignment of alpha-helices C and F which bind the pyrophosphate of the coenzyme. Formation of the ternary complex from the holoenzyme does not involve further relative domain shifts but does involve a shift of alpha-helix B and a floppy loop (the Met-20 loop) that precedes alpha B. These observations suggest a mechanism for cooperativity in binding between substrate and coenzyme wherein the greatest degree of cooperativity is expressed in the transition-state complex. We explore the idea that the MTX binary complex in some ways resembles the transition-state complex.
The crystal structure of dihydrofolate reductase (EC 1.5.1.3) from Escherichia coli has been solved as the binary complex with NADP+ (the holoenzyme) and as the ternary complex with NADP+ and folate. The Bragg law resolutions of the structures are 2.4 and 2.5 A, respectively. The new crystal forms are nonisomorphous with each other and with the methotrexate binary complex reported earlier [Bolin, J. T., Filman, D. J., Matthews, D. A., Hamlin, R. C., & Kraut, J. (1982) J. Biol. Chem. 257, 13650-13662]. In general, NADP+ and folate binding conform to predictions, but the nicotinamide moiety of NADP+ is disordered in the holoenzyme and ordered in the ternary complex. A mobile loop (residues 16-20) involved in binding the nicotinamide is also disordered in the holoenzyme. We report a detailed analysis of the binding interactions for both ligands, paying special attention to several apparently strained interactions that may favor the transition state for hydride transfer. Hypothetical models are presented for the binding of 7,8-dihydrofolate in the Michaelis complex and for the transition-state complex.