Improving hybrid statistical and physical forcefields through local structure enumeration

Patrick Conway; Frank DiMaio

doi:10.1002/pro.2956

. 2016 Jun 16;25(8):1525–1534. doi: 10.1002/pro.2956

Improving hybrid statistical and physical forcefields through local structure enumeration

Patrick Conway ¹, Frank DiMaio ^1,^2,^✉

PMCID: PMC4972208 PMID: 27239808

Abstract

Forcefields used in biomolecular simulations are comprised of energetic terms that are physical in nature, based on parameter fitting to quantum mechanical simulation or experimental data, or statistical, drawing off high‐resolution structural data to describe distributions of molecular features. Combining the two in a single forcefield is challenging, since physical terms describe some, but not all, of the observed statistics, leading to double counting. In this manuscript, we develop a general scheme for correcting statistical potentials used in combination with physical terms. We apply these corrections to the sidechain torsional potential used in the Rosetta all‐atom forcefield. We show the approach identifies instances of double‐counted interactions, including electrostatic interactions between sidechain and nearby backbone, and steric interactions between neighboring Cβ atoms within secondary structural elements. Moreover, this scheme allows for the inclusion of intraresidue physical terms, previously turned off to avoid overlap with the statistical potential. Combined, these corrections lead to a forcefield with improved performance on several structure prediction tasks, including rotamer prediction and native structure discrimination.

Keywords: protein structure prediction, torsional potential, rotamer library, sidechain prediction, Rosetta

Introduction

Commonly used biomolecular forcefields1, 2, 3, 4, 5, 6 fall broadly into two classes: knowledge‐based forcefields, where pairwise atom distributions are converted to an energy function using the Boltzmann distribution, and empirical, or molecular mechanics (MM), forcefields, where a set of score terms characterizing physical interactions are parameterized and fit to experimental and quantum mechanical simulation data on small molecules. The Rosetta forcefield7, 8 takes a combined approach, where individual score terms model physical phenomena, but are parameterized against statistics from high‐resolution macromolecular structures. Using these sources of information in combination offers some advantages: the tens of thousands of known high‐resolution structures provides a larger pool of data for parameter fitting, as well as accounting for the environment differences between a protein core and small molecules in solution. A significant drawback to this approach is the redundancy that arises when a statistical potential implicitly contains contributions from interactions already accounted for by a physical term.9

Several approaches have been developed to deal with these “double counting” interactions. Generally, previous approaches have iterated between adjusting parameters of the forcefield, measuring some output distribution from refined structures, and comparing this distribution to some target distribution. Forcefield parameters are retuned until the distribution of refined models agrees with the target distribution. In the context of Rosetta, this approach has been used to correct double‐counting interactions between torsional potentials and other energy function components,9 as well as overlap between hydrogen bonding, electrostatics, and Lennard–Jones interactions.8 A related approach is performed to parameterize torsional potentials in molecular dynamics forcefields.10, 11, 12 While effective, these iterative approaches are time‐consuming; furthermore, it is necessary to perform these changes each time a scorefunction component changes.

In this manuscript, we introduce a rapid and automatic fragment‐based enumerative approach for remedying double counting artifacts in biomolecular energy functions. Our approach explicitly computes portions of statistical torsional energies already accounted for by the physical energy terms, and applies this as a correction factor to the statistical terms. To test the effectiveness of this approach, we apply this method to correct double counting interactions in the Rosetta sidechain torsional potential. We identify several specific cases where these double counting interactions are taking place, some of which have previously been identified. Finally, we show that these corrections result in improved performance on several structure prediction tasks.

Results

An overview of our enumerative forcefield correction strategy is fully described in “Methods” section. Briefly, the approach uses a large set of local backbone fragments taken from previously solved structures. We compute a “mean‐field” energy for each fragment using only physical energy terms, aggregated over all fragments with a given Ψ/Φ setting and amino acid identity. Using the Boltzmann distribution, these scores are converted to probabilities, which are then applied as a correction factor to the corresponding statistical energy terms. In this way, we are “cancelling out” the statistical energies already accounted for by the physical potential.

Figure 1 presents an overview of this strategy, as applied to the Rosetta backbone‐dependent statistical sidechain torsional potential13 (E _rotlib). This potential is a statistical potential fit to high‐resolution crystal structures and smoothed with adaptive kernel density estimates, correspondingly converted to an energy using the Boltzmann distribution [Fig. 1(B), left]. Portions of this probability distribution already accounted for by physical energy terms in Rosetta are thus double counted [Fig. 1(B), right]. To address this double counting, we estimate the mean field energy, Ê, as the average physical energy for a structure with the given sidechain and backbone torsions [Fig. 1(C)]. We subtract this from the energy given directly from the statistical potential, E _rot, such that the new potential $E_{rot}^{*}$ [Fig. 1(D)] when combined with this mean field energy, gives back the original statistical potential. That is:

E_{rot}^{*} (χ_{1.4} | ψ, ϕ) = E_{rot} (χ_{1.4} | ψ, ϕ) - \bar{E} (χ_{1.4} | ψ, ϕ)

An illustration of our strategy for computing corrections to the sidechain torsional potential. A: Serine is modeled in Rosetta using one of three potential rotamers observed at the given backbone torsions. B: The rotamer library encodes frequencies of each rotamer observed in crystal structures. Due to double counting with physical terms in the Rosetta energy function, the modeled distribution does not match the observed distribution. C: We account for the double counting from physical terms by computing a mean field energy for each of the rotamers. This is done by building and minimizing a given rotamer on a diverse set of fragments sharing the same backbone torsion and averaging the physical energies. D: We generate a new rotamer library by subtracting the mean field energies from the rotamer library energies. The new library will correctly sum up with physical energies during modeling to recapitulate the frequencies of the original library.

To compute $\bar{E}$ , we use enumerative sampling of local structures [Fig. 1(C)]. For a given Ψ/Φ and amino‐acid identity, we find all fragments from high‐resolution structures adopting that particular Ψ/Φ and with that particular identity at the center position. We then enumerate all possible settings of sidechain torsions $χ_{1.4}$ , and evaluate their physical energy. Repeating this for every high‐resolution fragment with the given Ψ/Φ and averaging gives us the value of Ê for a single setting.

As an initial test, we applied this protocol to the default Rosetta score function (Talaris20148). In particular, we wanted to evaluate the ability of the corrected scorefunction to predict core and interface sidechain rotamers in their native context using a rotamer recovery protocol called RTMin.14 In RTMin, for a given sidechain, every rotamer is built and minimized, with the lowest energy conformation chosen. Recovery of native conformation is determined by comparing the predicted sidechain to density maps derived from the deposited high‐resolution crystallographic data. If the native sidechain is well resolved, and the predicted sidechain shows similar agreement to the experimental data, the prediction is correct [Fig. 2(A)]. If the native sidechain is well resolved and the predicted sidechain has a significantly lower correlation, the prediction is wrong [Fig. 2(B)]. By evaluating predictions against experimental data—in contrast to comparing predictions directly to the native structure—this metric avoids noise due to evaluating disordered sidechains that have been modeled in the deposited structure [Fig. 2(C)]. The sensitivity of this metric is relatively high: for very well‐resolved sidechains it will fail if χ ₁ or χ ₂ deviates more than 20 degree. This evaluation was carried out on a set of high‐resolution crystal structures not used in computing the corrections and parameter tuning, and was further subdivided into predictions in monomeric proteins and predictions at protein–protein interfaces.

Agreement to high‐resolution experimental density is used to evaluate sidechain rotamer prediction. A: The predicted sidechain conformation (cyan) agrees with the native sidechain (green), and the native sidechain is well resolved in density, indicating a correct prediction. B: The predicted sidechain conformation disagrees with the native sidechain, and the native sidechain is well resolved in density, indicating an incorrect prediction. C: The predicted sidechain conformation disagrees with the native sidechain, but the native is poorly resolved in density. This prediction is not included in the calculations.

Our scheme identified a number of double‐counting pathologies, some of which are illustrated in Figure 3. One major source of these errors is the electrostatics term in Rosetta. For example, Figure 3(A,B) shows cases where double counting between the rotameric potential and electrostatics leads to a bias toward rotamers that form favorable electrostatic interactions with nearby backbone atoms. In the cases shown, this bias leads toward a favoring of these local backbone interactions rather than correctly forming a hydrogen bond to a residue distant in sequence. After correcting the probability of the incorrect rotamer (Fig. 3, middle) to account for the electrostatics contribution from the physical term (Fig. 3, black box in left column), the correct rotamer (Fig. 3, right) is now favored.

Energy corrections identify and resolve rotamer library‐based sidechain prediction errors. Each row demonstrates an example of a corrected prediction error. The energy corrections column shows the backbone‐dependent change in correction magnitude for the rotamer shown in the error column where the black box highlights the ψ/ϕ setting of the given example. The largest correction (in red) represents a penalty of three energy units. The error column shows the structure of the incorrectly placed rotamer, while the corrected column shows the (correct) rotamer placement following application of the correction factors. Rows A to C demonstrate double counting errors caused by score terms that reward interactions with adjacent residues: (A) serine making favorable electrostatic interactions with nearby backbone atoms; (B) glutamate making favorable electrostatic interactions with nearby backbone atoms; and (C) arginine making a hydrogen‐bonding interaction with nearby backbone. After corrections, all three rotamers are correctly placed in conformations making long‐range interactions. Rows D and E show rotamer library errors occurring at the end of helices: (D) the correct glutamate rotamer is disfavored as it would clash in the center of a helix, but is preferred for capping a helix; and (E) isoleucine would clash in a straight helix and is thus disfavored, but does not in a kinked helix.

Additionally, there are also cases where hydrogen bonding with an adjacent residue leads to double counting, and, consequently, a bias toward a particular rotamer. Figure 3(C) shows one such case, where the uncorrected energy function prefers that an arginine make a relatively weak hydrogen bond to an adjacent backbone residue due to double counting with the sidechain torsion potential. After applying our correction scheme, the energy function now correctly prefers the native rotamer.

A third class of double‐counting interactions was identified in residues at the N‐terminus of helices [Fig. 3(D)]. While studies have demonstrated the influence of helix‐capping motifs on residue and rotamer distributions, the Rosetta sidechain torsional potential does not consider the backbone torsion angles of neighboring residues15, 16, 17 (however, there has been some previous effort on capturing preferences of backbone torsions in neighboring pairs18). Consequently, sidechains with helical backbone torsions have rotamer probabilities biased to those found along the middle of the helix rather than the ends. Similarly, Figure 3(E) shows a case where a mistake was corrected on a kinked helix where that particular sidechain conformation would be disfavored in the middle of a helix due to clashes.

As a result of the torsional correction protocol addressing pathologies such as those shown in Figure 3, the error rate in the one‐at‐a‐time recovery in native context RTMin protocol shows a small but statistically significant (Table 1) decrease in rotamer recovery in monomeric proteins. We see similar improvements in rotamer recovery at protein–protein interfaces. Figure 4 illustrates this error reduction upon application of the sidechain torsional corrections. In predicting core rotamers, the error rate decreases from 4.60% to 4.33%; for residues at protein–protein interfaces, the error rate decreases from 8.50% to 8.20%.

Table 1.

An Overview of Improvements in Structure Prediction Tasks Using the Modified Sidechain Torsional Potential Following Our Correction Scheme

Test	Test set	Default (std. dev.)	Modified + corrected (std. dev.)
RTMin rotamer recovery % error rate	Core	4.60 (0.047)	4.15 (0.047)
RTMin rotamer recovery % error rate	Interface	8.50 (0.33)	8.14 (0.31)
PackRotamers rotamer recovery % error rate	Core	14.72 (0.082)	14.20 (0.081)
PackRotamers rotamer recovery % error rate	Interface	21.06 (0.48)	20.52 (0.47)
Kullback–Leibler divergence from native rotamer distribution	Core	0.0080 (0.00032)	0.0052 (0.00023)
	Interface	0.0221 (0.0049)	0.0157 (0.0042)
Sequence recovery %	Seq recov	44.7 (0.22)	44.8 (0.24)
Sequence profile recovery %	Seq recov	74.3 (0.16)	74.6 (0.19)
Decoy discrimination	Decoy disc	−0.1708	−0.1802

Open in a new tab

All listed tests show a statistically significant difference between default and modified + corrected parameters at a P‐value of 0.05 for all bootstrapped assessments.

Improvements in rotamer recovery following removal of double‐counting interactions. (Top) Core and interface rotamer recovery results are both improved after application of corrections (see corrected bar). Furthermore, after addition of physical intra‐residue score terms and re‐applying corrections (see modified + corrected bar), we see further reduction in errors. (Bottom) Turning off the statistical torsional potential, we see worsening of rotamer recovery. This worsening is only minimally improved if intra‐residue physical terms are enabled.

To reduce the effect of double‐counting interactions, all intra‐residue physical interactions are disabled or are significantly down weighted within Rosetta. However, our correction scheme allows these interactions to be combined with statistical potentials without large double‐counting side effects. By performing enumerative correction to remove double counting arising from the use of physical intra‐residue energies, then using the physical and corrected energy terms in combination, we see a lower error rate than overall. Combining this with a loosening of the harmonic constraints governing off‐rotamer deviations (as some of this is accounted for by the physical intra‐residue energy terms) results in further improvement for a core rotamer recovery error rate of 4.15% and interface rotamer recovery rate of 8.14% (Fig. 4). Figure 5(A,B) illustrates that these improvements are seen across nearly all residue identities, although not all residues show statistically significant improvement in interface rotamer recovery (Supporting Information Table S1). Using only these intra‐residue energetic terms, without the corresponding torsional potential corrections, shows a significantly worse recovery than using the statistical potential alone (Fig. 4, control panel).

Rotamer distributions following sidechain prediction show improved agreement to native structures. A comparison of the default Rosetta energy function versus our modified + corrected energy function under two different evaluation metrics. (Top) Per‐residue rotamer recovery results for core (a) and interface (b) sidechains. (Bottom) Per‐residue Kullback–Leibler (KL) divergence of predicted sidechain rotamers versus native sidechain rotamers. In all plots, orange indicates the default Rosetta energy function, while blue corresponds to our modified + corrected energy function. Most, but not all, residues show an improvement under both metrics.

We next wanted to assess the applicability of these corrections to more general tasks in protein structure prediction and protein design. Table 1 shows the performance on several other benchmark protein structure modeling tasks. All protocols use the modified energy function with physical intra‐residue terms present. In the first test, PackRotamers, we predict all sidechain rotamers simultaneously given a native protein backbone. Evaluation of this test uses the same density‐agreement metric described earlier to assess the correctness of rotamer predictions. In contrast to RTMin, PackRotamers assesses cooperative influences in sidechain conformation sampling and presents a more realistic task for sidechain rotamer prediction. Using the same core and interface sidechain sets from the RTMin assessment, we see similar reductions in error (from 14.72% to 14.20% on the core recovery, and from 21.06% to 20.52% on the interface recovery), albeit at a higher overall error rate.

Using this test, we also wanted to see whether distributions of predicted rotamers better matched native rotameric distributions. Using the predicted conformations from the PackRotamers test, we compared the Kullback–Leibler divergence of predicted rotamers to that of the native structures in the set. We split these results into protein core and protein interface predictions, and compare the distributions both before and after application of our corrections. The results are illustrated in Table 1 as well as Figure 5(C,D). On average, the rotamer distribution is more native‐like, though a few residue types show a higher divergence from the native distribution.

The gold standard for benchmarking Rosetta energy function modifications is the “landscape relax” benchmark,19 in which pregenerated conformations covering a significant portion of conformational space are refined with the current energy function, before being evaluated for native‐like structure discrimination. In this test, we start from a large set of pre‐generated conformations, representing local minima of the energy function over all of topology space. To evaluate the discriminative power of an energy function, each conformation is first refined in the target energy function using the FastRelax protocol.20 Then, the discrimination power is quantified by calculating the normalized energy gap between the lowest energy native and non‐native models. Larger energy gaps between near‐native and non‐native models result in a better discrimination score, and consequently, a more discriminative energy function. In contrast to rotamer recovery protocols, the FastRelax protocol permits backbone degrees of freedom to be minimized, provides a more complete picture of energy function correctness over a wide range of conformational space. The average discrimination score, normalized by the 5th to 95th percentile spread in energies, is −0.1708 for the default scorefunction, and improves to −0.1802 with our corrections (Fig. 6).

Native structure discrimination following all‐atom refinement is improved. Decoy discrimination of broadly sampled protein conformational landscapes following all‐atom refinement shows an improvement with our energy function modifications. (a) A scatterplot showing native “discrimination score” over a set of 82 target proteins, with the standard energy function on the x‐axis, and our corrected energy function on the y‐axis. More cases show an improvement in native‐structure discrimination than show a reduction (lower scores indicate a larger energy gap for near‐native structures compared to distant). (b) Three examples from the plot on the left showing the improvements in native structure discrimination: the x‐axis indicates the RMS to native, and the y‐axis indicates the all‐atom energy. The three plots on the left show the energy landscapes for the standard energy function; the three on the right show that for our corrected function.

Finally, we test the ability of the energy function to recover native sequence. Starting from a subset of the structures used in rotamer recovery, we predict the native residue identity of each position, assuming the residue identity, but not rotamer, for all other residues in the protein is known. In this test, our corrections minimally affect sequence recovery, with corrected and uncorrected scorefunctions achieving 44.8% and 44.9% sequence recovery, respectively. If we consider a predicted identity correct if it matches the native sequence, or any residue identity taken in evolutionarily related sequences, we see an improvement from 74.5% to 74.7% sequence recovery by our corrections.

Discussion

Double counting of energetic and statistical terms is a known challenge in biomolecular forcefields that combine knowledge‐based potentials with physical potentials. We have shown a single‐pass approach for correcting double counting interactions in statistical torsional potentials. Applying this scheme to the sidechain torsional potential of the Rosetta energy function, we identify several double‐counting pathologies, including double‐counting interactions caused from sidechain/backbone electrostatic interactions, as well as those caused by steric interactions in helices, where rotamers likely to clash in the middle of a helix are overly disfavored in helix termini or kinked helices.

We further demonstrated that we could further improve sidechain conformation prediction by reincorporating physical intra‐residue terms, which were previously disabled to avoid double counting with the statistical potential. While physical score terms suffer from parameterization issues and don't model all possible molecular interactions, they have finer granularity and more generalizability. By allowing knowledge‐based potentials to supplement physical potentials rather than displacing them, the energy function becomes more accurate and useful.

Most importantly, this method enables further energy function improvements by providing a method by which sidechain torsional corrections may be automatically corrected following other score function changes. For instance, the electrostatic term may be artificially dampened due to these double‐counting interactions; the method described in this manuscript might allow it to be further strengthened. Furthermore, these corrections might be applied to other statistical score terms in the Rosetta energy function. For example, a similar correction strategy could be employed to correct backbone torsional potentials, or the knowledge‐based hydrogen‐bonding potential. The approach of this manuscript should prove generally valuable for combining statistical and physical terms in a biomolecular forcefield.

Methods

Dataset

The benchmarks in this paper primarily made use of the Richardson Top8000 database21 of high‐resolution crystal structures. For the protein core benchmarks, four thousand structures with electron density maps were taken from the Top8000 set and evenly split into a training set and a test set. The fragment sampling protocol used the whole Top8000 set, excluding the two thousand proteins from the test set.

All protein interface benchmarks use sidechains involved in protein–protein interfaces (Cβ–Cβ distances < 10 Å) from 153 protein complexes with high‐resolution crystal structures (2 Å or better). They were selected by first identifying high‐resolution heterodimeric with deposited structure factors, removing antibody structures (due to the large number of antibody structures present, we wanted to avoid biasing the results to them too much) and removing highly intertwined structures. The decoy discrimination benchmark used the set of 84 small monomeric proteins described by Conway et al.19 Sequence recovery used a subset (∼500 structures) of the Top8000 set, chosen at random.

Statistical rotamer library corrections

The double counting correction protocol makes corrections to the statistical rotamer library by using Rosetta energies, which were computed and averaged across a large set of proteins. Fragments were mined from high‐resolution crystal structures, with the secondary structure assignment of the central residue (as assessed by DSSP22) used to determine the size and the context of each fragment used in energy evaluation: if the central residue was helical, then a 9 residue fragment centered on the target residue was used; if the central residue was a loop, then a 5 residue fragment was used; and if the central residue was a strand, then a 5 residue fragment was taken along with all other residues making backbone‐to‐backbone hydrogen bonds with this fragment (for a total of up to 15 residues).

For each fragment, the center residue maintained its native identity, while all non‐proline and non‐glycine residues in the fragment were mutated to alanine. For prolines in the non‐central position, the rotamer of the native structure was used and no rotamer sampling was carried out later. Any fragments with large internal clashes (defined as having Rosetta repulsive energy greater than 2.0) were discarded. This provides a set of fragments for each residue type, modeling all the local environments around that residue type observed in high‐resolution crystal structures.

To compute the correction factors for a given residue type, for each fragment all rotamers of the amino acid type of the central residue were built and minimized using a modified scorefunction that excluded the rotameric energy term. Semi‐rotameric sidechain torsions were sampled on a 10‐degree grid. During minimization, sidechain torsions were constrained with harmonic constraints penalizing off‐rotameric conformations, but the energy of these constraints was not used in the final energy calculations following minimization. For non‐rotameric sidechain torsions, no constraint was applied; instead if minimization moved the structure out of the 10‐degree bin, it was reset to the bin center. For a given fragment, rotamer energies were capped to a maximum of three Rosetta energy units above the best‐scoring rotamer:

E_{frag}^{*} (χ_{1.4} | ψ, ϕ) = \min {E_{frag}^{} (χ_{1.4} | ψ, ϕ), \min_{χ_{1.4}^{*}} (E_{frag}^{} (χ_{1.4}^{*} | ψ, ϕ)) + C}

Here, E _frag is the physical energy of a single fragment with the given torsions at the center residue. $E_{frag}^{*}$ is the “capped” value, that is at most, a fixed value C energy units worse than the best‐scoring rotamer of that particular fragment. We found that a value of C = 3 empirically works well.

For a given rotamer, residue type, and Ψ/Φ torsion bin combination, these energies were averaged across all fragments to compute the correction factor:

\bar{E} (χ_{1.4} | ψ, ϕ) = \frac{1}{N} \sum_{frag} E_{frag}^{*} (χ_{1.4} | ψ, ϕ)

Finally, we used the computed Rosetta energies to make adjustments to the probabilities in the rotamer library. First, to account for the downweighting of the rotamer library scoring term by 0.7 during normal Rosetta use, we scaled all Rosetta energies up by 1/0.7. We next computed Boltzmann probabilities for all rotamers (relative to all other rotamers of the same residue type within a given Ψ/Φ bin) by taking the exponential of the negative rotamer energies and normalizing across all rotamers in each Ψ/Φ bin.

\bar{P} (χ_{1.4} | ψ, ϕ) = \frac{e^{- \bar{E} (χ_{1.4} | ψ, ϕ)}}{\sum_{χ_{1.4}^{*}} e^{- \bar{E} (χ_{1.4}^{*} | ψ, ϕ)}}

To avoid overfitting, particularly in Ψ/Φ bins with small numbers of representative fragments, a Gaussian convolution was applied to every rotamer across Ψ/Φ space with 10 degree standard deviation. A Gaussian convolution was also applied to the probability distributions for semi‐rotameric torsional distributions using a standard deviation of 5 degrees. These corrected probabilities were renormalized, giving us a sidechain torsional probability distribution implied by the physical model.

Correcting the rotamer library is then a matter of dividing out the Rosetta‐energy rotamer probabilities from the rotamer library probabilities.

P_{rot}^{*} (χ_{1.4} | ψ, ϕ) = \frac{P_{rot}^{} (χ_{1.4} | ψ, ϕ)}{\bar{P} (χ_{1.4} | ψ, ϕ)}

Larger corrections are applied to rotamers that are strongly favored or disfavored by the physical energy terms. If the physical energy terms are equal across all rotamers of a given residue type, then the same correction is applied to all rotamers, and after renormalization the original distribution will remain unchanged. Rotamer libraries containing the renormalized P _corrected are written and read in by Rosetta in the same manner as the original rotameric distributions.13, 23

Rotamer recovery

Rotamer recovery is a fixed backbone sidechain rebuild protocol that assesses Rosetta's ability to sample and evaluate potential sidechain conformations. Two approaches were taken in this paper—one‐at‐a‐time recovery in native context (RTMin14) and combinatorial all‐at‐once recovery (PackRotamers24). While both protocols sample all available rotamers chis at their mean value and ±1 standard deviation, RTMin also minimizes the sampled rotamer. Following visual inspection, conformation recovery was defined as a difference in electron density correlation between native and predicted conformations of 0.13 or lower.

Recovery rates are measured on a set of 2000 structures from the Top8000 set using electron density data. Only well‐resolved sidechains, defined as a minimum electron density correlation of 0.72 (that is the correlation between density computed from the deposited structure, and a 2mF _o − dF _c density map computed from the deposited data) and a maximum b‐factor of 30, were considered in recovery assessments. Additionally, sidechains with multiple PDB conformations and potential crystal contacts were eliminated from consideration. Residues were eliminated as potential crystal contacts if at some position, any rotamer would potentially contact any symmetry neighbor atom (within 3.5 Å for polar atoms and 2 Å for nonpolar). While RTMin only rebuilt and assessed well‐resolved sidechains, PackRotamers rebuilt all sidechains and then assessed only the well‐resolved sidechains.

Bootstrapping was used to determine statistical significance of the rotamer recovery error rates. For each assessment, ten thousand equal sized replicates were created by performing sampling with replacement. From the replicates, the mean and standard deviation of the error rate was computed. An independent two‐sample t‐test was used to compare the default score function to the experimental score functions (Table 1). Furthermore, the PackRotamers protocol was performed five times and the results were merged to create the data set from which replicates were drawn.

Kullback–Leibler divergence

The native distribution was taken directly from the same data set used for rotamer recovery. The distribution consisted of counts for each unique rotamer. Nonrotameric chis are not considered; only the rotameric chis preceding the nonrotameric chis are counted. For each parameter set, the PackRotamers protocol was performed and counts gathered from the ensuing distribution. Once the counts were obtained for each distribution, the standard Kullback–Leibler divergence calculation was computed for both per‐residue counts and overall counts where the counts were normalized to sum to 1.

S = \sum_{r \in rotset} [predicted_count (r) * \log (\frac{predicted_count (r)}{native_count (r)})]

The same approach to bootstrapping the PackRotamer results to determine statistical significance was used here.

Batchrelax decoy discrimination

Decoy discrimination quantifies the discriminatory ability of an energy function by relaxing a large set of decoy structures for each target in a set of 82 structures and comparing the energies of the near‐native decoys to the energies of non‐native decoys.19 A discrimination score is then calculated based on correlation between score and RMSD where a separate discrimination score is computed for each protein using normalized energies and then averaged. Each protein discrimination score is an average of seven separate discrimination scores taken at the RMS values r = [1,1.5,2,2.5,3,4,6], defined as the difference in normalized energy between the lowest energy structure above and below the boundary r.

S = \sum_{r \in {1, 1.5, 2, 2.5, 3, 4, 6}} \min_{i, RMS (i) \in [0, r]} E_{i} - \min_{i, RMS (i) \in (r, \infty]} E_{i}

For this paper, we reduced the amount of compute time required down to 7.5% of original by making a few optimizations. First, the input set was reduced from 4500 structures to 1000 where the RMS range was still evenly represented. Next, Batchrelax was performed in place of Fastrelax. Batchrelax is a competitive optimization algorithm that culls the population of structures each time the ramp–repack–min cycle is repeated. Each batch has 20 structures and is grouped by RMSD to native. The culling operation eliminates the 20% highest‐energy structures of the batch. For 1000 structures per PDB target, we performed Batchrelax on 50 batches and merged the results before computing the discrimination score.

Sequence recovery

Sequence recovery used a subset of the test set used in rotamer recovery tests. A subset of 442 structures was randomly selected for runtime considerations. For every amino acid in these structures, the residue identity and all sidechain conformations were predicted using four cycles of Rosetta's repack and sidechain minimization, with repulsive energies ramped following the schedule of Rosetta relax. For both default and the corrected + modified energy functions, reference weights were refit specifically for this task using a similar strategy by Leaver‐Fay et al.23; reported sequence recovery values are after reference weight fitting. For reporting sequence profile recovery, we used hhblits25 with an e‐value cutoff of 10⁻³ to identify homologous sequences. A profile recovery match was recorded if any homologous sequence had a residue of the predicted identity at a target position. Bootstrapping was also performed to determine statistical significance of these results, but only 32 replicates were obtained since each replicate requires refitting of reference weights before sequence recovery and sequence profile recovery could be reported.

Availability

The corrected statistical potential is available in Rosetta releases after week X , 2016. It may be accessed by providing the flags after week 24, 2016:

‐dun10_dir rotamer/corrections_conway2016
‐score::weights corrections_conway2016.wts

Additionally, an application, torsional_potential_correction is available as part of the Rosetta software suite. Documentation on its usage is added to the Rosetta documentation. It may be applied to alternate scorefunctions, to allow computation of customized corrected libraries.

Supporting information

Supporting Information

Click here for additional data file.^{(31.9KB, docx)}

References

1. Brooks BR, Brooks CL 3rd, Mackerell AD Jr, Nilsson L, Petrella RJ, Roux B, Won Y, Archontis G, Bartels C, Boresch S, Caflisch A, Caves L, Cui Q, Dinner AR, Feig M, Fischer S, Gao J, Hodoscek M, Im W, Kuczera K, Lazaridis T, Ma J, Ovchinnikov V, Paci E, Pastor RW, Post CB, Pu JZ, Schaefer M, Tidor B, Venable RM, Woodcock HL, Wu X, Yang W, York DM, Karplus M (2009) CHARMM: the biomolecular simulation program. J Comput Chem 30:1545–1614. [DOI] [PMC free article] [PubMed] [Google Scholar]
2. Salomon‐Ferrer R, Case DA, Walker RC (2013) An overview of the Amber biomolecular simulation package. Wiley Interdiscip Rev Comput Mol Sci 3:198–210. [Google Scholar]
3. Christen M, Hünenberger PH, Bakowies D, Baron R, Bürgi R, Geerke DP, Heinz TN, Kastenholz MA, Kräutler V, Oostenbrink C, Peter C, Trzesniak D, van Gunsteren WF. (2005) The GROMOS software for biomolecular simulation: GROMOS05. J Comput Chem 26:1719–1751. [DOI] [PubMed] [Google Scholar]
4. Jorgensen WL, Maxwell DS, Tirado‐Rives J (1996) Development and testing of the OPLS all‐atom force field on conformational energetics and properties of organic liquids. J Am Chem Soc 118:11225–11236. [Google Scholar]
5. Zhou H, Skolnick J (2011) GOAP: a generalized orientation‐dependent, all‐atom statistical potential for protein structure prediction. Biophys J 101:2043–2052. [DOI] [PMC free article] [PubMed] [Google Scholar]
6. Shen M, Sali A (2006) Statistical potential for assessment and prediction of protein structures. Protein Sci 15:2507–2524. [DOI] [PMC free article] [PubMed] [Google Scholar]
7. Kuhlman B, Dantas G, Ireton G, Varani G, Stoddard B, Baker D (2003) Design of a novel globular protein fold with atomic‐level accuracy. Science 21:1364–1368. [DOI] [PubMed] [Google Scholar]
8. O'Meara MJ, Leaver‐Fay A, Tyka MD, Stein A, Houlihan K, DiMaio F, Bradley P, Kortemme T, Baker D, Snoeyink J, Kuhlman B (2015) Combined covalent‐electrostatic model of hydrogen bonding improves structure prediction with Rosetta. J Chem Theory Comput 11:609–622. [DOI] [PMC free article] [PubMed] [Google Scholar]
9. Song Y, Tyka M (2011) Structure‐guided forcefield optimization. Proteins 79:1898–1909. [DOI] [PMC free article] [PubMed] [Google Scholar]
10. Cornell WD, Cieplak P, Bayly CI, Gould IR, Merz KM, Ferguson DM, Spellmeyer DC, Fox T, Caldwell JW, Kollman PA (1995) A second generation force field for the simulation of proteins, nucleic acids, and organic molecules. J Am Chem Soc 117:5179–5197. [Google Scholar]
11. Damm W, van Gunsteren WF (2000) Reversible peptide folding: dependence on molecular force field used. J Comput Chem 21:774–787. [Google Scholar]
12. MacKerell AD, Feig M, Brooks CL (2004) Improved treatment of the protein backbone in empirical force fields. J Am Chem Soc 126:698–699. [DOI] [PubMed] [Google Scholar]
13. Shapovalov MV, Dunbrack RL (2011) A smoothed backbone‐dependent rotamer library for proteins derived from adaptive kernel density estimates and regressions. Structure 19:844–858. [DOI] [PMC free article] [PubMed] [Google Scholar]
14. Wang C, Schueler ‐, Furman O Baker D (2005) Improved side‐chain modeling for protein–protein docking. Protein Sci 14:1328–1339. [DOI] [PMC free article] [PubMed] [Google Scholar]
15. Richardson J, Richardson D (1988) Amino acid preferences for specific locations at the ends of alpha helices. Science 242:1624. [DOI] [PubMed] [Google Scholar]
16. Doig A, Baldwin R (1995) N‐and C‐capping preferences for all 20 amino acids in α‐helical peptides. Protein Sci 4:1325–1336. [DOI] [PMC free article] [PubMed] [Google Scholar]
17. Stapley BJ, Doig AJ (1997) Free energies of amino acid side‐chain rotamers in alpha‐helices, beta‐sheets and alpha‐helix N‐caps. J Mol Biol 272:456–464. [DOI] [PubMed] [Google Scholar]
18. Ting D, Wang G, Shapovalov M, Mitra R, Jordan M, Dunbrack RL (2010) Neighbor‐dependent Ramachandran probability distributions of amino acids developed from a hierarchical Dirichlet process model. PLoS Comput Biol 6:e1000763. [DOI] [PMC free article] [PubMed] [Google Scholar]
19. Conway P, Tyka MD, DiMaio F, Konerding DE, Baker D (2014) Relaxation of backbone bond geometry improves protein energy landscape modeling. Protein Sci 23:47–55. [DOI] [PMC free article] [PubMed] [Google Scholar]
20. Das R, Baker D (2008) Macromolecular modeling with rosetta. Annu Rev Biochem 77:363–382. [DOI] [PubMed] [Google Scholar]
21. Richardson JS, David C (2012) More data, more dimensions, more uses. Biomolecular Forms and Functions: A Celebration of 50 Years of the Ramachandran Map, 46–61. [Google Scholar]
22. Kabsch W, Sanders C (1983) Dictionary of protein secondary structure: pattern recognition of hydrogen‐bonded and geometrical features. Biopolymers 22:2577–2637. [DOI] [PubMed] [Google Scholar]
23. Leaver‐Fay A, O'Meara MJ, Tyka M, Jacak R, Song Y, Kellogg EH, … Kuhlman B. (2013) Chapter Six: Scientific benchmarks for guiding macromolecular energy function improvement In: A. E. K. B. T.‐M. in Enzymology (Ed.), Methods in Protein Design Academic Press, Vol. Volume 523, pp. 109–143. [DOI] [PMC free article] [PubMed] [Google Scholar]
24. Kuhlman B, Baker D (2000) Native protein sequences are close to optimal for their structures. Proc Natl Acad Sci USA 97:10383–10388. [DOI] [PMC free article] [PubMed] [Google Scholar]
25. Remmert M, Biegert A, Hauser A, Söding J (2001) HHblits: lightning‐fast iterative protein sequence searching by HMM‐HMM alignment. Nat Methods 9:173–175. [DOI] [PubMed] [Google Scholar]

Associated Data

This section collects any data citations, data availability statements, or supplementary materials included in this article.

Supplementary Materials

Supporting Information

Click here for additional data file.^{(31.9KB, docx)}

[pro2956-bib-0001] 1. Brooks BR, Brooks CL 3rd, Mackerell AD Jr, Nilsson L, Petrella RJ, Roux B, Won Y, Archontis G, Bartels C, Boresch S, Caflisch A, Caves L, Cui Q, Dinner AR, Feig M, Fischer S, Gao J, Hodoscek M, Im W, Kuczera K, Lazaridis T, Ma J, Ovchinnikov V, Paci E, Pastor RW, Post CB, Pu JZ, Schaefer M, Tidor B, Venable RM, Woodcock HL, Wu X, Yang W, York DM, Karplus M (2009) CHARMM: the biomolecular simulation program. J Comput Chem 30:1545–1614. [DOI] [PMC free article] [PubMed] [Google Scholar]

[pro2956-bib-0002] 2. Salomon‐Ferrer R, Case DA, Walker RC (2013) An overview of the Amber biomolecular simulation package. Wiley Interdiscip Rev Comput Mol Sci 3:198–210. [Google Scholar]

[pro2956-bib-0003] 3. Christen M, Hünenberger PH, Bakowies D, Baron R, Bürgi R, Geerke DP, Heinz TN, Kastenholz MA, Kräutler V, Oostenbrink C, Peter C, Trzesniak D, van Gunsteren WF. (2005) The GROMOS software for biomolecular simulation: GROMOS05. J Comput Chem 26:1719–1751. [DOI] [PubMed] [Google Scholar]

[pro2956-bib-0004] 4. Jorgensen WL, Maxwell DS, Tirado‐Rives J (1996) Development and testing of the OPLS all‐atom force field on conformational energetics and properties of organic liquids. J Am Chem Soc 118:11225–11236. [Google Scholar]

[pro2956-bib-0005] 5. Zhou H, Skolnick J (2011) GOAP: a generalized orientation‐dependent, all‐atom statistical potential for protein structure prediction. Biophys J 101:2043–2052. [DOI] [PMC free article] [PubMed] [Google Scholar]

[pro2956-bib-0006] 6. Shen M, Sali A (2006) Statistical potential for assessment and prediction of protein structures. Protein Sci 15:2507–2524. [DOI] [PMC free article] [PubMed] [Google Scholar]

[pro2956-bib-0007] 7. Kuhlman B, Dantas G, Ireton G, Varani G, Stoddard B, Baker D (2003) Design of a novel globular protein fold with atomic‐level accuracy. Science 21:1364–1368. [DOI] [PubMed] [Google Scholar]

[pro2956-bib-0008] 8. O'Meara MJ, Leaver‐Fay A, Tyka MD, Stein A, Houlihan K, DiMaio F, Bradley P, Kortemme T, Baker D, Snoeyink J, Kuhlman B (2015) Combined covalent‐electrostatic model of hydrogen bonding improves structure prediction with Rosetta. J Chem Theory Comput 11:609–622. [DOI] [PMC free article] [PubMed] [Google Scholar]

[pro2956-bib-0009] 9. Song Y, Tyka M (2011) Structure‐guided forcefield optimization. Proteins 79:1898–1909. [DOI] [PMC free article] [PubMed] [Google Scholar]

[pro2956-bib-0010] 10. Cornell WD, Cieplak P, Bayly CI, Gould IR, Merz KM, Ferguson DM, Spellmeyer DC, Fox T, Caldwell JW, Kollman PA (1995) A second generation force field for the simulation of proteins, nucleic acids, and organic molecules. J Am Chem Soc 117:5179–5197. [Google Scholar]

[pro2956-bib-0011] 11. Damm W, van Gunsteren WF (2000) Reversible peptide folding: dependence on molecular force field used. J Comput Chem 21:774–787. [Google Scholar]

[pro2956-bib-0012] 12. MacKerell AD, Feig M, Brooks CL (2004) Improved treatment of the protein backbone in empirical force fields. J Am Chem Soc 126:698–699. [DOI] [PubMed] [Google Scholar]

[pro2956-bib-0013] 13. Shapovalov MV, Dunbrack RL (2011) A smoothed backbone‐dependent rotamer library for proteins derived from adaptive kernel density estimates and regressions. Structure 19:844–858. [DOI] [PMC free article] [PubMed] [Google Scholar]

[pro2956-bib-0014] 14. Wang C, Schueler ‐, Furman O Baker D (2005) Improved side‐chain modeling for protein–protein docking. Protein Sci 14:1328–1339. [DOI] [PMC free article] [PubMed] [Google Scholar]

[pro2956-bib-0015] 15. Richardson J, Richardson D (1988) Amino acid preferences for specific locations at the ends of alpha helices. Science 242:1624. [DOI] [PubMed] [Google Scholar]

[pro2956-bib-0016] 16. Doig A, Baldwin R (1995) N‐and C‐capping preferences for all 20 amino acids in α‐helical peptides. Protein Sci 4:1325–1336. [DOI] [PMC free article] [PubMed] [Google Scholar]

[pro2956-bib-0017] 17. Stapley BJ, Doig AJ (1997) Free energies of amino acid side‐chain rotamers in alpha‐helices, beta‐sheets and alpha‐helix N‐caps. J Mol Biol 272:456–464. [DOI] [PubMed] [Google Scholar]

[pro2956-bib-0018] 18. Ting D, Wang G, Shapovalov M, Mitra R, Jordan M, Dunbrack RL (2010) Neighbor‐dependent Ramachandran probability distributions of amino acids developed from a hierarchical Dirichlet process model. PLoS Comput Biol 6:e1000763. [DOI] [PMC free article] [PubMed] [Google Scholar]

[pro2956-bib-0019] 19. Conway P, Tyka MD, DiMaio F, Konerding DE, Baker D (2014) Relaxation of backbone bond geometry improves protein energy landscape modeling. Protein Sci 23:47–55. [DOI] [PMC free article] [PubMed] [Google Scholar]

[pro2956-bib-0020] 20. Das R, Baker D (2008) Macromolecular modeling with rosetta. Annu Rev Biochem 77:363–382. [DOI] [PubMed] [Google Scholar]

[pro2956-bib-0021] 21. Richardson JS, David C (2012) More data, more dimensions, more uses. Biomolecular Forms and Functions: A Celebration of 50 Years of the Ramachandran Map, 46–61. [Google Scholar]

[pro2956-bib-0022] 22. Kabsch W, Sanders C (1983) Dictionary of protein secondary structure: pattern recognition of hydrogen‐bonded and geometrical features. Biopolymers 22:2577–2637. [DOI] [PubMed] [Google Scholar]

[pro2956-bib-0023] 23. Leaver‐Fay A, O'Meara MJ, Tyka M, Jacak R, Song Y, Kellogg EH, … Kuhlman B. (2013) Chapter Six: Scientific benchmarks for guiding macromolecular energy function improvement In: A. E. K. B. T.‐M. in Enzymology (Ed.), Methods in Protein Design Academic Press, Vol. Volume 523, pp. 109–143. [DOI] [PMC free article] [PubMed] [Google Scholar]

[pro2956-bib-0024] 24. Kuhlman B, Baker D (2000) Native protein sequences are close to optimal for their structures. Proc Natl Acad Sci USA 97:10383–10388. [DOI] [PMC free article] [PubMed] [Google Scholar]

[pro2956-bib-0025] 25. Remmert M, Biegert A, Hauser A, Söding J (2001) HHblits: lightning‐fast iterative protein sequence searching by HMM‐HMM alignment. Nat Methods 9:173–175. [DOI] [PubMed] [Google Scholar]

PERMALINK

Improving hybrid statistical and physical forcefields through local structure enumeration

Patrick Conway

Frank DiMaio

Abstract

Introduction

Results

Figure 1.

Figure 2.