doi: /j.jmb J. Mol. Biol. (2006) 357,


Efficient Restraints for Protein-Protein Docking by Comparison of Observed Amino Acid Substitution Patterns with those Predicted from Local Environment

Vijayalakshmi Chelliah, Tom L. Blundell and Juan Fernández-Recio*

Department of Biochemistry, University of Cambridge, 80 Tennis Court Road, Cambridge CB2 1GA, UK

*Corresponding author

The discovery that the functions of most eukaryotic gene products are mediated through multi-protein complexes makes the prediction of protein interactions one of the most important current challenges in structural biology. Rigid-body docking methods can generate a large number of alternative candidates, but it is difficult to discriminate the near-native interactions from the large number of false positives. Many different scoring functions have been developed for this purpose, but in most cases, experimental and biological information is still required for accurate predictions. We explore here the use of evolutionary restraints in evaluating rigid-body docking geometries. In order to identify potential interface residues we identify functional residues based on the comparison of observed amino acid substitutions with those predicted from local environment. The interface residues identified by this method are correctly located in 85% of the cases. These predicted interface residues are used to define distance restraints that help to score rigid-body docking solutions. We have developed the pydockrst software, which uses the percentage of satisfied distance restraints, together with the electrostatics and desolvation binding energy, to identify correct docking orientations. This methodology dramatically improves the docking results when compared to the use of energy criteria alone, and is able to find the correct orientation within the top 20 docking solutions in 80% of the cases.

© 2006 Elsevier Ltd. All rights reserved.
Keywords: protein-protein docking; distance restraints; environment-specific substitution tables

Present address: Dr J. Fernandez-Recio, Molecular Modelling and Bioinformatics, Parc Cientific de Barcelona, C/Joseph Samitier 1-5, Barcelona, Spain.

Abbreviations used: ESST, environment-specific substitution table; AIR, ambiguous interaction restraint; FGF, fibroblast growth factor; FGFR, fibroblast growth factor receptor; CAPRI, critical assessment of predicted interaction(s); ASA, accessible surface area.

E-mail address of the corresponding author: juan@mmb.pcb.ub.es

Introduction

Most biological processes, such as signal transduction, gene expression control, enzyme inhibition, antibody-antigen recognition and even the assembly of multi-domain proteins, involve the formation of specific multi-protein complexes. In spite of their biological and therapeutic importance, only slow progress has been made in defining the structures of such multi-protein assemblies. Indeed, a detailed knowledge of their structures at atomic level by X-ray crystallography or NMR proves to be challenging in most cases. As a consequence, the number of computer tools to generate such assemblies by docking the structures of individual subunits is growing fast. 1-7 Although the ultimate goal is fully automatic docking prediction, current approaches generate a large number of alternative docked solutions (false positives), among which the near-native structures are difficult to discriminate. That is why, when analysing the results of a docking run, it is often necessary to include all available biological information and mutational data about the binding mode, which helps to detect the near-native solutions with more accuracy. Indeed, in the recent CAPRI experiment (critical assessment of predicted interactions), many of the correct predictions were made as a result of the inclusion of available biochemical and mutational data.

In most cases, the inclusion of available information to sort, select or filter docking solutions is done by hand, quite often after a visual analysis of the docking solutions, on a case-by-case basis. The program HADDOCK (high ambiguity driven docking), developed by Dominguez and co-workers, 8 has recently been reported to make automatic use of biochemical and/or biophysical interaction data, such as chemical shift perturbation data obtained from NMR titration experiments or mutagenesis, to drive the docking predictions. In this approach, the information on the interacting residues is introduced as ambiguous interaction restraints (AIRs), classifying such residues into two types: active and passive. For instance, when using NMR titration data, the authors of HADDOCK defined the active residues as all solvent-accessible residues showing a significant chemical shift perturbation upon complex formation. The passive residues corresponded to the solvent-accessible residues that showed a less significant chemical shift perturbation and/or that were surface neighbours of the active residues. The total energy function used to calculate the structures was a sum of electrostatic, van der Waals, and AIR energy terms. This approach has been shown to integrate optimally restraints from experimental data in docking predictions, as the performance of HADDOCK in the CAPRI competition confirms.

However, biochemical or mutational information about protein-protein interactions is not always available, and in any case, it is difficult to apply at a large, proteomics scale. An additional source of information comes from sequence conservation that can be derived from multiple sequence alignments, based on the assumption that interface residues will be more conserved than the rest of the surface residues.
In general, it is difficult to identify the reasons for residue conservation, and in particular, to assess which residues are conserved because they are part of a functional interface. For instance, Mirny and Shakhnovich 9 performed an analysis on the molecular evolution of five of the most populated protein folds (immunoglobulin fold; oligonucleotide-binding fold; Rossmann fold; alpha/beta plait; and TIM barrels) in order to distinguish between functional and structural reasons for amino acid conservation. Moreover, several groups have reported the identification of interface residues from sequence and structural conservation in certain protein families. 10,11 These predicted functional sites have been used to filter the solutions generated by protein-protein docking. 12 However, such a protocol was applied only to protease-inhibitor protein-protein complexes, where the enzyme active site, which is more likely to be detected from sequence conservation analysis, is located in the protein-protein interface. Actually, a recent analysis in a larger and more varied set of protein-protein interfaces suggested that interface conservation is not sufficiently different from other surface patches to allow prediction of the interface by conservation alone. 13 The identification of interface residues from evolutionary information in non-enzyme-inhibitor protein-protein interactions is thus a challenging task.

The degree of conservation of amino acid residues has been shown to be strongly dependent on the environment in which they occur in the folded protein, and substitution tables that give the likely replacements of amino acids in particular local environments have been derived. 14,15 These environment-specific substitution tables (ESSTs) have been recently used to develop Crescendo, a method to distinguish restraints placed on substitutions due to protein structure from restraints deriving from functions mediated by interactions with other molecules. 16 For each position, a divergence score was defined by the difference between the predictions from the environment-specific substitution tables and the overall amino acid substitution pattern. The clusters of high-scoring alignment positions apparently subjected to these additional restraints in evolution correlated well with the functional sites in proteins defined by experimental methods. Crescendo was able to identify functional sites in a set of well-characterised protein families.

We now report the application of this functional site prediction method, Crescendo, to the detection of protein-protein interaction sites involved in mediating functions and to the selection of the correct docking solution from the many produced using physical-chemical parameters. The residues predicted in this way to be functional are used to define distance restraints in order to score protein-protein orientations generated by rigid-body docking. The restraints imposed by evolutionary information are highly ambiguous, i.e. they indicate whether a given residue is likely to be in the interface, but there is no information on the specific matching residues in the partner molecule. These types of restraints are similar to those used by HADDOCK, and we initially tested some of the Crescendo functional site predictions with that program. However, the restraints that can be imposed from evolutionary information are much less accurate in nature, so there is always the danger that incorrect restraints could guide the generation of incorrect docking conformations. In addition, the costly energy minimization with all the distance restraints makes the procedure computationally expensive.

A more efficient approach that exploits the pydockrst program is described here. First, rigid-body docking solutions are generated by FFT-based docking and then evaluated by the binding energy optimised for rigid-body docking landscapes.
We demonstrate the value of a complete scoring function, which includes an additional pseudo-energy term defined by the percentage of satisfaction of the restraints imposed by Crescendo. Our results depend on the quality of the solutions found by the FFT-based docking search, but the distance restraints imposed by the evolutionary information clearly help to discriminate the near-native docking solutions, especially in those cases with a significant number of false positives caused by the limitations of the rigid-body approach or by a poor energy description.

Results and Discussion

Prediction of interface residues using environment-specific substitution tables: comparison to known binding sites

We have used Crescendo, the functional site prediction method of Chelliah et al., 16 in order to identify the residues likely to be involved in a protein-protein interaction. This approach uses residue conservation to predict binding sites but has the advantage of distinguishing those amino acid restraints that result from retention of structure alone in divergently evolving orthologous families from those restraints resulting from functional interactions. For this benchmark, we tried to generate a collection of hetero-complexes as varied as possible, with different examples from every category: protease-inhibitor, cell cycle/signal transduction, hormone-receptor, etc. We did not include antibody-antigen complexes for reasons discussed later.

From the protein-protein complex sets both available in the literature 4,17,18 and also previously collected by us for internal analyses, we selected those complexes that passed the following criteria in order to run the Crescendo method. Each case was required to have a sufficient number of homologous sequences, and such sequences needed to be quite divergent (i.e. less than 80% identity), since the non-functional residues can also be conserved in the short term during evolution. Therefore, the number of sequences that were required for the method to run efficiently depended on the divergence of the sequences within the family. In our test set, the number of included sequences ranged from four in EPO receptor or ten in Raf to 128 in RGS or 135 in RhoGAP.
For instance, we excluded from our analysis the cyclophilin-HIV capsid complex (PDB code 1ak4) because, in spite of obtaining 78 homologous sequences for HIV capsid protein, only two of these sequences were less than 80% identical.

The major challenge of the Crescendo method with respect to protein-protein interactions is to choose those proteins that are true orthologues, i.e. that carry out an identical function in different organisms. Paralogues by definition will have evolved by gene duplication to carry out parallel functions, and therefore will have different binding partners. A careful analysis of the sequences collected by Blast may reveal the inclusion of obvious paralogues that can be removed. However, as the functions of the proteins whose sequences were collected are often not known, we have investigated their phylogeny by generating phylogenetic trees using the neighbour-joining method as implemented in TraceSuiteII, 19 and used the subgroup branch of the tree that has the largest number of sequences. In addition, most of the chosen families have crystal structures of the complex and of the unbound forms of the subunits.

Finally, we did not consider those cases with known multiple active sites, since our method currently focuses only on the largest site. For this reason, we excluded the actin-profilin complex (PDB code 2btf), in which the functional site predictions for actin would be mainly located in the ATP/ADP binding site. The predictions in actin would also be affected by the existence of other known binding sites to gelsolin (PDB code 1eqy, 1h1v), deoxyribonuclease I (PDB code 1atn), or tetramethylrhodamine-5-maleimide (PDB code 1j6z).

Table 1 shows the binding site predictions for the selected benchmark set of proteins and their success rates when compared to the known binding sites. The predicted interaction sites were initially formed by the largest cluster of residues with high divergence score values.
As can be seen in Table 1 (column 1 of Success (%)), in 60% of the cases the predicted sites were correctly located (that is, more than 50% of the predicted residues were in the interface). Interestingly, when we considered only those residues within the largest cluster and with positive Z-score values (column 2), the success rate increased (predicted sites correctly located in 70% of the cases). When we considered only the solvent-exposed residues (relative accessible surface area, ASA ≥ 7%) within the largest cluster (column 3), the success rate was also better than the original predictions (correct predictions in 80% of the cases). Finally, when we considered only those residues that satisfied the three criteria (largest cluster, positive Z-scores, and solvent-accessible), the results improved further: the binding sites were correctly located in 85% of the cases (column 4). The last column of Table 1 (% coverage) shows the percentage of real interface residues that are predicted by the residues that satisfied the three criteria (largest cluster, positive Z-scores, and solvent-accessible).

Only in one case, the chymotrypsin inhibitor, did the predicted functional site fail completely. This is not unexpected, as protease inhibitors have evolved to achieve highly specific binding and so there are few proteins that have evolved under the restraint of binding to a particular orthologue. Indeed, the interfaces of protease inhibitors are often characterized by hypervariability of amino acids, and some inhibitors even show greater variability in the interface residues than in the non-interacting residues. 24 Therefore, we do not expect good conservation of the interface residues among the members of the family and so the evolutionary information may not be helpful in this case. On the other hand, the good predictions for chymotrypsin are explained because the inhibitor binds in the active site that has been selected for in evolution.

Other cases where the functional site predictions are expected to fail are the antibody and major histocompatibility complex (MHC) molecules of the immune system, which rely on a high mutation rate at their binding surfaces in order to interact with foreign antigens. Neither has evolved to bind a particular antigen. In addition, the functional site predictions will not work on the antigen proteins either, since they have not been subjected to evolutionary pressure in order to bind the molecules of the immune system. Poor predictive results might also arise from the existence of alternative binding sites, some of which will be detected by Crescendo. This is the case of FGF2 (41.7% of correctly predicted residues), which also binds heparin sulphate.

Table 1. Prediction of interface residues from the comparison of observed amino acid substitution patterns and those predicted from local environment

Columns: Name; PDB subunit; PDB reference complex; Success (%) a for four residue selections (largest cluster c; largest cluster and Z-score > 0.0 d; largest cluster and ASA ≥ 7% e; largest cluster and Z-score and ASA f); Coverage (%) b (largest cluster and Z-score and ASA f).

Name           PDB subunit   PDB reference complex
Rap            1gua_A        1gua
Raf            1c1y_B        1gua
Rho            1tx4_B        1tx
RhoGAP         1ow3          1tx
Ras            1wq1_R        1wq
RasGAP         1wer          1wq
Galpha         1agr_A        1agr
RGS            1agr_E        1agr
Piliassembly   1pdk_A        1pdk
Fimbrial       1pdk_B        1pdk
Gh             3hhr_A        3hhr
Ghr            1hgu          3hhr
EPO            1eer_C        1eer
EPO receptor   1buy          1eer
FGF1           1e0o_B        1e0o
FGFR2          2afg          1e0o
FGF2           1fq9_C        1fq
FGFR1          2fgf          1fq
Chymotrypsin   5cha          1cbw
BPTI           1bpi          1cbw

a Percentage of predicted residues that are correctly located in the known interface.
b Percentage of the known interface residues that are correctly predicted.
c Predicted residues as defined by Crescendo as those within the largest cluster.
d Residues within the largest cluster and positive Z-score of the divergence score values.
e Residues within the largest cluster and accessible (relative ASA ≥ 7%).
f Residues within the largest cluster that are both accessible and have positive Z-score.
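The three selection criteria and the two measures reported in Table 1 (success, footnote a; coverage, footnote b) can be illustrated with a minimal Python sketch. This is not the Crescendo implementation: the `Residue` record, the function names and the example residues are hypothetical, and only the 7% relative ASA cutoff, the positive Z-score condition and the two percentage definitions are taken from the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Residue:
    """Hypothetical per-residue record; field names are assumptions."""
    name: str                 # residue label, e.g. "E87"
    in_largest_cluster: bool  # member of the largest high-divergence cluster
    z_score: float            # Z-score of the divergence score
    rel_asa: float            # relative accessible surface area, in percent


def select_predicted(residues, asa_cutoff=7.0):
    """Keep residues in the largest cluster with Z-score > 0 and ASA >= cutoff."""
    return {
        r.name
        for r in residues
        if r.in_largest_cluster and r.z_score > 0.0 and r.rel_asa >= asa_cutoff
    }


def success_and_coverage(predicted, interface):
    """Return (success %, coverage %) for two sets of residue labels.

    success  = predicted residues that lie in the known interface
    coverage = known interface residues that were predicted
    """
    hits = predicted & interface
    success = 100.0 * len(hits) / len(predicted) if predicted else 0.0
    coverage = 100.0 * len(hits) / len(interface) if interface else 0.0
    return success, coverage


if __name__ == "__main__":
    # Invented toy data, for illustration only.
    residues = [
        Residue("A1", True, 1.2, 30.0),
        Residue("A2", True, -0.5, 40.0),  # rejected: negative Z-score
        Residue("A3", True, 0.8, 3.0),    # rejected: buried
        Residue("A4", False, 2.0, 50.0),  # rejected: outside largest cluster
        Residue("A5", True, 0.3, 12.0),
    ]
    predicted = select_predicted(residues)
    print(success_and_coverage(predicted, {"A1", "A2", "A6", "A7"}))
```

Under the criterion used in the text, a site would count as correctly located when the success value exceeds 50%.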
Figure 1 shows the predicted interaction sites for the subunits of the FGF1-FGFR2, Rho-RhoGAP and Ras-RasGAP complexes. The largest contours of divergence score values for receptor and ligand molecules correspond quite well with the actual protein-protein interfaces in the X-ray complex structures. It is remarkable that for FGF1, in spite of having the heparin binding site quite conserved, the functional site prediction method detected the larger, more significant binding site to FGFR2.

Figure 1. Contour showing the functional site prediction for (a) FGFR2-FGF1 (PDB code 1e0o); (b) Rho-RhoGAP (PDB code 1tx4); and (c) Ras-RasGAP (PDB code 1wq1). The FGFR2, Rho and Ras receptor molecules are represented in green, with predictions shown in orange contour. The FGF1, RhoGAP and RasGAP ligand molecules are represented in purple, with predictions shown in blue contour.

Use of predicted interface residues as restraints to drive docking in HADDOCK

In order to evaluate the use of our functional site predictions in protein-protein docking, we have studied the interaction between the proteins FGF1 and FGFR2 with HADDOCK. 8 The coordinates of the unbound FGF1 (PDB code 2afg) and the bound FGFR2 (PDB code 1e0o) were used. Ambiguous interaction restraints (AIR) were introduced using the residues predicted as functional by Crescendo. 16 Active and passive residues were defined according to their Z-scores of the divergence score values. Solvent-accessible residues (relative ASA of side-chain ≥ 7%) within the largest cluster with positive Z-scores were considered as active residues and those with negative Z-scores were considered as passive. Thus, we selected nine active residues (E87, L89, E90, N92, Y94, N95, L131, L133, P134) and two passive residues (R88, H93) for FGF1; and 12 active residues (L166, H167, A168, V169, A171, V222, P223, D247, E250, R251, S252, H254) and three passive residues (P170, A172, V249) for FGFR2 (Figure 2). The program HADDOCK, with the restraints derived from Crescendo, was able to find a reasonable solution with an RMSD value of 4.0 Å, which ranked 11 after the first rigid-body docking step and rose to rank 1 after the final water refinement step. This shows that Crescendo is able to detect those residues that are important for the FGF1/FGFR2 binding, and therefore can be used to guide the docking and obtain successful results with HADDOCK.

The use of functional site predictions by Crescendo as restraints for HADDOCK yielded excellent results: a near-native solution was successfully ranked 1. Although this shows that the restraints worked well for this case, we were aware that HADDOCK was actually optimised for accurately defined restraints (from NMR or from mutational data), so it is likely that any small inaccuracy in the initial restraints might drive the docking generation to incorrect binding modes. In addition, HADDOCK is computationally quite expensive for our purpose of evaluating the use of the functional site prediction method for docking in a significant variety of cases.
For these reasons, we have developed pydockrst, a faster approach for the evaluation of docking solutions with the restraints derived from functional site predictions.

Figure 2. Active and passive residues used in HADDOCK for the FGF1/FGFR2 interaction, as defined using the predicted functional residues. The receptor and ligands are in blue and yellow. The active and the passive residues of the receptor are coloured green and purple, respectively. The active and the passive residues of the ligand are coloured red and cyan, respectively.

Use of interaction restraints to score rigid-body docking solutions: pydockrst

pydockrst is a computer algorithm for scoring rigid-body docking solutions according to the percentage fulfilment of certain user-defined distance restraints. If all the restraints are satisfied, i.e. all restraint residues are in the 6 Å vicinity of the partner molecule, a restraint energy value of -100.0 kcal/mol is added to the total energy. If no restraint is satisfied at all, the restraint energy is 0.0 kcal/mol. Intermediate levels of restraint satisfaction correspond to proportional restraint energy values.

In order to evaluate the pydockrst protocol, we compared its results with those of HADDOCK. Thus, we have applied pydockrst to the three examples used for HADDOCK benchmarking, using the same restraint residues. 8 For the rigid-body docking generation, we used two different FFT-based programs: ZDOCK and FTDOCK. 26 For each complex, the two sets of rigid-body docking solutions (total number of 2000 and 10,000, respectively) were evaluated with electrostatics and desolvation energy, as can be seen in Figures 3 and 4. The restraint residues were then used to establish a restraint energy value for all the docking solutions, and finally, the total scoring of the docking solutions included both the energy and the restraint values.
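The restraint term described above (0.0 kcal/mol when no restraint is satisfied, -100.0 kcal/mol when all are, and proportional in between) can be sketched in a few lines of Python. This is a hedged illustration, not the pydockrst code: the function names and the data layout are hypothetical, and the sketch assumes that a restraint residue counts as satisfied when any of its atoms lies within 6 Å of any atom of the partner molecule, a detail the text does not specify.

```python
import math


def is_satisfied(residue_atoms, partner_atoms, cutoff=6.0):
    """True if any residue atom is within `cutoff` angstroms of any partner atom.

    The atom-to-atom distance criterion is an assumption of this sketch.
    """
    return any(
        math.dist(a, b) <= cutoff for a in residue_atoms for b in partner_atoms
    )


def restraint_energy(receptor_restraints, ligand_restraints,
                     receptor_atoms, ligand_atoms,
                     cutoff=6.0, full_bonus=-100.0):
    """Restraint pseudo-energy (kcal/mol), proportional to satisfied restraints.

    `receptor_restraints` and `ligand_restraints` map a residue label to the
    list of (x, y, z) coordinates of its atoms in the docked pose.
    """
    checks = [
        is_satisfied(atoms, ligand_atoms, cutoff)
        for atoms in receptor_restraints.values()
    ] + [
        is_satisfied(atoms, receptor_atoms, cutoff)
        for atoms in ligand_restraints.values()
    ]
    if not checks or not any(checks):
        return 0.0
    return full_bonus * sum(checks) / len(checks)


def total_score(binding_energy, restraint_term):
    """Total score: electrostatics + desolvation energy plus the restraint term."""
    return binding_energy + restraint_term
```

Because the term only needs distances for an already generated pose, it can be evaluated for thousands of rigid-body solutions without any further minimization, which is the efficiency argument made in the text.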
The results of pydockrst are similar to those reported for HADDOCK for the same complexes. 8 For the EIN-HPr complex, pydockrst finds a near-native solution as the lowest-energy conformation (i.e. rank 1) both using ZDOCK and FTDOCK (Figures 3(a) and 4(a)). The near-native solutions found by these two methods were very similar (4.7 Å and 4.6 Å RMSD, respectively). The use of interaction restraint residues helped to improve the funnel landscape, but it did not make any difference in the rank of the near-native solution found by the two methods, as the binding energy values for this solution were excellent. It is interesting, however, to note that the use of restraints improved the ranking of the complex solution that was artificially added for comparison purposes. In that case, the energy value was slightly worse, probably because of some clashing of the side-chains of the unbound subunits when they are optimally superimposed onto the complex structure. The use of restraints helped to compensate for the poor energy value of this optimal, artificial docking solution.

Figure 3. Rigid-body docking energy landscapes for the complexes (a) EIN/HPr, (b) E2A/HPr and (c) gp120/CD4. The 2000 docking solutions generated by ZDOCK2.1 are rescored by (i) pydockser energy, equation (2) (top diagrams); (ii) restraints from NMR data, equation (3) (middle diagrams); and (iii) pydockser energy plus restraints, computed by pydockrst (bottom diagrams). The native orientation, generated by optimally superimposing the docking receptor and ligand molecules onto the X-ray structure of the native complex, has been included for informative purposes (open circle).

In the case of the E2A-HPr complex, the overall landscape of the docking solutions is good, but ZDOCK fails to find correct structures (Figure 3(b)). Fortunately, the artificially added complex conformation is ranked 1 after the inclusion of the restraint residues (Figure 3(b)). On the other hand, we observed that although FTDOCK generates a few more near-native solutions that have a similar total scoring value to the optimal complex structure, it also generates some false positives, so that the near-native solution (4.1 Å) ranks 4. Finally, in the case of the gp120-CD4 complex, both docking methods generate a number of near-native solutions, which are ranked 1 after total scoring (energy+restraints), as can be seen in Figures 3(c) and 4(c).
Interestingly, the use of a few restraint residues from mutational data helped to bring the near-native solution to rank 1 for this complex.

Figure 4. Rigid-body docking energy landscapes for the complexes (a) EIN/HPr, (b) E2A/HPr and (c) gp120/CD4. The 10,000 docking solutions generated by FTDOCK are rescored by (i) pydockser energy, equation (2) (top diagrams); (ii) restraints from NMR data, equation (3) (middle diagrams); and (iii) pydockser energy, equation (2) plus restraints, computed by pydockrst (bottom diagrams). A native orientation, generated by optimally superimposing the receptor and ligand docking molecules onto the X-ray structure of the native complex, has been included for informative purposes (open circle).

Use of predicted interface residues as restraints to score docking solutions with pydockrst

Having checked the performance of pydockrst when using interaction restraints from NMR and mutational data, we proceeded to evaluate the use of restraints from the predicted interaction residues obtained by Crescendo. We used FTDOCK as the rigid-body docking generator, as it provides a larger number of docking solutions, and it can be easily integrated as part of the pydock distribution. The dramatic effect of using restraints derived from the predicted interface residues on the scoring of the rigid-body docking landscapes can be checked in Table 2. For each complex, the rank of the near-native solution before (column headed Rank by energy) and after introducing the restraints (column headed Crescendo) is shown. In most of the cases, the drop in the rank number of the near-native solution is striking. In some cases, FTDOCK was not able to find a rigid-body docking solution close enough to the native complex (for instance, less than 10 Å RMSD). This is a consequence of the current way of generating docking solutions based on FFT sampling. Actually, when we manually included the native orientation formed by superimposing the docking receptor and ligand molecules onto the complex X-ray structure (Table 3), such native orientation ranked much lower (closer to rank 1) than the near-native docking solution in several cases (e.g. PDB codes 1pdk, 3hhr, 1eer, 1e0o), which suggests that a major limitation here is the sub-optimal sampling of the FFT-based method.

Table 2. Results from rigid-body docking and scoring by energy and distance restraints derived from sequence conservation predictions

Columns: Complex (Name, PDB); PDB files (Receptor, Ligand); Near-native docking solution (RMSD b, Rank by energy c); Rank by energy+restraints a (Crescendo).

Name                    PDB    Receptor   Ligand
Rap-Raf                 1gua   1guaA      1c1yB
Rho-RhoGAP              1tx4   1tx4B      1ow
Ras-RasGAP              1wq1   1wq1R      1wer
Galpha-RGS              1agr   1agrA      1agrE
Piliassembly-Fimbrial   1pdk   1pdkA      1pdkB
Gh-ghr                  3hhr   3hhrA      1hgu
Epo-epor                1eer   1eerC      1buy
Fgf1-fgfr2              1e0o   1e0oB      2afg
Fgf2-fgfr1              1fq9   1fq9C      2fgf
Chymotrypsin-BPTI       1cbw   5cha       1bpi

a Rank of the best near-native docking solution after scoring by energy+restraints. Restraint residues as defined by different criteria: Crescendo, automatically defined by Crescendo (largest cluster + accessible + Z-score > 0); set 1, Crescendo-defined residues that are located at the real interface; set 2, all the real interface residues; set 3, 50% of the real interface residues; set 4, 10% of the real interface residues; set 5, 50% of the real interface residues + same number of non-interface residues; set 6, 10% of the real interface residues + same number of non-interface residues; set 7, 10% of the real interface residues + ninefold number of non-interface residues.
b RMSD of the ligand Cα atoms of the best near-native docking solution with respect to the known complex structure, after superimposing the coordinates of the receptor molecule onto the known complex structure.
c Rank of the best near-native docking solution after scoring by electrostatics+desolvation energy alone.

The overall performance of the approach is shown in Figure 5. The number of cases where a near-native solution is found within a certain number of predictions increases dramatically when the restraints are included. As can be seen in Figure 5(a), when the binding energy (electrostatics+desolvation) alone is considered, we find a near-native docking solution within the 50 lowest energy docking poses in only one out of the ten cases (10% success). However, when the restraints from Crescendo are included, we find a near-native docking solution within the 50 lowest scoring docking solutions in five out of ten cases (50% success). As we have previously discussed, FTDOCK did not generate correct near-native geometries for some cases. Thus, if we add the known native orientation to the pool of docking solutions and they are scored again by the binding energy (electrostatics+desolvation), we find a near-native or native orientation within the 20 lowest energy docking solutions in only one out of the ten cases (10% success), as can be seen in Figure 5(b).
However, if we include the restraints from Crescendo in the scoring function, we find a near-native or native orientation within the 20 lowest scoring docking solutions in as many as eight out of the ten cases (80% success). We have also computed the total number of near-native docking orientations found within a certain number of low-energy docking solutions, as a way to evaluate the docking landscapes (Figure 5(c)). As can be seen, the introduction of restraints allows discrimination of a much larger number of near-native docking solutions.

We explored the question of whether the quality of the restraints affects the docking results. For the different cases, the accuracy of the restraint residues (percentage of those that are correctly located at the interface; Table 1, column Largest cluster and Z-score and ASA under Success (5)) shows very little correlation with the rank of the near-native solution (Table 2, column headed Crescendo). The coverage of the real interface by the restraint residues (Table 1, column Coverage (%)) also shows no correlation with the docking results. Except for BPTI, accuracy values range from 41.7 to 100.0, and coverage from 13.5 to [...]. It is possible that within this range of values the restraints are good enough to give quasi-optimal results for most of the cases, so that the small differences in the ranking of the near-native solution are ultimately defined by many other factors (binding affinity, binding mechanism, quality of the unbound structures, unbound-bound deformation, etc.).

Table 3. Results for docking and restraint energy scoring when the native docking orientation is included in the docking set

Complex name            PDB    Receptor   Ligand
Rap-Raf                 1gua   1guaA      1c1yB
Rho-RhoGAP              1tx4   1tx4B      1ow
Ras-RasGAP              1wq1   1wq1R      1wer
Galpha-RGS              1agr   1agrA      1agrE
Piliassembly-Fimbrial   1pdk   1pdkA      1pdkB
Gh-ghr                  3hhr   3hhrA      1hgu
Epo-epor                1eer   1eerC      1buy
Fgf1-fgfr2              1e0o   1e0oB      2afg
Fgf2-fgfr1              1fq9   1fq9C      2fgf
Chymotrypsin-BPTI       1cbw   5cha       1bpi

The values of the remaining columns (native docking orientation: RMSD (b) and rank by energy (c); rank by energy + restraints (a), headed Crescendo) are missing from this copy.
(a) Rank of the native docking orientation after scoring by energy + restraints. Restraint residues as defined by different criteria: Crescendo, automatically defined by Crescendo (largest cluster + accessible + Z-score > 0); set 1, Crescendo-defined residues that are located at the real interface; set 2, all the real interface residues; set 3, 50% of the real interface residues; set 4, 10% of the real interface residues; set 5, 50% of the real interface residues + same number of non-interface residues; set 6, 10% of the real interface residues + same number of non-interface residues; set 7, 10% of the real interface residues + ninefold number of non-interface residues.
(b) RMSD of the ligand Cα atoms of the native docking orientation with respect to the known complex structure, after superimposing the coordinates of the receptor molecule onto the known complex structure. The native docking orientation is manually generated by superimposing the docking receptor and ligand molecules onto the coordinates of the X-ray complex structure.
(c) Rank of the native docking orientation after scoring by electrostatics + desolvation energy alone. The native docking orientation, generated as described in the previous paragraph, is added to the pool of docking solutions and scored in the same way as the rest of them.

Therefore, in order to assess the general effect of the quality of the restraints in docking, we used different restraint sets in which the percentage of real interface residues (accuracy) and their coverage of the real interface were varied.
Restraint set 1 is formed by the residues defined by Crescendo and located at the real interface (accuracy, 100%; coverage, varies for each case, listed in Table 1).
Restraint set 2 is formed by all the real interface residues (accuracy, 100%; coverage, 100%).
Restraint set 3 is formed by 50% of the real interface residues, randomly selected (accuracy, 100%; coverage, 50%).
Restraint set 4 is formed by 10% of the real interface residues, randomly selected (accuracy, 100%; coverage, 10%).
Restraint set 5 is formed by 50% of the real interface residues, randomly selected (as in set 3), plus the same number of residues randomly selected from those not located at the interface (accuracy, 50%; coverage, 50%).
Restraint set 6 is formed by 10% of the real interface residues, randomly selected (as in set 4), plus the same number of residues selected from those not located at the interface (accuracy, 50%; coverage, 10%).
Restraint set 7 is formed by 10% of the real interface residues, randomly selected (as in sets 4 and 6), plus a ninefold number of non-interface residues (accuracy, 10%; coverage, 10%).

Figure 5. The overall performance of the reported docking approach. (a) Number of cases where a near-native solution (<10 Å RMSD from the native structure) is found within the N lowest energy predictions (N as defined on the abscissa). For three cases, the docking program FTDOCK did not find any solution less than 10 Å RMSD from the real complex structure. (b) This plot is similar to that of (a), but including the native orientation taken from the X-ray structure. (c) The total number of near-native solutions (<10 Å RMSD) found within the N lowest energy predictions. In blue is shown the performance of the initial binding energy, without any restraints. In red is shown the performance of pydockrst after including the restraints obtained from Crescendo. The results of using different restraint sets are also shown (in parentheses are the accuracy/coverage values of the restraints, that is, the percentage of restraint residues correctly located at the real interface and the percentage of real binding residues covered by the restraint residues): in orange, restraint set 1 (100/see Table 1); in black, restraint set 2 (100/100); in yellow, restraint set 3 (100/50); in grey, restraint set 4 (100/10); in green, restraint set 5 (50/50); in magenta, restraint set 6 (50/10); in cyan, restraint set 7 (10/10). See the main text for an explanation of the different restraint sets.

Table 2 and Figure 5 show the pydockrst results after applying the different restraint sets. As expected, if all interface residues are used as distance restraints in pydockrst, the docking results are excellent. Obviously, this is not a real-life situation, but it shows that the method makes optimal use of the restraints. Interestingly, if we manage to get (from sequence or evolutionary information, or from experimental mutation data) a set of restraints where at least 50% of them are located at the interface, and that represents at least 10% of the binding site (case of restraint set 6), the docking results obtained are much better than those obtained with no restraints at all. Moreover, with as few as 10% of the real binding-site residues, provided that there are no restraint residues outside the interface (case of restraint set 4), the docking results are impressive. It seems that for success in pydockrst it is less important to use a large number of the interface residues as restraints (high coverage) than to avoid too many false-positive restraints that are not located in the real interface (low accuracy). In the case of the restraints defined by Crescendo, the coverage was always over 10%, and the accuracy over 50% in most of the cases, which explains the good docking results obtained. Our analysis indicates that in a real-case scenario, a few residues known to be at the interface from evolutionary methods like Crescendo, or from other experimental data (mutational, NMR, etc.), would be enough for obtaining optimal docking results with pydockrst.

Figure 6. Rigid-body docking energy landscapes for (a) Galpha-RGS and (b) Rho-RhoGAP interaction. The 10,000 rigid-body docking solutions generated by FTDOCK are rescored by (i) the combined electrostatic and desolvation energy from pydockser (equation (2)); (ii) the restraint energy (equation (3)) derived from the functional site predictions; and (iii) the total energy (electrostatic + desolvation + restraint) calculated by pydockrst. A hypothetical optimal solution with native orientation, generated by superimposing the receptor and ligand docking molecules onto the X-ray structure of the native complex, has been included for informative purposes (open circle).

We can illustrate the effect of including the restraints from Crescendo with two examples: Galpha-RGS and Rho-RhoGAP. As can be seen in Figure 6, the docking energy landscapes based solely on the binding energy (electrostatics + desolvation) are far from desirable: a large number of incorrect docking solutions are found with lower energy than the near-native orientations (false positives). The use of distance restraints from the predicted binding residues (as implemented in pydockrst) dramatically improves the docking energy landscapes, which now become clearly funnel-shaped, favouring the near-native orientations over other docking solutions. As an example of the quality of the predictions, Figure 7 shows the lowest-scoring solution (i.e. rank 1) obtained by pydockrst for the two docking cases: Galpha-RGS and Rho-RhoGAP.
The predictions are quite similar to the known crystallographic complex structures (PDB codes 1agr and 1tx4, respectively).

Figure 7. (a) Crystal structure of Galpha-RGS (1agr) superimposed with the best solution (RMSD, 3.69 Å; rank 1). (b) Crystal structure of Rho-RhoGAP (1tx4) superimposed with the best solution (RMSD, 1.25 Å; rank 1). Red is the crystal structure and blue is the near-native solution. The restraint residues are shown in CPK, green for the receptor and yellow for the ligand.

The performance of pydockrst is comparable to that of HADDOCK. For the FGF1/FGFR2 interaction, the near-native solution obtained by HADDOCK (RMSD, 4.0 Å) was ranked 1, whereas the near-native solution obtained by pydockrst (RMSD, 6.41 Å) was ranked 20. However, when we manually included the native orientation taken from the complex structure, pydockrst ranked it as 1, which means that the sub-optimal ranking of the near-native solution in pydockrst was due to the poor quality of the solution generated by FTDOCK, not to the distance restraints from the functional site prediction or to the energy function. It seems that pydockrst is quite appropriate for this type of highly ambiguous distance restraints obtained from evolutionary information, with the added advantage that it is very much faster in computational time.

In contrast to the work of Aloy et al. [12], we have used the predicted functional residues to introduce distance restraints instead of distance constraints. That is, we have scored each docking solution according to the percentage of satisfied distance restraints, instead of removing those solutions that do not satisfy at least one distance constraint, as in the case of Aloy et al. [12]. Other significant differences are in the definition of the predicted functional residues, and in the larger and more varied set of protein-protein cases, which includes more than protease-inhibitor complexes.

In most of the cases, the structure of at least one of the subunits is taken from the complex structure. It would be ideal to use the structure of the free molecules for both subunits, but it is very difficult to find cases where the structures of both the free subunits and the complex are available and where they have enough homologous sequences to apply Crescendo. Such cases are mostly protease-inhibitor complexes, and that is why most of the studies so far have been centred on this type of complex. The cases here represent a variety of examples of biological interest for protein interaction prediction and therefore are a more representative, and probably more challenging, set than that used by many other researchers.

Conclusions

We have shown here that Crescendo, a functional site prediction method using the environment-specific substitution tables, can identify residues in protein-protein interfaces with great accuracy. These residues predicted from evolutionary information can be used to define interaction restraints that are very helpful for discriminating near-native solutions generated from rigid-body docking landscapes. Although the methodology described here has certain limitations for a fully automated application (e.g. it is not only necessary to find a significant number of orthologous sequences within the protein family, but these should also be divergent in sequence), we have demonstrated the potential of the approach. The application of this methodology on a proteomic scale would help to identify potential binding sites that can be confirmed by high-throughput experiments, and would also reduce the number of docking trials in a fast search of the near-native complex structure.
Materials and Methods

Prediction of interface residues using environment-specific substitution tables

The environment-specific substitution tables reflect the pattern of substitutions that is characteristic of an amino acid in a particular local environment, usually defined by local secondary structure, side-chain hydrogen bonding and solvent accessibility. Further restraints, arising from the binding of substrates, cofactors, subunits and other molecules, are not taken into account when deriving the environment-specific substitution tables. Thus, the substitution patterns of the functional residues are poorly predicted by the environment-specific substitution tables. Comparison of the substitution patterns derived from the environment-specific substitution tables with the amino acid substitutions that occur during evolution in families of orthologous proteins should therefore identify the functional residues, since they should be more conserved than predicted from the substitution tables.

The input of the functional site prediction method is a multiple sequence alignment (including sequences of both known and unknown structure). The orthologous structures and sequences of the representative structure (structure of interest) are collected as described by Chelliah et al. [16]. In this respect it is essential that the sequences should be true orthologues with respect to the function predicted. The amino acid distribution at each position t of the multiple sequence alignment, termed the observed substitution pattern, is calculated. For each of the protein sequences of known structure, the predicted substitution pattern of each of the 20 amino acids at each position t is derived from the environment-specific substitution table, by taking its residue type and the environment in which it occurs.
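As a rough illustration of the two patterns just described, the following minimal Python sketch computes the observed amino acid distribution at one alignment column and looks up a predicted distribution in a substitution table. The function names, the environment label `helix_buried`, the toy alignment and the one-row table `TOY_ESST` are all hypothetical and chosen only for illustration; they are not taken from Crescendo or from real environment-specific substitution tables.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def observed_pattern(alignment, t):
    """Observed amino acid frequencies at alignment column t (gaps are ignored)."""
    column = [seq[t] for seq in alignment if seq[t] in AMINO_ACIDS]
    counts = Counter(column)
    total = len(column)
    return {aa: (counts[aa] / total if total else 0.0) for aa in AMINO_ACIDS}


def predicted_pattern(residue, environment, esst):
    """Predicted substitution probabilities for a residue type in a given environment.

    `esst` is assumed to map (environment, residue) to a dict of 20 probabilities.
    """
    return esst[(environment, residue)]


# Hypothetical one-row table: a buried helical leucine that is mostly conserved.
toy_row = {aa: 0.02 for aa in AMINO_ACIDS}
toy_row["L"] = 0.62
TOY_ESST = {("helix_buried", "L"): toy_row}

# Toy alignment of four sequences; "-" marks a gap.
alignment = ["ALD", "ALE", "GLD", "-ID"]
observed = observed_pattern(alignment, 1)  # column 1 holds L, L, L, I
predicted = predicted_pattern("L", "helix_buried", TOY_ESST)
```

In this toy case the observed frequency of leucine at column 1 (0.75) is higher than the predicted value (0.62), which is the kind of excess conservation the method looks for.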
Taking the average over the number of structures available in the family, the predicted substitution pattern at each position for each of the 20 amino acids is calculated. The sequence-based score, termed the divergence score (converted to a Z-score), quantifies the overall difference, or divergence, between the observed and predicted substitution probabilities [16]. An overview of the functional site prediction method is shown in Figure 8.

Figure 8. Overview of the functional site prediction method. The broken lines indicate the multiple sequence alignment. The broken lines in brown, dark blue and light blue denote the representative structure (or structure of interest), homologous structures and homologous sequences of the representative structure, respectively.

The automatic prediction of functional residues on the structure of the interacting proteins is performed as follows. The divergence score values per residue are mapped onto a three-dimensional grid of points and contoured using kin3dcont as recently described [16]. A maximum Z-score is chosen such that the number of grid points above this Z-score is greater than or equal to [...]. Then the cut-off is set:

$$\text{cut-off} = \text{mean} - (\text{maximum } Z\text{-score})\,\sigma \qquad (1)$$

where $\sigma$ is the standard deviation. All grid points with Z-scores above the cut-off are clustered, and the clusters are ranked by size. Clusters separated by less than 5 Å are merged. A residue is predicted to be functional if it has an atom within 0.8 Å of a grid point in the largest cluster. These assignments are used to determine the percentage of correctly predicted functional residues.

Use of predicted interface residues in HADDOCK

We have included the predicted interface residues as restraints for running HADDOCK as follows. We selected the residues that are inside the largest cluster defined from the grid mapping of divergence scores (see the previous section). Among them, the active residues were defined as the solvent-accessible ones (relative ASA ≥ 7%) that had positive Z-scores for the divergence score. Passive residues were defined as the solvent-accessible ones inside the largest cluster that had negative Z-scores for the divergence score. We took 2000 docking solutions after the first rigid-body docking step in HADDOCK. Then, the 50 lowest energy solutions were refined in two consecutive steps, without and with explicit water molecules, as described in the original HADDOCK protocol. The rigid-body docking step took 50 h (for 2000 solutions) on ten Xeon 2.4 GHz CPUs. The subsequent refinement steps took 120 h (for 50 solutions without water refinement and 50 solutions with water refinement) on ten Xeon 2.4 GHz CPUs.
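The selection of active and passive residues described in this section can be sketched as a simple filter. The Python below is only an illustration under stated assumptions: the `Residue` record and its field names are hypothetical, membership of the largest cluster is taken as already computed, and a Z-score of exactly zero (a case the text does not address) is assigned to neither group.

```python
from dataclasses import dataclass

ASA_THRESHOLD = 7.0  # relative ASA in percent, the threshold stated in the text


@dataclass
class Residue:
    name: str                 # hypothetical label, e.g. "K69"
    relative_asa: float       # relative accessible surface area, in percent
    z_score: float            # Z-score of the divergence score
    in_largest_cluster: bool  # assumed to come from the grid clustering step


def classify(residues):
    """Split residues into (active, passive) name lists following the rules above."""
    active, passive = [], []
    for res in residues:
        # Only solvent-accessible residues inside the largest cluster are considered.
        if not res.in_largest_cluster or res.relative_asa < ASA_THRESHOLD:
            continue
        if res.z_score > 0:
            active.append(res.name)
        elif res.z_score < 0:
            passive.append(res.name)
    return active, passive


example = [
    Residue("A1", 35.0, 1.8, True),    # accessible, positive Z: active
    Residue("B2", 12.0, -0.4, True),   # accessible, negative Z: passive
    Residue("C3", 2.0, 2.5, True),     # buried: skipped
    Residue("D4", 50.0, 3.0, False),   # outside the largest cluster: skipped
]
active, passive = classify(example)
```

The same active set (accessible residues of the largest cluster with positive Z-scores) is what the text later uses as restraint residues in pydockrst.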
pydockrst: implementation of restraints to score rigid-body docking solutions

We have developed a protocol to evaluate rigid-body docking solutions, as part of the suite of docking programs called pydock, which will be described in more detail in a forthcoming publication. The rigid-body docking solutions are generated by the FFT-based programs ZDOCK and FTDOCK [26]. Then, the docking solutions are automatically evaluated with the module pydockser, by equation (2):

$$E_{\text{bind}} = 0.5\,E_{\text{ele}} + E_{\text{des}} \qquad (2)$$

where $E_{\text{ele}}$ is the binding electrostatics energy (Coulombic potential with distance-dependent dielectric constant $\varepsilon = 4r$, truncated to a maximum and minimum value of +1.0 and -1.0 kcal/mol, respectively); and $E_{\text{des}}$ is the desolvation energy upon binding, based on atomic solvation parameters previously optimised for rigid-body docking [4,18].

Then, the docking solutions are evaluated with the module pydockrst. Interaction restraints are defined by specific residues that are likely to be in the interface. These restraint residues can be defined from NMR data, mutational experiments, biological information, or from functional site predictions as described here (see the following sub-section). An interaction restraint by a given residue A is satisfied if at least one atom of the partner molecule can be found at ≤6 Å from the centre of mass of this residue A. For each docking solution, the method computes the percentage of satisfied restraints with respect to the total number of possible restraints, and this number is converted to energy by equation (3):

$$\text{restraint energy} = (-1.0\ \text{kcal/mol}) \times (\text{percentage of satisfied restraints}) \qquad (3)$$

The final energy is formed by the sum of the binding and the restraint energies ($E_{\text{total}} = E_{\text{bind}} + \text{restraint energy}$, as defined by equations (2) and (3)). The method is implemented in Python, uses the MMTK library [27], and is part of the pydock suite of docking tools (forthcoming publication).

In order to evaluate the performance of the pydockrst program, we have tested it on three cases that have been recently run in HADDOCK [8]. For consistency, we used in pydockrst the same 3D structures as those reportedly used in HADDOCK. The first docking case was the N-terminal domain of Enzyme I (EIN; free form PDB code 1ZYM) in complex with the histidine-containing phosphocarrier protein (HPr; free form PDB code 1POH). The structure of the EIN/HPr complex has been solved by NMR (PDB code 3EZA). The second docking case was the Enzyme IIA glucose (E2A; free form PDB code 1F3G) in complex with HPr. The structure of the E2A/HPr complex has been solved by NMR (PDB code 1GGR).
The third docking case was the HIV protein gp120 in complex with the protein CD4. The structures of the subunits were taken from the X-ray structure of the gp120/CD4 complex (PDB code 1GC1).

The interaction restraint residues that we used for pydockrst were the active residues used as AIRs for HADDOCK, as follows. For the EIN/HPr complex, we used 16 amino acids of EIN (E67, E68, K69, A71, I72, D82, E83, E84, G110, Q111, S113, A114, E116, E117, L118 and Y122) and nine amino acids of HPr (H15, T16, R17, Q21, K24, K49, Q51, T52 and G54), as derived from NMR data. For the E2A/HPr complex, the restraint residues were 11 amino acids of E2A (D38, V40, I45, V46, K69, F71, S78, E80, D94, V96 and S141) and nine amino acids of HPr (H15, T16, R17, A20, F48, Q51, T52, G54 and T56), as derived from NMR data. For the gp120/CD4 complex, the restraint residues were four from gp120 (D368, E370, W427 and D457) and seven from CD4 (K29, K35, F43, L44, K46, G47 and R59), as derived from mutational data.

In order to test the functional site predictions in pydockrst, the restraint residues were formed by the accessible (relative ASA of side-chain ≥ 7%) residues from the largest cluster, with positive Z-scores of the divergence score values.

Acknowledgements

We are grateful to Alexander Bonvin for the use of HADDOCK; to Zhiping Weng for the use of ZDOCK; and to M.J.E. Sternberg for the use of FTDOCK. V.C. is a recipient of Cambridge University Nehru and Overseas Research Scholarships. J.F.-R. is a recipient of a Marie Curie Research Fellowship from the European Commission.

References

1. Camacho, C. J. & Vajda, S. (2002). Protein-protein association kinetics and protein docking. Curr. Opin. Struct. Biol. 12.
2. Elcock, A. H., Sept, D. & McCammon, J. A. (2001). Computer simulation of protein-protein interactions. J. Phys. Chem. ser. B, 105.
3. Fernandez-Recio, J., Totrov, M. & Abagyan, R. (2002). Soft protein-protein docking in internal coordinates. Protein Sci. 11.
4. Fernandez-Recio, J., Totrov, M. & Abagyan, R. (2004). Identification of protein-protein interaction sites from docking energy landscapes. J. Mol. Biol. 335.
5. Smith, G. R. & Sternberg, M. J. (2002). Prediction of protein-protein interactions by docking methods. Curr. Opin. Struct. Biol. 12.
6. Sternberg, M. J., Gabb, H. A. & Jackson, R. M. (1998). Predictive docking of protein-protein and protein-DNA complexes. Curr. Opin. Struct. Biol. 8.
7. Wodak, S. J. & Janin, J. (1978). Computer analysis of protein-protein interaction. J. Mol. Biol. 124.
8. Dominguez, C., Boelens, R. & Bonvin, A. M. (2003). HADDOCK: a protein-protein docking approach based on biochemical or biophysical information. J. Am. Chem. Soc. 125.
9. Mirny, L. A. & Shakhnovich, E. I. (1999). Universally conserved positions in protein folds: reading evolutionary signals about stability, folding kinetics and function. J. Mol. Biol. 291.
10. Lichtarge, O. & Sowa, M. E. (2002). Evolutionary predictions of binding surfaces and interactions. Curr. Opin. Struct. Biol. 12.
11. Valdar, W. S. & Thornton, J. M. (2001). Protein-protein interfaces: analysis of amino acid conservation in homodimers. Proteins: Struct. Funct. Genet. 42.
12. Aloy, P., Querol, E., Aviles, F. X. & Sternberg, M. J. (2001). Automated structure-based prediction of functional sites in proteins: applications to assessing the validity of inheriting protein function from homology in genome annotation and to protein docking. J. Mol. Biol. 311.
13. Caffrey, D. R., Somaroo, S., Hughes, J. D., Mintseris, J. & Huang, E. S. (2004). Are protein-protein interfaces more conserved in sequence than the rest of the protein surface? Protein Sci. 13.
14. Overington, J., Donnelly, D., Johnson, M. S., Sali, A. & Blundell, T. L. (1992). Environment-specific amino acid substitution tables: tertiary templates and prediction of protein folds. Protein Sci. 1.
15. Overington, J., Johnson, M. S., Sali, A. & Blundell, T. L. (1990). Tertiary structural constraints on protein evolutionary diversity: templates, key residues and structure prediction. Proc. Roy. Soc. ser. B, 241.
16. Chelliah, V., Chen, L., Blundell, T. L. & Lovell, S. C. (2004). Distinguishing structural and functional restraints in evolution in order to identify interaction sites. J. Mol. Biol. 342.
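As an illustration of the pydockrst scoring described in Materials and Methods (the 6 Å restraint criterion and equations (2) and (3)), the following minimal Python sketch may help. It is not the pydockrst code: the function names and data layout are hypothetical, it uses only the standard library rather than MMTK, and it assumes that the percentage in equation (3) is expressed on a 0 to 100 scale.

```python
import math

CUTOFF = 6.0  # Å, distance criterion for a satisfied restraint (from the text)


def centre_of_mass(atoms):
    """Centre of mass of a residue given as a list of (mass, (x, y, z)) tuples."""
    total = sum(mass for mass, _ in atoms)
    return tuple(sum(mass * xyz[i] for mass, xyz in atoms) / total for i in range(3))


def restraint_satisfied(residue_atoms, partner_coords, cutoff=CUTOFF):
    """True if any partner atom lies within `cutoff` of the residue centre of mass."""
    com = centre_of_mass(residue_atoms)
    return any(math.dist(com, xyz) <= cutoff for xyz in partner_coords)


def percentage_satisfied(receptor_restraints, ligand_restraints,
                         receptor_coords, ligand_coords):
    """Percentage (0 to 100, an assumed scale) of restraints satisfied by one pose."""
    checks = [restraint_satisfied(res, ligand_coords) for res in receptor_restraints]
    checks += [restraint_satisfied(res, receptor_coords) for res in ligand_restraints]
    return 100.0 * sum(checks) / len(checks) if checks else 0.0


def total_energy(e_ele, e_des, pct_satisfied):
    """E_total = E_bind + restraint energy, following equations (2) and (3)."""
    e_bind = 0.5 * e_ele + e_des          # equation (2)
    e_restraint = -1.0 * pct_satisfied    # equation (3), kcal/mol
    return e_bind + e_restraint


# Toy pose: one restraint residue on each molecule (coordinates in Å).
receptor_restraints = [[(12.0, (0.0, 0.0, 0.0)), (12.0, (2.0, 0.0, 0.0))]]
ligand_restraints = [[(14.0, (20.0, 0.0, 0.0))]]
receptor_coords = [(0.0, 0.0, 0.0), (2.0, 0.0, 0.0)]
ligand_coords = [(5.0, 0.0, 0.0), (20.0, 0.0, 0.0)]

pct = percentage_satisfied(receptor_restraints, ligand_restraints,
                           receptor_coords, ligand_coords)
score = total_energy(e_ele=-10.0, e_des=-5.0, pct_satisfied=pct)
```

In this toy pose only the receptor restraint is satisfied, so `pct` is 50.0 and `score` is -60.0. Because every pose keeps a score, poses that violate some restraints are penalised rather than discarded, which is the distinction between restraints and constraints drawn in the text.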
