BIOINFORMATICS ORIGINAL PAPER

Vol. 21 no , pages doi: /bioinformatics/bti328
Structural bioinformatics

Disulfide connectivity prediction using secondary structure information and diresidue frequencies

F. Ferrè 1 and P. Clote 1,2
1 Department of Biology and 2 Department of Computer Science, Boston College, Chestnut Hill, MA 02467, USA

Received on November 12, 2004; revised on January 24, 2005; accepted on February 11, 2005
Advance Access publication March 1, 2005

ABSTRACT
Motivation: We describe a stand-alone algorithm that predicts disulfide bond partners in a protein given only the amino acid sequence. It uses a novel neural network architecture (the diresidue neural network) whose input consists of the symmetric flanking regions of N-terminus and C-terminus half-cystines, augmented with residue secondary structure (helix, coil, sheet) and evolutionary information. The approach is motivated by the observation of a bias in the secondary structure preferences of free cysteines and half-cystines, and by promising preliminary results we obtained using diresidue position-specific scoring matrices.
Results: As calibrated by receiver operating characteristic curves from 4-fold cross-validation, conditioning on secondary structure allows our diresidue neural network to perform as well as, and in some cases better than, the current state-of-the-art method. A slight drop in performance is seen when secondary structure is predicted rather than derived from three-dimensional protein structures.
Availability: clotelab/dianna
Contact: clote@bc.edu
Supplementary information: Supplementary tables and figures, and the complete list of PDB codes of monomers used, can be found at clotelab/

1 INTRODUCTION
Disulfide bonds (covalently bonded sulfur atoms from non-adjacent cysteine residues) play a critical role in protein structure, as noted by Anfinsen (1973), whose pioneering work provided the first evidence that the native state of a protein is the conformation that minimizes its free energy [1]. There are relatively good algorithms, with predictive accuracy somewhat better than that of secondary structure prediction algorithms, to determine whether a cysteine is in a reduced state (sulfur occurring in the reactive sulfhydryl group SH) or an oxidized state (sulfur covalently bonded) [2]. The early methods of Fiser et al. (1992) and Muskal et al. (1990) used sequence information alone to predict cysteine oxidation state. The former used a statistical method and achieved 71% accuracy, while the latter used a neural network and claimed 81% accuracy on a small test database. In 1999, Fariselli et al. designed a jury of neural networks, trained on flanking sequence information in neighborhoods of oxidized versus reduced cysteines. Their algorithm obtained an accuracy of 71%; when additionally trained on flanking evolutionary information (i.e. multiple sequence alignments of homologous proteins) the accuracy improved to 81%.

To whom correspondence should be addressed.
[1] Anfinsen reduced disulfide-bonded cysteines of bovine pancreatic ribonuclease by adding the denaturant urea; upon removal of the denaturant, the original disulfide bonds were reestablished, thus suggesting that the native state is in a global free energy minimum.
[2] Disulfide-bonded cysteines are known as half-cystines; oxidized cysteines may either be half-cystines or instead covalently bonded to a metallic ligand; reduced cysteines are also called free cysteines.

Fiser and Simon (2000) used multiple sequence alignments in a different manner to obtain an accuracy of 82%. Mucchielli-Giorgi et al. (2002) used a combination of perceptrons, trained on sets of proteins homogeneous in terms of their amino acid content, to obtain an accuracy of 84%. In the same year, Martelli et al. (2002) used a hybrid hidden Markov model and neural network system, reaching 88% accuracy [3]. Despite the success in predicting cysteine oxidation state, there have been fewer attempts to solve the problem of determining whether two half-cystines form a disulfide bond with each other, i.e. the disulfide bond partner prediction problem. Fariselli and Casadio (Fariselli and Casadio, 2001; Fariselli et al., 2002) designed a neural network to score the likelihood that given half-cystine pairs form a disulfide bond, using flanking sequence information, and subsequently applied the Edmonds-Gabow maximum weight matching algorithm to pair the most likely partners. A recent study by Vullo and Frasconi (2004) describes the successful application of recursive neural networks (Frasconi et al., 1998) to score undirected graphs that represent cysteine connectivity; evolutionary information is included to improve the prediction (in the form of vectors that label the graph vertices, i.e. the protein cysteines). This method is currently the state of the art. In this paper, we describe our method to determine which cysteines are involved in disulfide bonds and to list the disulfide bond partners of those cysteines.
Beginning with the earlier observation that there is a bias in the secondary structure preference of free cysteines and half-cystines (Petersen et al., 1999), we develop a novel neural network to learn the amino acid environments that make up the window contents of a symmetric region centered at partner half-cystines; the network architecture is designed to include in the training the signal that arises when using diresidue position-specific scoring matrices (PSSM). Our final stand-alone program, called DiANNA (for DiAminoacid Neural Network Application), uses a diresidue neural network on the symmetric flanking residues about both cysteines of a potential disulfide bond, along with the PSIPRED-determined (Jones, 1999) secondary structure of the residues and PSI-BLAST-determined (Altschul et al., 1997) evolutionary information. Finally, following Fariselli and Casadio (2001), the algorithm applies Ed Rothberg's implementation (C++ program wmatch) of the Edmonds-Gabow maximum weight matching algorithm (Gabow, 1973; Lovasz and Plummer, 1985) to assign disulfide bond partners, given the weighted complete graph whose nodes are half-cystines and whose weights are values output from the neural network. This novel approach, as calibrated using receiver operating characteristic (ROC) curves (Gribskov and Robinson, 1996), shows a marked improvement over the previous works of Fariselli and co-workers (Fariselli and Casadio, 2001; Fariselli et al., 2002), and is comparable to or better than the method of Vullo and Frasconi (2004).

[3] Direct comparison of these accuracies is somewhat misleading, as testing was performed on different datasets.

The Author. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oupjournals.org

2 SYSTEM AND METHODS
2.1 Data preparation
To test our method on the same dataset used in the earlier works on disulfide connectivity prediction (Fariselli and Casadio, 2001; Fariselli et al., 2002; Vullo and Frasconi, 2004), we selected 445 monomers from the SWISS-PROT database (Boeckmann et al., 2003) (release 39) having at least two and at most five intra-chain disulfide bonds, and for which structural data are available in the Protein Data Bank (PDB) (Berman et al., 2002). If one SWISS-PROT entry was associated with more than one PDB chain, we selected the one with the best resolution. To perform 4-fold cross-validation experiments, monomers were divided into four groups of approximately the same size, with the aim of minimizing the interset redundancy, as described in Fariselli et al. (2002). For the sake of comparison, we repeated the experiments described in Vullo and Frasconi (2004) on subsets of the whole dataset following the SCOP classification (Andreeva et al., 2004).
The majority (309 of 446, 69%) of protein chains in the Vullo and Frasconi dataset (2004) were unclassified in release 1.63 of SCOP, the version used by Vullo and Frasconi. In contrast, we used the latest release of SCOP (1.65) to classify our dataset, which was prepared following the procedure used by Vullo and Frasconi. The corresponding SCOP classification for our data is as follows: α (7.3%), β (25.1%), α + β (19%), α/β (7.7%), small proteins (29.3%), peptides (3.5%) and unclassified proteins (8.2%). The number of proteins in each subset is shown in Table 1. The list of PDB monomers used in Martelli et al. (2002) has been employed for the training of an oxidation state prediction tool implemented in the DiANNA web server. Finally, we used the PDBselect25 dataset (Hobohm and Sander, 1994) to test our method on an unbiased list of proteins that includes monomers that may or may not have disulfide bonds. Secondary structure and cysteine oxidation state annotations are derived from the Dictionary of Secondary Structure of Proteins (DSSP) of Kabsch and Sander (1983). We clustered the different DSSP secondary structure notations into three classes: (1) helix (H): alpha helix, 3/10 helix and pi helix; (2) coil (C): hydrogen bonded turn, bend and coil; (3) sheet (E): beta-bridge and extended strand. We checked the validity of disulfide bond annotation by computing the distance between sulfur atoms of annotated half-cystine partners in the dataset (average distance 2.04 Å, SD 0.105; maximum distance 2.93 Å).

Table 1. Protein monomer dataset
Columns: B = 2, B = 3, B = 4, B = 5, B = (2,...,5)
Rows: α, α/β, α + β, β, Small proteins, Peptides, Unclassified proteins, Total
[table values missing in source]
Number of monomers having a fixed number of disulfide bonds B (minimum 2, maximum 5), and belonging to different SCOP folds.

2.2 Machine learning
We applied two machine learning methods, neural networks (Stuttgart Neural Network Simulator, SNNS, de/snns/) and position-specific scoring matrices, to calibrate the effect of considering secondary structure in disulfide bond prediction. Throughout the following sections, P and N represent a training file of positive and negative examples, respectively, of sequence length 2w, e.g. two 11-mers corresponding to the symmetric cysteine-centered window contents of size w = 2n + 1 = 11 (i.e. each cysteine together with the n residues N-terminal and the n residues C-terminal to it, where n = 5). Let P denote the pairs of window contents for all the half-cystines involved in an intra-chain bond, and let N denote the corresponding set of possible pairs of cysteines (intra-chain half-cystines, interchain half-cystines and free cysteines) that are not intra-chain disulfide bonds. A true positive is a half-cystine pair with a known bond that is correctly predicted as such, while a false negative is a known disulfide bond that is predicted not to be one. Accordingly, a true negative is a cysteine pair correctly predicted not to form a disulfide bond, while a false positive is a pair of cysteines that is not a bond though predicted as such. Letting TP, TN, FP and FN denote, respectively, the number of true positives, true negatives, false positives and false negatives, recall the definitions of accuracy, or $Q_2$:

$$Q_2 = \frac{TP + TN}{TP + TN + FP + FN} = \frac{TP + TN}{|P| + |N|},$$

the sensitivity, TP rate (tpr), or $Q_c$:

$$Q_c = \frac{TP}{|P|} = \frac{TP}{TP + FN},$$

the specificity, or $Q_{nc}$:

$$Q_{nc} = \frac{TN}{|N|} = \frac{TN}{TN + FP},$$

and Matthews correlation coefficient:

$$\mathrm{Mcc} = \frac{TP \cdot TN - FP \cdot FN}{\sqrt{(TP + FN)(TP + FP)(TN + FP)(TN + FN)}}.$$

The false positive rate, FP rate (fpr), is 1 minus specificity. Finally, $Q_p$ is the fraction of correctly assigned connectivity patterns, i.e. the fraction of chains for which all the predictions are correct (FP = 0 and FN = 0). To quantify the sensitivity/specificity trade-off of various methods, we considered 4-fold cross-validation.
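The performance measures defined above can be written down directly. The following is a minimal Python sketch, not the authors' code; the function names `binary_metrics` and `q_p` are illustrative, and returning 0 for the correlation coefficient when its denominator vanishes is an assumed convention.

```python
import math

def binary_metrics(tp, tn, fp, fn):
    """Accuracy (Q2), sensitivity (Qc), specificity (Qnc), fpr and Mcc
    from the four confusion-matrix counts."""
    p = tp + fn          # number of positive examples, |P|
    n = tn + fp          # number of negative examples, |N|
    denom = math.sqrt((tp + fn) * (tp + fp) * (tn + fp) * (tn + fn))
    return {
        "acc": (tp + tn) / (p + n),
        "sen": tp / p,
        "spe": tn / n,
        "fpr": fp / n,   # 1 minus specificity
        # Assumed convention: Mcc is reported as 0 when the denominator is 0.
        "mcc": (tp * tn - fp * fn) / denom if denom else 0.0,
    }

def q_p(per_chain_errors):
    """Fraction of chains whose connectivity is entirely correct.
    per_chain_errors is a list of (FP, FN) counts, one pair per chain."""
    perfect = sum(1 for fp, fn in per_chain_errors if fp == 0 and fn == 0)
    return perfect / len(per_chain_errors)
```

For instance, `binary_metrics(8, 85, 5, 2)` gives an accuracy of 0.93 and a sensitivity of 0.8, and `q_p([(0, 0), (1, 1), (0, 0), (0, 2)])` gives 0.5.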
2.3 Generalized weight matrices
Weight matrices [4] can be constructed using the relative frequencies of the 20 amino acids in different positions of a set of training instances, and then used to score a test instance. Define the background set $B = P \cup N$. For the set $X \in \{P, N, B\}$ and amino acid a, let num(X, i, a) denote the number of occurrences of a in X in position i, and let f(X, i, a) denote the relative (monoresidue) frequency of a in X at position i, i.e.

$$f(X, i, a) = \frac{\mathrm{num}(X, i, a)}{|X|}.$$

To avoid numerators equal to 0, we add pseudocounts (Karplus, 1995), i.e. for fixed $c > 0$ (in this work we used c = 0.2):

$$f(X, i, a) = \frac{\mathrm{num}(X, i, a) + c}{|X| + 20c}.$$

For amino acid sequence $s_1, \ldots, s_n$ and $1 \le i \le n$, define the positional log odds score:

$$\sigma(i, a) = \log_2\left(\frac{f(P, i, a)}{f(B, i, a)}\right).$$

Once the positional log odds scores are computed for a training set of sequences, the score of a test sequence can be obtained as the sum of log odds scores

$$\tau(s) = \sum_{1 \le i \le n} \sigma(i, s_i).$$

We denote this monoresidue weight matrix method as WM1. As reported in Zhang and Marr (1993) for the first-order Markov case and in Clote (2003) for the general case, the notion of monoresidue scoring matrix can be extended immediately to the situation of not necessarily consecutive k-tuple frequencies, for any fixed k>1. Under the assumption of positional independence, which often does not hold for biological sequence data, WM1 is provably the maximum-likelihood estimator (Clote and Backofen, 2000). Nevertheless, in some cases experimental evidence suggests that protein sequences can be more adequately modeled using diresidue (i.e. with k = 2), rather than monoresidue weight matrices (Bulyk et al., 2002). For this reason, in this work we used diresidue weight matrices, defined as follows. For set $X \in \{P, N, B\}$ of length n sequences, for positions $1 \le i < j \le n$ and amino acids a, b, let num(X, i, j, a, b) denote the number of occurrences of amino acid a in position i when amino acid b is found in position j, and let f(X, i, j, a, b) denote the relative (diresidue) frequency; hence we define:

$$f(X, i, j, a, b) = \frac{\mathrm{num}(X, i, j, a, b) + c}{|X| + 400c}.$$

Define diresidue positional log odds

$$\sigma_2(i, j, a, b) = \log_2\left(\frac{f(P, i, j, a, b)}{f(B, i, j, a, b)}\right)$$

and diresidue score

$$\tau_2(s) = \sum_{1 \le i < j \le n} \sigma_2(i, j, s_i, s_j).$$

We denote this diresidue weight matrix method as WM2.

[4] In the literature, weight matrices are also known as position-specific scoring matrices (PSSM) or alternatively as profiles. In this paper, we sometimes denote collectively the monoresidue and the diresidue weight matrices, explained later in the text, by PSSM.

2.4 Neural networks
We used the Stuttgart neural network simulator (SNNS, informatik.uni-tuebingen.de/snns/), and wrote Python programs as well as some batchman (SNNS) code to train and test a variety of neural net architectures implemented in SNNS. All neural networks are layered, feed-forward, fully connected nets (with the exception of the diresidue layer, described below), and trained by momentum back-propagation up to a maximum number of cycles. To avoid overfitting we checked the error progression on a validation set (one-fifth of the monomers from the training set of each cross-validation step, chosen randomly). In the unary representation of the neural network input encoding, given two size-w windows centered at the N-terminus and the respective C-terminus half-cystine, each window residue is represented by a 20 bit vector; each of the 20 bits is set to zero, except the one that is assigned to the given amino acid type. To include evolutionary information in the input encoding, we ran PSI-BLAST (Altschul et al., 1997) (three iterations, against the non-redundant SWISS-PROT + TrEMBL database of sequences) on the input sequence to produce a profile, i.e. frequencies f(i, a), for each of the 20 amino acids a and each position $1 \le i \le 2w$, obtained from the multiple sequence alignment of homologous proteins. The resulting input to our neural net consisted of 2w × 20 frequencies. To include secondary structure information, we extracted the DSSP secondary structure annotations of each of the 2w residues, and added to the evolutionary encoding vectors 2w × 3 additional binary inputs, which encode in unary the secondary structure (H, C, E) of each of the 2w residues [5]. The dataset of positive examples contained all the disulfide bonds annotated in the DSSP files, represented as previously described. The negative dataset contained all possible free and half-cystine pairs of each sequence that are not disulfide bonds.
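The unary encoding with secondary structure can be illustrated as follows. This is a minimal Python sketch under stated assumptions, not the authors' code: the names `encode_window` and `encode_pair` are hypothetical, the amino acid ordering is arbitrary, each residue's 20 + 3 bits are stored contiguously, window positions that fall outside the sequence are encoded as all zeros, and unrecognized DSSP symbols are treated as coil. The DSSP letters follow the standard DSSP codes, grouped into the three classes described in Section 2.1. For the evolutionary encoding, the 20 one-hot bits would be replaced by the PSI-BLAST profile frequencies.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"   # assumed ordering of the 20 residue types

# Standard DSSP codes reduced to helix (H), coil (C) and sheet (E).
DSSP_TO_3 = {"H": "H", "G": "H", "I": "H",             # alpha, 3/10 and pi helix
             "T": "C", "S": "C", " ": "C", "-": "C",   # turn, bend, coil
             "B": "E", "E": "E"}                       # beta-bridge, extended strand
SS_UNARY = {"H": [1, 0, 0], "C": [0, 1, 0], "E": [0, 0, 1]}

def encode_window(seq, dssp, center, n=5):
    """Encode the size w = 2n + 1 window centered at index `center`
    as w * 23 binary inputs (20 residue bits + 3 structure bits each)."""
    vec = []
    for pos in range(center - n, center + n + 1):
        aa_bits, ss_bits = [0] * 20, [0] * 3
        if 0 <= pos < len(seq):                # assumption: zero padding at chain ends
            idx = AMINO_ACIDS.find(seq[pos])
            if idx >= 0:
                aa_bits[idx] = 1
            ss_bits = list(SS_UNARY[DSSP_TO_3.get(dssp[pos], "C")])
        vec.extend(aa_bits + ss_bits)
    return vec

def encode_pair(seq, dssp, i, j, n=5):
    """Concatenate the windows of the N-terminus cysteine i and the
    C-terminus cysteine j into one input vector of 2w * 23 values."""
    return encode_window(seq, dssp, i, n) + encode_window(seq, dssp, j, n)
```

With n = 5 the resulting vector has 2 × 11 × 23 = 506 entries, one candidate cysteine pair per vector.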
Half-cystines involved in interchain disulfide bonds were considered as free cysteines. Following a standard machine learning procedure, we repeatedly resampled the positive training set, so that the resulting sizes of the (amplified) positive training set and the (original) negative training set were equal. Observe that the positive test set was unchanged, and hence disjoint from the positive training set. Of several architectures tested, one (with two hidden layers containing five and two units, respectively) showed the best performance. There was a single output unit, and we considered as positive those output scores higher than a threshold (0.5). We will refer to this net throughout this paper as NN2. Owing to the presence of a diresidue signal, we additionally designed an unusual neural network architecture. Considering the case of an encoded input containing secondary structure information, thus having w × 23 input units, we designed a first hidden layer containing $\binom{w}{2} = w(w-1)/2$ units, one for each pair of positions $1 \le i < j \le w$, with connections to the input units representing the profile for the residues at positions i, j and the secondary structures at those positions. Thus each of the w(w − 1)/2 hidden units in the first hidden layer (the diresidue layer) is connected to 2(20 + 3) = 46 input units. We designed two different diresidue neural networks, namely dnn1 and dnn2, the former having the diresidue layer units connected to one output unit, and the latter having a second hidden layer, containing five units, all fully connected with those of the first hidden layer, and then fully connected to the single output unit.

[5] For example, H is encoded as 100, C as 010 and E as 001.

2.5 Weighted match
Disulfide connectivity can be described as a graph whose nodes are the half-cystines and whose edges join pairs of nodes. Connectivity prediction, i.e. prediction of disulfide bond partners, is obtained by applying the Edmonds-Gabow maximum weight matching algorithm (Gabow, 1973; Lovasz and Plummer, 1985), as implemented in wmatch by Ed Rothberg (C++ program wmatch, matching/weighted), to the graph whose nodes are the putative half-cystines and whose edges, which join pairs of nodes, are weighted by either the PSSM (WM1 or WM2) positional log odds scores or the output of the neural net in the disulfide bond prediction module. PSSM scores, which may be negative (negative values are not accepted by wmatch), are scaled to the interval (0,...,100). A different version of the connectivity prediction module that uses a greedy approach (i.e. the bonds are chosen starting from the one with the highest predicted score) was tested, but leads to poorer results (data not shown).

3 IMPLEMENTATION AND DISCUSSION
The amino acid environment of half-cystines shows peculiar sequence characteristics that allow the discrimination between half-cystines and free cysteines using machine learning (Fiser et al., 1992; Fariselli et al., 1999; Fiser and Simon, 2000). Moreover, the secondary structure conformation assumed by the cysteines and their neighboring residues is remarkably different when comparing disulfide-bonded versus free cysteines (Petersen et al., 1999). Tables 2 and 3 show the secondary structure conformation frequencies detected in the analyzed dataset and computed using DSSP annotations. These values are to some extent different from those of Petersen et al. (1999), but this could be due to a different (and in our case larger) dataset.

Table 2. Cysteine secondary structure frequencies
Columns: Secondary structure, All residue, Half-cystine, Free cysteine
Rows: Helix, Sheet, Coil
[table values missing in source]
Frequencies of secondary structures, computed on the whole dataset using DSSP annotations.

Table 3. Secondary structure of half-cystine neighbors
Secondary structure    DSSP (%)
H H                     9.3
H C                    10.6
H E                     1
C H                    10.1
C C                    36.4
C E                     9.9
E H                     1.2
E C                    14.1
E E                     7.4
Relative frequency of secondary structures flanking the N-terminus and the respective C-terminus half-cystine in a disulfide bond in a symmetric size 11 window (i.e. the five residues upstream and the five downstream of each half-cystine), as tabulated from DSSP. A secondary structure is assigned to the five C-terminal and the five N-terminal residues of each half-cystine using a majority decision (i.e. counting which secondary structure is prevalent in each group of five residues). Note the remarkable asymmetry of the coil-sheet and sheet-coil frequencies.

Considering the secondary structure of pairs of half-cystines known to form a disulfide bond, some combinations are preferred, presumably indicating a sort of structural complementarity (Table 4). Therefore, we explored the possibility of using sequence and secondary structure information to infer the protein disulfide connectivity, using different machine learning approaches. Figure 1 (left panel) and Table 5 show the performance of the feed-forward neural network NN2 (trained with momentum back-propagation and described in the Methods section) using different input encodings.
The inclusion of secondary structure information leads to a marked improvement, as does the inclusion of the 20 frequencies obtained in a multiple sequence alignment for each given residue of the window. [This step is known as incorporating evolutionary information and, since the seminal work of Rost and Sander (1993), has been shown to substantially increase the accuracy of neural networks for protein secondary structure prediction; similar improvements obtained using evolutionary information in predicting cysteine oxidation state and disulfide connectivity have been demonstrated (Fariselli et al., 1999; Vullo and Frasconi, 2004).] The use of secondary structure information leads to a clear improvement when using either the unary or the evolutionary encoding of the input windows. This is even more evident when looking at ROC curves, comparing the sensitivity/specificity trade-off for different inputs used to train NN2 (Fig. 1).

Table 4. Secondary structure of disulfide bonds showing frequencies of secondary structures of disulfide bond-forming half-cystines
Columns: C-terminal secondary structure, N-terminal secondary structure, Percentage detected, Percentage expected, 11 residues window (%)
Rows: H H, H C, H E, C H, C C, C E, E H, E C, E E
[table values missing in source]
The expected frequency for pairs of secondary structures, one for each half-cystine, assuming independence of each half-cystine, is computed as the product of the corresponding frequencies from Table 2. The detected frequency is computed using DSSP annotations. For example, in 9.1% of the cases in the dataset the C-terminal half-cystine is in sheet conformation, while the N-terminal is in coil conformation (this is the value reported in the Percentage detected column). Since the frequency of coil half-cystine in the dataset is 0.46, and the frequency of sheet half-cystine is 0.33 (as reported in Table 2), one can expect the frequency of bonds in which one half-cystine is a coil and the other a sheet to be 0.46 × 0.33 = 0.152 (15.2%). This is the expected frequency, which is different from the detected frequency; moreover, the frequency of the bonds in which the N-terminal half-cystine is in sheet conformation and the C-terminal is in coil conformation is remarkably different (19.3%). In the last column, a secondary structure is assigned to the 11 residue window centered about the half-cystine, using a majority decision.

The PSSM can be constructed using the relative frequencies of the 20 amino acids in different positions of the cysteine-centered symmetric window. These are then used to score an input putative disulfide bond. A monoresidue scoring matrix (WM1) was computed and tested in a 4-fold cross-validation experiment, with poor results (Fig. 1, right panel). Monoresidue weight matrices are provably the maximum-likelihood estimator (Clote and Backofen, 2000), but they imply positional independence (i.e. the frequency of an amino acid a in position i of a sequence is independent of the frequency of b in position j), which may not hold for biological sequences. The notion of monoresidue scoring matrix can be easily extended to k-tuple frequencies, for any fixed k>1 (Zhang and Marr, 1993; Clote, 2003). Moreover, there is experimental evidence from Bulyk et al. (2002) that protein-nucleotide binding in zinc fingers is more adequately modeled using diresidue (i.e. with k = 2), rather than monoresidue weight matrices. For these reasons, we applied diresidue weight matrices (WM2) to the same dataset used for WM1. ROC curves comparing the sensitivity/specificity trade-off for generalized weight matrix methods (Fig. 1) reveal a diresidue frequency signal that helps distinguish disulfide bonds from non-disulfide-bonded cysteine pairs.
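The two weight matrix methods compared here can be sketched in a few lines of Python. This is an illustrative sketch, not the authors' implementation: the function names are hypothetical, the training examples are assumed to be equal-length strings (the two concatenated windows of a cysteine pair), and the diresidue pseudocount denominator |X| + 400c assumes that c is added to each of the 400 residue pairs.

```python
import math
from collections import Counter
from itertools import combinations

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def train_wm1(pos, neg, c=0.2):
    """Monoresidue log odds sigma[i][a] of positives against background B = P u N."""
    background = pos + neg

    def freq(seqs, i, a):
        num = sum(1 for s in seqs if s[i] == a)
        return (num + c) / (len(seqs) + 20 * c)

    return [{a: math.log2(freq(pos, i, a) / freq(background, i, a))
             for a in AMINO_ACIDS}
            for i in range(len(pos[0]))]

def score_wm1(sigma, s):
    """tau(s): sum of positional log odds."""
    return sum(sigma[i][a] for i, a in enumerate(s))

def train_wm2(pos, neg, c=0.2):
    """Diresidue log odds sigma2(i, j, a, b), returned as a function."""
    background = pos + neg

    def pair_counts(seqs):
        counts = Counter()
        for s in seqs:
            for i, j in combinations(range(len(s)), 2):
                counts[(i, j, s[i], s[j])] += 1
        return counts

    cp, cb = pair_counts(pos), pair_counts(background)

    def sigma2(i, j, a, b):
        f_pos = (cp[(i, j, a, b)] + c) / (len(pos) + 400 * c)
        f_bg = (cb[(i, j, a, b)] + c) / (len(background) + 400 * c)
        return math.log2(f_pos / f_bg)

    return sigma2

def score_wm2(sigma2, s):
    """tau_2(s): sum of diresidue log odds over all position pairs i < j."""
    return sum(sigma2(i, j, s[i], s[j])
               for i, j in combinations(range(len(s)), 2))
```

On a toy training set such as `pos = ["ACCA", "ACCA", "ACCC"]` and `neg = ["GCCG", "GCCT"]`, both scores rank `"ACCA"` above `"GCCG"`; real inputs would be the 2w-residue window pairs described in the Methods section.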
WM2 yields a 2.2-fold improvement in the true positive rate over WM1 trained with the same dataset at 10% false positive rate. Any weight matrix method, including WM2, can be turned into predictive software by allowing the user to stipulate a tolerated false positive rate fpr, and then having the program determine the corresponding true positive rate tpr and threshold weight t for the ROC point (fpr, tpr) by table lookup in the precomputed ROC table.

Fig. 1. Disulfide connectivity prediction ROC curves for different input encodings and for PSSM. Left panel: ROC curves for the NN2 neural network performance in a 4-fold cross-validation experiment. Different inputs were tested: unary representation, unary representation together with secondary structure information, evolutionary information, and evolutionary and secondary structure information. Right panel: ROC curves for WM1 and WM2 position-specific scoring matrices on the whole dataset in a 4-fold cross-validation experiment.

Table 5. Neural network performance using different input encodings
Columns: Unary, Unary + secondary structure, Evolutionary, Evolutionary + secondary structure
Rows: Acc, Sen, Spe, Mcc
[table values missing in source]
Performance of the NN2 neural net in a 4-fold cross-validation experiment using different input encodings: unary; unary with secondary structure information; evolutionary; evolutionary with secondary structure information. Performance is evaluated by means of accuracy (Acc), sensitivity (Sen), specificity (Spe) and Matthews correlation coefficient (Mcc).

Nevertheless, since WM2 is not statistically well-founded, we turned to neural nets, trying to include the diresidue signal that arises from WM2. Two novel diresidue neural network architectures were developed, the first (dnn1) with only one hidden layer (called the diresidue hidden layer) fully connected to the output unit, and the second (dnn2) provided with a second hidden layer containing five units that collect all the output from the diresidue hidden layer. More details can be found in the Methods section. The input includes evolutionary and secondary structure information; therefore each residue in the length w window is encoded by 23 units (including the three units necessary for the secondary structure information encoding).
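The sparse wiring of the diresidue hidden layer can be made concrete with a short Python sketch. It is an illustration only, not the SNNS network used in the paper: the names `diresidue_connections` and `diresidue_layer` are hypothetical, the 23 inputs of each residue are assumed to be stored contiguously, and a logistic activation is assumed.

```python
import math
from itertools import combinations

def diresidue_connections(w, units_per_residue=23):
    """Map each position pair (i, j), i < j, to the indices of the
    2 * units_per_residue input units its hidden unit is connected to."""
    u = units_per_residue
    conns = {}
    for i, j in combinations(range(w), 2):
        conns[(i, j)] = (list(range(i * u, (i + 1) * u))
                         + list(range(j * u, (j + 1) * u)))
    return conns

def diresidue_layer(x, conns, weights, biases):
    """Forward pass of the diresidue layer: one logistic unit per pair,
    each seeing only the inputs of its own two positions."""
    out = {}
    for pair, idx in conns.items():
        z = biases[pair] + sum(wgt * x[k] for wgt, k in zip(weights[pair], idx))
        out[pair] = 1.0 / (1.0 + math.exp(-z))
    return out
```

For w = 11 this yields 11 × 10 / 2 = 55 hidden units with 46 incoming connections each, compared with 253 connections per unit in a fully connected layer over the same 11 × 23 inputs. In dnn1 the outputs of these units would feed the output unit directly; in dnn2 they would feed the second hidden layer of five units.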
The performance of the different neural networks on the whole dataset, and on subsets of monomers having the same number of bonds B (2, 3, 4 or 5), or belonging to the same SCOP fold, is shown in Tables 6 and 7; ROC curves for the putative disulfide bond scores are shown in Figures 2 and 3. For all values of B, the Matthews correlation coefficient and Q_p obtained using dnn2 are better than those of all the other attempted approaches. In general, the performance drops when analyzing subsets with a greater number of bonds B (with the remarkable exception of B = 4) and subsets containing a small number of examples (α, α/β, peptides and unclassified proteins). Tables and figures showing the performance on sub-subsets of proteins belonging to the same SCOP fold and having the same number of bonds B are provided in the Supplementary information section (Tables 8-14 and Figs 4-10); the same general trend can be seen (the performance drops when B grows), even though in some cases the performance may be artificially good or poor since the number of monomers belonging to certain sub-subsets may be too low. In some cases (α + β having 5 bonds, peptides with 4 or 5 bonds) the sub-subsets contain too few monomers to run the 4-fold cross-validation. The number of proteins in each subset is shown in Table 1. All the previously described approaches take a putative disulfide bond as input and produce a score as output. To obtain a connectivity prediction we followed the idea of Fariselli and Casadio (Fariselli and Casadio, 2001; Fariselli et al., 2002), applying the Edmonds-Gabow maximum weight matching algorithm using Rothberg's wmatch (C++ program wmatch, weighted). The PSSM or neural network scores are used to weight the edges of a graph whose vertices are all the cysteines of a protein; these scores are suitably scaled. Tables 6 and 7 show the results of the application of wmatch; the fraction of proteins for which the prediction is correct is shown by the Q_p index.
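The step from pair scores to a connectivity pattern can be sketched as follows. This Python sketch is not wmatch and does not implement the Edmonds-Gabow algorithm: since a chain with at most five bonds has at most ten half-cystines (945 possible perfect matchings), it simply enumerates all perfect matchings, which gives the same optimum on such small graphs. The function names, the use of `frozenset` keys for edges and the linear rescaling to [0, 100] are assumptions made for the illustration; a greedy variant is included only for comparison.

```python
def scale_scores(scores, lo=0.0, hi=100.0):
    """Linearly rescale edge scores (which may be negative) to [lo, hi]."""
    smin, smax = min(scores.values()), max(scores.values())
    if smax == smin:
        return {edge: hi for edge in scores}
    return {edge: lo + (s - smin) * (hi - lo) / (smax - smin)
            for edge, s in scores.items()}

def best_matching(nodes, weight):
    """Exhaustive maximum weight perfect matching; expects an even number
    of nodes and a weight for every pair, keyed by frozenset((a, b))."""
    nodes = list(nodes)
    if not nodes:
        return 0.0, []
    first, rest = nodes[0], nodes[1:]
    best = (float("-inf"), [])
    for k, partner in enumerate(rest):
        sub_total, sub_pairs = best_matching(rest[:k] + rest[k + 1:], weight)
        total = weight[frozenset((first, partner))] + sub_total
        if total > best[0]:
            best = (total, [(first, partner)] + sub_pairs)
    return best

def greedy_matching(weight):
    """Greedy alternative: repeatedly take the highest scoring free pair."""
    used, pairs = set(), []
    for edge in sorted(weight, key=weight.get, reverse=True):
        if not (edge & used):
            used |= edge
            pairs.append(tuple(sorted(edge)))
    return sorted(pairs)
```

As a usage example with four cysteines at positions 3, 10, 25 and 40 and hypothetical scores 0.9 for (3, 10), 0.8 for (25, 40), 0.95 for (3, 25), 0.1 for (10, 40), 0.2 for (3, 40) and 0.3 for (10, 25), the exhaustive matching selects the bonds 3-10 and 25-40, whereas the greedy variant commits to 3-25 first and is left with the weak pair 10-40.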
The results show that, in general, dnn2 performs better than the other methods. Nevertheless, it should be noted that WM2 (a much simpler and faster approach, though not statistically well founded) often leads to results comparable with or even better than dnn2 (in the cases of B = 2, 3 and 4, and for the subsets α + β and small proteins).

3.1 Comparison of methods
A direct comparison with the Fariselli and Casadio disulfide connectivity prediction method (Fariselli and Casadio, 2001; Fariselli et al., 2002) and with the recent work of Vullo and Frasconi (2004) is immediate, since we used the same protein monomer subsets and evaluated the performance using the same indicators.

Table 6. Disulfide connectivity prediction performance of different algorithms
Column groups: NN prediction (NN2, dnn1, dnn2); NN prediction (PSIPRED) (NN2, dnn1, dnn2); Connectivity prediction (+ wmatch) (NN2, dnn1, dnn2, WM1, WM2); Connectivity prediction (PSIPRED) (NN2, dnn1, dnn2)
Row groups: B = 2, B = 3, B = 4, B = 5, B = (2,...,5); each group reports Acc, Sen, Spe, Mcc and Qp (Qp is ND for the six NN prediction columns)
[table values missing in source]
Comparison of the performance of NN2, dnn1, dnn2, WM1 and WM2. All these prediction methods are applied to subsets of the dataset having a number of bonds B = 2, 3, 4 or 5, and to the whole dataset [B = (2,...,5)]. Performance is evaluated by means of accuracy (Acc), sensitivity (Sen), specificity (Spe), Matthews correlation coefficient (Mcc) and the fraction of proteins for which the prediction is perfect, Q_p (only when wmatch is applied). The first six columns show the performance of different neural networks in correctly distinguishing true from false disulfide bonds when the secondary structure annotations are extracted using DSSP (columns 2-4) or predicted using PSIPRED (columns 5-7). The last eight columns (columns 8-15) show the performance of the connectivity prediction obtained by applying wmatch to the scores of the neural networks or of the monoresidue and diresidue weight matrices, suitably scaled (see text for details); the results of columns 13-15 are obtained using PSIPRED-predicted secondary structure information.

The fraction of protein chains for which the whole prediction is correct (the Q_p value) in our method is comparable with that obtained for all the monomer subsets in Vullo and Frasconi (2004), and in some cases [for B = 4, B = (2,...,5)] our method outperforms the Vullo and Frasconi technique. We were able to compute ROC curves for the Fariselli and Casadio method.
The Fariselli Casadio program CONPRED, when run on input consisting of symmetric flanking regions of an even number 2m of half-cystines, outputs two parts: (1) neural network scores for each of the m(m 1)/2 possible disulfide bond pairs, and (2) an assignment of disulfide bonding pattern, obtained by applying Edmonds Gabow maximum weight matching to (1). In Fariselli and Casadio (2001), disulfide bond partner assignment from CONPRED was shown to compare very favorably with respect to random assignment. To obtain ROC curves for the Fariselli Casadio method, we repeatedly ran CONPRED on symmetric half-cystine flanking regions extracted from files in the dataset, parsed the output of (1) and used DSSP disulfide bond annotation to tabulate true positive and true negative rates, necessary for ROC sensitivity/specificity values. In Figure 2 (bottom right panel), we superpose the calculated ROC curves for Fariselli Casadio CONPRED with window size w = 5 (FC5), w = 7 (FC7), w = 11 (FC11) and w = 15 (FC15) about each half-cystine, with the dnn2 ROC curve. 3.2 DiANNA web server We developed a web server, called DiANNA, that provides three services: cysteine oxidation state, disulfide bonds and disulfide connectivity prediction. The oxidation state prediction is an implementation of the procedure of Fariselli et al. (1999) described above. Evolutionary information is collected by aligning the user-submitted sequence to SWISS-PROT sequences using PSI-BLAST. Our disulfide bond connectivity prediction software is a web server that implements the diresidue neural network previously described (dnn2), fully trained with symmetric flanking regions of N-terminus 2341

and C-terminus half-cystines augmented with residue secondary structure and evolutionary information. Given two windows of size w, centered at an N-terminus putative half-cystine and at the respective C-terminus putative half-cystine, we run PSIPRED on the whole sequence to predict the secondary structure (helix, coil, sheet) of each of the 2w residues; we then use the PSI-BLAST run performed by PSIPRED to produce the profile of each position 1 ≤ i ≤ 2w [we tested the accuracy of the PSIPRED prediction against the DSSP annotations on the entire dataset, obtaining an accuracy of around 76%, similar to that claimed by Jones (1999)]. The connectivity prediction is obtained by wmatch as previously described.

Table 7. Disulfide connectivity prediction performance of different algorithms

Column groups: NN prediction (NN2, dnn1, dnn2); Connectivity prediction, + wmatch (NN2, dnn1, dnn2, WM1, WM2).
Row blocks: α, α/β, α + β, β, Small protein, Peptides and Unclassified; each block reports Acc, Sen, Spe, Mcc and Qp, with Qp given as ND in the three NN prediction columns.

Comparison of the performance of WM1, WM2, NN2, dnn1 and dnn2. All these prediction methods are applied to subsets of the dataset following the SCOP structural classification. The secondary structure information used for training the neural networks is predicted by means of PSIPRED. Columns 2-4 show the performance of different neural networks in correctly distinguishing true from false disulfide bonds. Columns 5-7 show the performance of the connectivity prediction obtained by applying wmatch to the scores of the neural networks. The last two columns show the connectivity prediction performance of the mono-residue and diresidue weighted matrices, obtained by applying wmatch to the suitably scaled PSSM scores (see text for details). Performance is evaluated by means of Acc, Sen, Spe, Mcc and Qp (only when wmatch is applied).
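The matching step that turns pair scores into a bonding pattern can be illustrated with a small sketch. This is not wmatch and not the Edmonds-Gabow algorithm: it is an exhaustive search over all perfect matchings, which is only practical for the small numbers of bonds considered here (for B = 5 there are 945 candidate patterns). The function name and the toy scores are hypothetical.

```python
def best_pattern(n, score):
    """Return (total, pairs) for the maximum-weight perfect matching.

    n is the (even) number of half-cystines, indexed 0..n-1, and score maps
    each pair (i, j) with i < j to the predicted score of the bond i-j.
    """
    if n % 2:
        raise ValueError("need an even number of half-cystines")

    def search(free):
        # No cysteines left: the empty pattern has total score zero.
        if not free:
            return 0.0, []
        first, rest = free[0], free[1:]
        best_total, best_pairs = float("-inf"), []
        # Pair the first free cysteine with every possible partner in turn.
        for k, partner in enumerate(rest):
            sub_total, sub_pairs = search(rest[:k] + rest[k + 1:])
            total = score[(first, partner)] + sub_total
            if total > best_total:
                best_total, best_pairs = total, [(first, partner)] + sub_pairs
        return best_total, best_pairs

    return search(tuple(range(n)))

# Hypothetical pair scores for a chain with four half-cystines (B = 2).
toy_scores = {
    (0, 1): 0.9, (0, 2): 0.2, (0, 3): 0.4,
    (1, 2): 0.5, (1, 3): 0.1, (2, 3): 0.8,
}
total, pairs = best_pattern(4, toy_scores)
print(pairs)  # the pattern pairing cysteine 0 with 1 and cysteine 2 with 3
```

For larger numbers of cysteines, a polynomial-time maximum weight matching algorithm such as the one cited in the text is the appropriate choice; the exhaustive version is shown only because it makes the objective explicit.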
To test how the use of predicted secondary structure (instead of that extracted from DSSP annotations) affects the performance of the neural network, we trained and tested the networks by 4-fold cross-validation on the same dataset as before, using PSIPRED predictions and evolutionary information. The results show an increase in the false

positive rate; nevertheless, the performance is not dramatically affected (Table 6 last columns, Table 7). Standard neural networks (NN2) seem to be more affected by the reduced accuracy of the secondary structure annotation, while diresidue NN performance is still rather good.

Fig. 2. Disulfide connectivity prediction ROC curves. ROC curves for the NN2, dnn1 and dnn2 neural networks and the WM1 and WM2 position-specific scoring matrices on the whole dataset and on subsets having the same number of disulfide bonds B. The last panel (bottom right) shows ROC curves for dnn2 compared with ROC curves for CONPRED with window size w = 5 (FC5), w = 7 (FC7), w = 11 (FC11) and w = 15 (FC15).

To test DiANNA on a dataset containing proteins that may or may not have disulfide bonds, in order to have an unbiased evaluation of the performance, we used the PDBselect25 (Hobohm and Sander, 1994) list of monomers. Out of 1769 monomers in PDBselect25, 1011 have at least two cysteines; of this number

Fig. 3. Disulfide connectivity prediction ROC curves for different protein folds. ROC curves for the NN2, dnn1 and dnn2 neural networks, and the WM1 and WM2 position-specific scoring matrices, on subsets following the SCOP structural classification. Secondary structure information used in the neural network training is predicted from the sequence using PSIPRED, as described in the text.

(1011), 392 contain at least one disulfide bond. The perfect prediction fraction, Qp, for DiANNA is rather low (0.227); the errors mainly arise from proteins that have only free cysteines (Table 8, upper panel). To improve the performance of DiANNA, we included an initial filtering step: only those monomers that have at least two predicted half-cystines (according to the oxidation state prediction tool described above) are submitted to DiANNA. The filtering step eliminates 580 of the 1011 monomers, of which 65 nonetheless contain a disulfide bond.

Table 8. Disulfide connectivity prediction performance of the diresidue neural network dnn2 on PDBselect25

Columns: All B, B = 1, B = 2, B = 3, B = 4, B = 5, B = 6, B = 7, B = 8, B = 9, B = 12.
Panels (top to bottom): W.F., Ox.S.F., Ox.S.F. 2, PSIPRED; each panel reports Acc, Sen, Spe, Mcc and Qp.

The performance of the fully trained dnn2 has been tested on a non-redundant dataset composed of monomers that may or may not contain disulfide bonds. Column 2 shows the results obtained on the whole dataset, while the remaining columns refer to subsets having the same number B of disulfide bonds. The upper panel shows the results on the entire PDBselect25, while the second panel from the top shows the performance after a preliminary filtering step that deletes from the dataset all monomers that have fewer than two predicted half-cystines, using a neural network trained for oxidation state prediction. The third panel from the top shows the performance of dnn2 when a neural network trained for oxidation state prediction is used to filter out individual cysteines that are predicted to be free cysteines; therefore only pairs of predicted half-cystines are submitted to dnn2. The bottom panel shows the performance of dnn2 using the same filtering procedure described for the second panel, when the secondary structure is predicted by means of PSIPRED.
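The preliminary filtering step, and the bookkeeping that follows from the counts quoted in the text, can be sketched as below. The function name, the chain identifiers and the per-chain counts of predicted half-cystines are hypothetical; only the totals 1011, 392, 580 and 65 come from the text.

```python
def select_for_connectivity(predicted_half_cystines, minimum=2):
    """Return ids of chains with at least `minimum` predicted half-cystines."""
    return [chain for chain, count in predicted_half_cystines.items()
            if count >= minimum]

# Hypothetical toy input: chain id -> number of predicted half-cystines.
toy = {"chainA": 0, "chainB": 4, "chainC": 1, "chainD": 2}
kept = select_for_connectivity(toy)  # chains passed on to the connectivity step

# Bookkeeping with the counts quoted in the text.
with_two_cysteines = 1011      # monomers having at least two cysteines
with_disulfide = 392           # of these, monomers with at least one bond
removed = 580                  # monomers eliminated by the filter
removed_with_disulfide = 65    # eliminated although they contain a bond

remaining = with_two_cysteines - removed
remaining_with_disulfide = with_disulfide - removed_with_disulfide
remaining_without_disulfide = remaining - remaining_with_disulfide
print(remaining, remaining_with_disulfide, remaining_without_disulfide)
```

Under these counts, the chains that survive the filter still include some without any disulfide bond, which is why a global Qp and a Qp restricted to disulfide-containing proteins are reported separately.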
Testing our algorithm on the remaining 431 chains, we obtained a very good global Qp value (0.627); nevertheless, the performance of DiANNA restricted to those proteins that contain disulfide bonds (B ranging from 1 to 12) is lower (0.298). We also tried an approach similar to that used in the earlier papers of Fariselli and Casadio (Fariselli and Casadio, 2001; Fariselli et al., 2002), filtering out individual cysteines that are predicted to be free cysteines, but the improvement in performance is less pronounced. Using DSSP-derived instead of PSIPRED-predicted secondary structure annotations leads to a remarkable improvement in performance (global Qp = 0.674; disulfide-bond-containing proteins Qp = 0.418). All the results are shown in Table 8.

4 CONCLUSION

In this paper, we show how to use secondary structure annotations to improve disulfide bond partner prediction in a protein given only its amino acid sequence. Even if the secondary structure is predicted by a machine learning approach instead of being derived from the known three-dimensional (3D) structure, the performance of the prediction is still remarkable. This allows the reliable application of this procedure to proteins for which the structure is still unknown. Nevertheless, it should be noted that the performance of the software depends strongly on the availability of protein sequences related to the monomer analyzed (both for deriving evolutionary information and for a good-quality secondary structure prediction); this limitation is, however, inherent in all disulfide connectivity prediction methods available to date. A novel diresidue neural network architecture is used to simulate the strong performance of diresidue position-specific matrices trained on the same dataset. These neural networks can be applied to all those problems in which a diresidue architecture seems to represent a better model of the analyzed system (as in protein cleavage sites; Clote, 2003).
In addition, these diresidue neural networks require less training time than fully connected networks consisting of the same number of units. In some cases, the diresidue PSSM performs as well as the neural network approach. Built in a modular fashion, our method combines two signals (diresidue and secondary structure) that are very different in nature, and is comparable with, and in some cases better than, the current state-of-the-art methods. Tested on a real case (a list of monomers

that may or may not have disulfide bonds, and using predicted, rather than real, secondary structure annotations), the performance is still good, with a perfect prediction ratio higher than 60%.

ACKNOWLEDGEMENTS

We would like to especially thank P. Fariselli for furnishing us with the executable code for CONPRED, M. Sison for some programming in our initial exploratory approach, S. Alvarez for a reference, M. Muskavitch for a discussion about disulfide bonding in the proteins Delta and Notch, and the referees for useful comments and suggestions.

REFERENCES

Altschul,S.F. et al. (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res., 25.
Andreeva,A. et al. (2004) SCOP database in 2004: refinements integrate structure and sequence family data. Nucleic Acids Res., 32 (Database issue), D226-D229.
Anfinsen,C.B. (1973) Principles that govern the folding of protein chains. Science, 181.
Berman,H.M. et al. (2002) The Protein Data Bank. Acta Crystallogr. D, 58.
Boeckmann,B. et al. (2003) The SWISS-PROT protein knowledgebase and its supplement TrEMBL in Nucleic Acids Res., 31.
Bulyk,M.L. et al. (2002) Nucleotides of transcription factor binding sites exert interdependent effects on the binding affinities of transcription factors. Nucleic Acids Res., 30.
Clote,P. (2003) Performance comparison of generalized PSSM in signal peptide cleavage site and disulfide bond recognition. In Bourbakis,N. (ed.), Proceedings of the Third IEEE Symposium on BioInformatics and BioEngineering (BIBE 03). IEEE.
Clote,P. and Backofen,R. (2000) Computational Molecular Biology: An Introduction. John Wiley and Sons, New York.
Fariselli,P. and Casadio,R. (2001) Prediction of disulfide connectivity in proteins. Bioinformatics, 17.
Fariselli,P. et al. (1999) Role of evolutionary information in predicting the disulfide-bonding state of cysteine in proteins. Proteins, 36.
Fariselli,P. et al.
(2002) A neural network based method for predicting the disulfide connectivity in proteins. In Damiani,E. (ed.), Knowledge Based Intelligent Information Engineering Systems and Allied Technologies (KES). IOS Press.
Fiser,A. and Simon,I. (2000) Predicting the oxidation state of cysteines by multiple sequence alignment. Bioinformatics, 16.
Fiser,A. et al. (1992) Different sequence environments of cysteines and half cystines in proteins. Application to predict disulfide forming residues. FEBS Lett., 302.
Frasconi,P. et al. (1998) A general framework for adaptive processing of data structures. IEEE Trans. Neural Networks.
Gabow,H. (1973) Implementation of Algorithms for Maximum Matching on Nonbipartite Graphs. PhD Thesis, Computer Science Department, Stanford University.
Gribskov,M. and Robinson,N. (1996) The use of receiver operating characteristic (ROC) analysis to evaluate sequence matching. Comput. Chem., 20.
Hobohm,U. and Sander,C. (1994) Enlarged representative set of protein structures. Protein Sci., 3.
Jones,D.T. (1999) Protein secondary structure prediction based on position-specific scoring matrices. J. Mol. Biol., 292.
Kabsch,W. and Sander,C. (1983) Dictionary of protein secondary structure: pattern recognition of hydrogen-bonded and geometrical features. Biopolymers, 22.
Karplus,K. (1995) Evaluating regularizers for estimating distributions of amino acids. In Proc. Int. Conf. Intell. Syst. Mol. Biol.
Lovasz,L. and Plummer,M. (1985) Matching Theory. Annals of Discrete Mathematics, 29, North-Holland Mathematics Studies 121. Elsevier Science Publishers B.V., Amsterdam.
Martelli,P.L. et al. (2002) Prediction of the disulfide bonding state of cysteines in proteins with hidden neural networks. Protein Eng., 15.
Mucchielli-Giorgi,M.H. et al. (2002) Predicting the disulfide bonding state of cysteines using protein descriptors. Proteins, 46.
Muskal,S.M. et al.
(1990) Prediction of the disulfide-bonding state of cysteine in proteins. Protein Eng., 3.
Petersen,M.T. et al. (1999) Amino acid neighbours and detailed conformational analysis of cysteines in proteins. Protein Eng., 12.
Rost,B. and Sander,C. (1993) Prediction of protein secondary structure at better than 70% accuracy. J. Mol. Biol., 232.
Vullo,A. and Frasconi,P. (2004) Disulfide connectivity prediction using recursive neural networks and evolutionary information. Bioinformatics, 20.
Zhang,M.Q. and Marr,T.G. (1993) A weight array method for splicing signal analysis. Comput. Appl. Biosci., 9.