SSP2: A Novel Software Architecture for Predicting Protein Secondary Structure


Giuliano Armano, Filippo Ledda, and Eloisa Vargiu
Department of Electrical and Electronic Engineering, University of Cagliari, Piazza d'Armi, I-09123, Cagliari, Italy

1 Introduction

Proteins carry out most of the cell functions, including cell signaling, particle transportation, catalytic reactions, and immune responses. The function of a protein is mainly related to its three-dimensional folding, also called the protein tertiary structure. Despite great research efforts, the folding mechanism is still far from being completely understood. Due to the strict relationship between protein function and structure, the prediction of tertiary structure has become one of the most important tasks in bioinformatics in recent years. In fact, notwithstanding the increase in experimental data on protein structures available in public databases, the gap between known sequences and known tertiary structures is constantly increasing.

1.1 Anfinsen's Dogma and Levinthal's Paradox

In principle, Anfinsen's dogma (Anfinsen, 1973) makes us confident that the primary structure of a protein, i.e., the corresponding sequence of amino acids, is all we need to infer the folding (at least for a globular protein). Simulating the folding process, which essentially consists of energy minimization in a water environment, seems to be a straightforward way to solve the problem. Unfortunately, according to the experiments performed by Levinthal (Levinthal & Rubin, 1968), the folding process has so many degrees of freedom that the number of alternative conformations is still intractable in computer simulations. This enormous difference between the actual speed of the folding process and the computational complexity of evaluating the corresponding model is also called Levinthal's paradox. It is worth pointing out that molecular simulators that use heuristics for reducing the search space have been developed, but the uncertainty about the degree of approximation of the actual structure limits their use to very short chains or to small perturbations around a known structure.

1.2 Protein Structure Prediction

In the absence of a deep model able to solve the folding problem, and with modeling at the molecular level still intractable, searching public databases is the first thing that a researcher in structural bioinformatics usually does when looking for a protein structure. The most important of these databases is the Protein Data Bank (PDB) (Berman et al., 2000). Unfortunately, the chances of finding the target protein listed in the PDB are not high. In fact, notwithstanding the increase in experimental data on protein structures, the gap between known sequences (about 13 million entries in UniProt in April 2010) and known tertiary structures (over 60,000 entries in PDB in May 2010) keeps increasing. Alternatively, the researcher can rely on the different tools and techniques for protein structure prediction that have been developed to fill the gap (for a comprehensive review, see (Rost & Sander, 1996)).

Given a target protein, the choice of which technique should be adopted to deal with the folding problem is mainly related to the observed degree of homology (i.e., evolutionary relationship) with known structures. In particular, when a template protein with a confidently high degree of homology with the target sequence is found (indirectly observable from the degree of sequence identity), a good sequence alignment between the template and the target (together with some interpolation) is all we need to infer its model. In the absence of a full template protein, but still in the presence of some sequence identity, more sophisticated techniques (Jones & Hadley, 2000) are used to compare the target against a library of structural templates, the fold with the highest matching score being assumed to be the one that best fits the target sequence. The most difficult case occurs when no clear homology relationship can be detected (very low sequence identity against known proteins). In this case, a specific approach is required, called ab initio to highlight the absence of useful information from biological repositories. This approach implies modeling all the energetics involved in the process of folding and then finding the structure with the lowest free energy.

1.3 Secondary Structure Prediction

The prediction of tertiary structure is still a very difficult task; thus, most methodologies concentrate on the simplified task of predicting the secondary structure (SSP hereinafter). In fact, knowledge of the secondary structure is a useful starting point for further investigating the problem of finding protein tertiary structures and functionalities. Although in principle the folding information is contained in the primary structure of a protein, no deep (and tractable) model able to fulfil Anfinsen's dogma has been found so far. As for shallow models (in particular, models that rely on statistical information), encoding methods play a fundamental role in representing the available information in a form directly usable by the underlying predictor. Despite the importance of encoding methods, we claim that other kinds of information should also be taken into account to perform SSP, with particular emphasis on the intrinsic correlation of amino acids that belong to the same secondary structure. Building on this analysis, we have devised an abstract software architecture called SSP2 that follows the guidelines imposed by all information sources deemed relevant to implement a successful secondary structure predictor. From a conceptual perspective, the architecture applies to some extent the fixed-point approach, whereas from a technological perspective, it revisits the classical PHD architecture with new insights that better explain why most successful predictors comply with it.

As the task of designing, implementing, and tuning secondary structure predictors is particularly critical, a suitable tool for rapid prototyping called Generic Architecture based on Multiple Experts (GAME) has been devised and implemented in advance. Specifically tailored for building secondary structure predictors, GAME is a stand-alone, portable, and easy-to-use integrated environment that permits rapid prototyping of systems based on multiple experts. From an implementation point of view, GAME relies on fully portable Java libraries and is released with an open source license as an unpack-and-run binary bundle. A system implemented with GAME typically consists of software experts that interact to provide the required functionality. As only containment (i.e., has-a) relations are supported in GAME, a system can always be represented by a tree in which any nonleaf node actually embeds one or more software experts according to their reference type (i.e., ground expert, refiner, or combiner). GAME mainly allows bioinformaticians to graphically connect and set up automated experts (i.e., classifiers or predictors), as well as to select suitable encoding methods, learning techniques, and output combination methods. Different indexes for performance evaluation are available, including those based on correlation coefficients and confusion matrices. K-fold cross validation is also supported, to improve the statistical significance of experimental results.

Beginning with GAME and SSP2, a suitable predictor called GAMESSP2 has been implemented, fully compliant with the SSP2 architecture.
Experiments run on a large set of proteins with sequence identity ≤ 25% report an accuracy typically higher than that of cutting-edge secondary structure predictors.

The chapter starts with a brief introduction to the main issues in protein secondary structure prediction (Section 2). Section 3 introduces the SSP2 generic architecture, and Section 4 briefly describes the GAME tool. Section 5 describes the GAMESSP2 system, with emphasis on implementation details and comparative experimental results. Section 6 presents the conclusions.

2 Main Issues in Secondary Structure Prediction

Secondary structures are local patterns of folding mainly related to hydrogen bonds, the most important of which are the rigid helix conformations (alpha helices) and the grouping of distant segments into compact sheets (beta sheets). Knowledge of the secondary structure has proven to be a useful starting point for further investigating the problem of protein tertiary structure and function prediction (McGuffin & Jones, 2003). In the next subsections, the main issues that characterize the problem of predicting protein secondary structures are analyzed, focusing on performance assessment, input representation, output representation, and prediction techniques.

2.1 Performance Assessment

Since the first secondary structure predictors appeared, performance assessment has been a major and, for a long time, controversial issue. In this section, the different aspects related to this issue are examined. In particular, the main kinds of errors occurring in SSP, together with the most relevant standard measures, testing issues, and expected improvements in the field, are discussed in the next paragraphs.

2.1.1 Kinds of Errors and Standard Evaluation Measures

Prediction errors can be divided into two main categories: (i) local errors, which occur when a residue is wrongly predicted, and (ii) structural errors, which occur when the structure is globally altered. The latter are the most unwanted, as secondary structure segments are the basic components of the three-dimensional structure of a protein. In particular, errors that alter the function of a protein should be avoided whenever possible. Unfortunately, they cannot be easily identified; thus, local measures or generic segment-superposition measures are usually adopted. $Q_3$ and Matthews Correlation Coefficients (MCCs) are commonly used measures of local errors, whereas the Segment Overlap Score (SOV), which accounts for the superposition of correct and predicted structure segments, is the best-known measure for structural errors. These measures have been adopted in the context of CASP (Moult et al., 1995) and EVA (Eyrich et al., 2001). They are reported below for the sake of completeness (h, e, and c stand for alpha-helices, beta-strands, and coils, respectively).

$Q_3$. It is a measure of accuracy, largely used for its simplicity, that accounts for the percentage of amino acids correctly predicted. It is defined as follows:

$$Q_3 = \frac{1}{3} \sum_{i \in \{h,e,c\}} Q_{3i} \qquad (1)$$

$$Q_{3i} = 100 \, \frac{tp_i}{N_i}, \quad i \in \{h,e,c\} \qquad (2)$$

where $N_i$ is the number of residues observed for structure category $i$, and $tp_i$ (i.e., true positives) is the corresponding number of correctly predicted residues.
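As an illustration of Equations (1) and (2), the following minimal Java sketch (our own illustration, not part of GAME; all names are hypothetical) computes $Q_3$ from an observed and a predicted HEC string:

// Minimal illustration of Equations (1)-(2): Q3 as the mean of the
// per-category accuracies Q3_i over h, e, and c.
public final class Q3 {
    public static double q3(String observed, String predicted) {
        double sum = 0.0;
        for (char category : new char[] {'H', 'E', 'C'}) {
            int n = 0, tp = 0;
            for (int i = 0; i < observed.length(); i++) {
                if (observed.charAt(i) == category) {
                    n++;                                        // N_i: residues observed in category i
                    if (predicted.charAt(i) == category) tp++;  // tp_i: correctly predicted residues
                }
            }
            // Q3_i, Equation (2); a category absent from the chain contributes 0 here
            sum += (n == 0) ? 0.0 : 100.0 * tp / n;
        }
        return sum / 3.0;  // Equation (1)
    }

    public static void main(String[] args) {
        System.out.println(q3("CHHHHCCEEEC", "CHHHCCCEEEC"));  // one mislabeled residue
    }
}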

SOV. Defined in (Zemla et al., 1999), it accounts for the predictive ability of a system by considering the overlap between predicted and actual structural segments.[1]

$C_h$, $C_e$, $C_c$. Defined in (Matthews, 1975), these measures rely on the concept of a confusion matrix. Evaluated on a specific secondary structure, each MCC gives an estimation of the prediction quality. For secondary structures, MCCs are defined as follows:

$$C_i = \frac{tp_i \, tn_i - fp_i \, fn_i}{\sqrt{(tp_i + fp_i)(tp_i + fn_i)(tn_i + fp_i)(tn_i + fn_i)}}, \quad i \in \{h,e,c\} \qquad (3)$$

where $tp_i$ represents the number of true positives, $tn_i$ the true negatives, $fp_i$ the false positives, and $fn_i$ the false negatives. Note that the MCC is not defined in the event that no occurrences of class $i$ are found.

2.1.2 SSP Testing: Which Proteins Should Be Used to Test SSP Systems?

Another fundamental issue concerns which protein sets should be used to test the predictors. Machine learning theory suggests that test samples must be different from the training samples, which in turn are expected to represent well the concept to be learned. This definition only partially fits the requirements that apply to SSP. The presence of homologues may alter the perceived performance, as predictors are mainly used on nonhomologues: if a homologue is available for the target protein, its structure can usually be assigned without the need for a secondary structure predictor. Hence, a secondary structure predictor should be tested with proteins that are not homologous with any of those used for training. In principle, this requires the analysis of protein families through structural classification resources, such as CATH (Orengo et al., 1997) or SCOP (Murzin et al., 1995). In most cases, a safe threshold of 25% sequence identity against the proteins used in the training set is applied while building the test set. This concept has become clear during CASP, the biennial experiment whose purpose is to assess advances in protein structure prediction. After CASP4, automated testing has been considered more appropriate for secondary structure predictors, giving birth to the EVA server, which automatically asks well-known prediction servers to predict target proteins before their actual structure is made publicly available.

2.1.3 Presentation of Results: Macro vs. Micro Averaging

When reporting experimental results, averages of $Q_3$, SOV, and MCCs over a given test set are usually given. Two kinds of averages can be computed: (i) by chain, in which measures are computed for every single chain and then averaged over the test set, and (ii) by residue, in which measures are computed for the whole test set, as if it were a single big chain (equivalently, an average-by-chain weighted by chain length). The average-by-chain usually leads to lower scores, as short proteins are usually more difficult to predict. Further problems may arise while evaluating SOV and MCCs. Average-by-residue scoring does not make sense for the SOV measure, although it may be equivalently computed as a weighted average. As for MCCs, the average-by-chain has to be interpreted with care: when the MCC is not defined, the maximum value (1.0) is usually taken, leading to an average whose absolute value does not reflect the real performance for the different structure categories. Moreover, when the average-by-chain is applied, $C_e$ may turn out larger than $C_h$, giving the impression that beta-strands are easier to predict than alpha-helices. However, this is only a distortion, due to the fact that $C_e$ is undefined (and thus replaced by the maximum value) for more chains than $C_h$.

[1] A detailed description of the SOV definition is available at sec.html
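Equation (3), together with the undefined-MCC caveat discussed above, can be sketched as follows (our own illustration, not GAME code); note that the undefined case is reported explicitly rather than silently replaced with 1.0:

import java.util.OptionalDouble;

// Illustration of Equation (3). An empty result signals that the MCC is
// undefined (the denominator vanishes when class i never occurs), instead
// of the misleading default of 1.0 discussed above.
public final class Mcc {
    public static OptionalDouble mcc(long tp, long tn, long fp, long fn) {
        double denom = Math.sqrt((double) (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn));
        if (denom == 0.0) return OptionalDouble.empty();  // MCC undefined for this chain/class
        return OptionalDouble.of((tp * tn - fp * fn) / denom);
    }

    public static void main(String[] args) {
        System.out.println(mcc(50, 40, 5, 5));  // a well-defined case
        System.out.println(mcc(0, 95, 0, 5));   // class never predicted as positive: undefined
    }
}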

2.1.4 Asymptotic Limit and Expected Improvement

Rost (Rost, 2001) asserted that the limit on the maximum $Q_3$ obtainable in SSP is about 88%. This limit is mainly due to the intrinsic uncertainty about structure labeling, which in turn is related to the dynamics of the target protein structure and to the thresholds used by standard programs (e.g., DSSP (Kabsch & Sander, 1983)) to perform the labeling task. While significant improvements have been obtained since the first predictors, no relevant steps forward have been made in the last 10 years toward the cited limit. Obtaining improvements of even a fraction of a percent has become challenging. Current state-of-the-art predictors claim a performance of about 77%-80% $Q_3$ and 73%-78% SOV, depending on the dataset adopted for testing. As for the structures to be predicted, alpha helices generally appear to be far easier to predict than beta strands, as reflected by the corresponding $C_h$ and $C_e$ values. A main problem in SSP is to capture the widespread correlations originating from beta sheets.

2.2 Input Representation

Predicting the overall structure from the target sequence alone is currently unfeasible, due to the inherent difficulty of obtaining a computable deep model that can characterize the folding process. For this reason, researchers have concentrated their efforts on studying shallow models, mainly based on statistics. These models rely on input representation (also called pre-processing or input encoding) as the main tool for representing the biological information contained in the primary structure in a form directly usable by the underlying predictor. There are two main ways to represent a protein sequence: (i) position independent, meaning that the same amino acid is always represented in the same way regardless of its position, and (ii) position specific, which typically exploits the evolutionary information contained in multiple alignments.

One-hot is a relevant representation typically used to encode nominal data, i.e., data characterized by (nonordered) multiple values. It uses n bits for representing an n-valued datum, with one bit for each value. A vector of 20 bits is required to encode a specific amino acid occurring in a protein (Qian & Sejnowski, 1988); thus, a protein of length M can be represented by a binary array of M × 20 bits. In so doing, a straightforward position-independent encoding is obtained for amino acids.

The capability of exploiting evolutionary information was the main innovation introduced in the 1990s for representing proteins. Homologous proteins share a common three-dimensional structure, and this is reflected in the similarity between primary sequences. As a result, given a protein sequence, a multiple alignment obtained from a set of similar sequences found in a sequence database is expected to contain information about the substitutions that occurred within the protein family during its evolution without compromising the structure. To predict the secondary structure, the substitution frequencies extracted from a multiple alignment considered representative of the target sequence at hand have proven to be effective inputs. With the i-th position of a target protein encoded as a real-valued vector of 20 elements (each approximating the probability that a given amino acid occurs at position i in the family to which the target protein belongs), a protein of length M can be represented by an M × 20 array of real values.
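To make the two representation styles concrete, here is a minimal Java sketch (our own illustration; the class and method names and the amino-acid ordering are arbitrary) of a one-hot encoding versus a frequency profile extracted from a multiple alignment:

// Illustration only: one-hot (position-independent) encoding vs. a frequency
// profile (position-specific) built from a multiple alignment.
public final class Encodings {
    static final String AA = "ARNDCQEGHILKMFPSTWYV";  // the 20 standard amino acids, arbitrary order

    /** Position-independent: an M x 20 binary array, one row per residue
        (assumes the sequence contains only standard residues). */
    public static double[][] oneHot(String sequence) {
        double[][] profile = new double[sequence.length()][20];
        for (int i = 0; i < sequence.length(); i++)
            profile[i][AA.indexOf(sequence.charAt(i))] = 1.0;
        return profile;
    }

    /** Position-specific: per-column substitution frequencies from a multiple
        alignment (rows of equal length; gaps and unknown symbols are skipped). */
    public static double[][] frequencyProfile(String[] alignment) {
        int m = alignment[0].length();
        double[][] profile = new double[m][20];
        for (int i = 0; i < m; i++) {
            int count = 0;
            for (String sequence : alignment) {
                int k = AA.indexOf(sequence.charAt(i));
                if (k >= 0) { profile[i][k] += 1.0; count++; }
            }
            if (count > 0)
                for (int k = 0; k < 20; k++) profile[i][k] /= count;  // normalize to frequencies
        }
        return profile;
    }
}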
Substitution frequencies are a notable example of position-specific encoding, as the same amino acid may be represented in different ways, depending on its position within the protein being encoded.

Throughout the history of secondary structure predictors, most of the reported improvements have been related to input representation. The use of evolutionary information in the task of SSP dates back to 1976 (Dickerson et al., 1976) and was applied in 1987 by Zvelebil et al. (Zvelebil et al., 1987). PHD (Rost & Sander, 1993a), the first system able to break the barrier of 70% $Q_3$, used multiple-alignment frequencies obtained from the HSSP databank (C. Sander, 1991) to encode target proteins. After PHD, frequency-based encoding based on multiple alignments has become the de facto standard input representation for SSP. A further improvement in PHD performance, specifically from 70% to 72% $Q_3$ (Rost, 1996), was obtained by adopting BLAST (Altschul et al., 1990) as the search algorithm and CLUSTALW (Higgins et al., 1994) as the alignment algorithm. PSIpred (Jones, 1999), inspired by PHD, has been able to reach an accuracy of 76% by directly using the Position-Specific Scoring Matrices (PSSMs) built by PSI-BLAST (Altschul et al., 1997), the iterative version of BLAST. Let us briefly recall that a PSI-BLAST search starts by invoking BLAST, which returns the first PSSM. At each subsequent iteration, PSI-BLAST uses the current PSSM to drive the search. In so doing, the PSSM is progressively refined, and the algorithm is able to improve its generalization ability by progressively discovering sequences that show a high degree of compatibility with the current PSSM. However, iterating the search may cause a drift that includes spurious sequences, possibly excluding sequences more related to the target sequence (the initial query) or even the target sequence itself. A small number of iterations (2-3) is often a good compromise between generalization ability and drift effect. The main improvement of PSSMs consists of refining the raw position-specific frequency counts with additional pseudo-counts, carrying position-independent information extracted from scoring matrices (e.g., BLOSUM (Henikoff & Henikoff, 1992)). The improvement obtained by PSIpred also depends on the fact that the protein database has been filtered by removing the potential causes of drift: redundant sequences, simple sequences, and transmembrane regions. After PSIpred, the PSSM has been acknowledged as the most effective encoding method available for SSP and has also been used by other software architectures and systems (e.g., SSPRO (Baldi et al., 1999), (Ward et al., 2003)). Predictors that exploit different encoding methods and formats have also been devised. For instance, Karplus et al. (Karplus et al., 1998) used an encoding built upon Hidden Markov Models (HMMs) (Rabiner & Juang, 1986), whereas Cuff and Barton (Cuff & Barton, 2000) assessed the impact of combining different encoding methods in the task of secondary structure prediction. We contributed to the encoding topic by proposing Sum-Linear Blosum (SLB) (Armano et al., 2009). In sum, current encoding methods cannot do without position-specific information.

An up-to-date input representation task typically involves the following activities: search, alignment, and encoding. Assuming that a multiple alignment ranges over M positions (from 0 to M-1) and contains L sequences (from 0 to L-1), the encoding process can be considered a function that maps the alignment to a profile of M real-valued vectors of 20 elements.

As a final comment on input representation, let us recall that a target sequence is typically turned into real-valued vectors to ease the adoption of well-known learning techniques, which operate on fixed-length inputs. As sketched in Figure 1, sliding windows are the most common technique for obtaining fixed-length real-valued vectors (slices hereinafter) from input profiles. It is worth noting that, while generating slices for the first amino acids, part of the sliding window actually points outside the input profile. For a window of length 7, the first, second, and third amino acids require a specific treatment. The same problem occurs while encoding the amino acids at positions M-1, M-2, and M-3. A typical solution to this problem consists of adding an additional input entrusted with making the border-line conditions explicit.
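The sliding-window mechanism, including the extra border input just mentioned, can be sketched as follows (our own illustration, not GAME code; 21 values per position, i.e., 20 profile values plus a border flag):

// Illustration of slice extraction with a sliding window of odd length w.
// Each window position contributes 21 values: 20 profile values plus a
// border flag set to 1 when the window points outside the input profile.
public final class SlidingWindow {
    public static double[] slice(double[][] profile, int center, int w) {
        int half = w / 2;
        double[] out = new double[w * 21];
        for (int j = -half; j <= half; j++) {
            int pos = center + j;
            int base = (j + half) * 21;
            if (pos < 0 || pos >= profile.length) {
                out[base + 20] = 1.0;  // outside the sequence: raise the border flag only
            } else {
                System.arraycopy(profile[pos], 0, out, base, 20);
            }
        }
        return out;  // e.g., w = 15 yields a slice of 15 * 21 = 315 values
    }
}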
2.3 Output Representation

Similar to input data, output data are typically encoded so as to be suitably dealt with by the underlying prediction algorithm. DSSP is the most widely used set of annotations for secondary structures, introduced by the homonymous automatic labeling tool. The output categories of DSSP are H, G, I, E, B, T, S, and C (the latter actually being a "none" assignment, represented by white space). In SSP, a simplified version of DSSP, say DSSP_HEC, is typically adopted, which maps each of the eight initial categories to alpha-helix (H), beta-strand (E), or coil (C). DSSP_HEC is related to DSSP through a lookup table. As no lookup table is universally accepted, in this chapter we refer to the most acknowledged one, reported below:

DSSP      H  G  I  E  B  T  S  C
DSSP_HEC  H  H  H  E  E  C  C  C
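The lookup table above translates directly into code; a minimal Java sketch (ours, with hypothetical names) follows:

// DSSP-to-HEC reduction according to the lookup table reported above.
public final class DsspToHec {
    public static char reduce(char dssp) {
        switch (dssp) {
            case 'H': case 'G': case 'I': return 'H';  // helix categories
            case 'E': case 'B':           return 'E';  // strand/bridge categories
            default:                      return 'C';  // T, S, and C/none map to coil
        }
    }

    public static String reduce(String dssp) {
        StringBuilder hec = new StringBuilder(dssp.length());
        for (int i = 0; i < dssp.length(); i++) hec.append(reduce(dssp.charAt(i)));
        return hec.toString();
    }
}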

Figure 1: Using sliding windows to extract fixed-length slices from input profiles: a window centered on the fifth amino acid of the target sequence.

Once a protein sequence is annotated, several choices can be made about the actual encoding. The most common (adopted, for instance, by PHD) uses a one-hot representation, with H = <1,0,0>, E = <0,1,0>, and C = <0,0,1>. As an actual predictor outputs real-valued (instead of binary) triples, a measure of reliability for every single prediction can also be obtained in a simple way. It has been statistically shown that, for predictors based on artificial neural networks (ANNs) (Rumelhart et al., 1986), the difference between output values is proportional to their accuracy (Rost & Sander, 1993b). In this case, the reliability can be expressed through a confidence value $\chi$ (also called reliability index), defined as follows:

$$\chi_i = 10 \, \big(\max(\tilde{v}_i) - \mathrm{next}(\tilde{v}_i)\big)$$

where (a) $\tilde{v}_i$ is the triple representing the output of a predictor at position $i$, which is expected to approximate the adopted one-hot encoding, and (b) $\max(\tilde{v}_i)$ and $\mathrm{next}(\tilde{v}_i)$ denote the maximum value in the triple and the value closest to it, respectively.

Among the other proposals for output encoding, let us recall the coding schema proposed in (Chandonia & Karplus, 1995), which denotes a category in {H,E,C} with H = <1,0>, E = <0,1>, and C = <0,0>. According to this definition, coil is implicitly asserted during a prediction when none of the values in the output pair exceeds a given threshold. A further notable encoding (Petersen et al., 2000) considers three adjacent positions, i.e., a window of width 3 along an HEC profile.

2.4 Prediction Techniques

Many prediction techniques are available off-the-shelf to deal with SSP. Nevertheless, the problem requires great attention when selecting the technique that best fits the devised software architecture. SSP is a peculiar problem, whose most notable characteristics are as follows:

Large input space: In a typical setting, i.e., a sliding window of length 15 with a PSSM encoding, the input dimension is 15 × 21 = 315. The additional value (21 instead of 20) comes from the fact that an extra input is typically used to encode positions that lie outside the sequence.

Large training set: The protein structures available in the PDB number about 60,000. Upon removal of homologous sequences, a typical dataset contains about 5,000 proteins. Assuming an average length of 250 residues per protein, the overall number of data samples exceeds 1 million. The size of the training set is clearly a limitation for some computational methods. Moreover, predictors that require all samples to be stored in memory at the same time cannot be trained with all the available data.

Low input/output correlation: Although increased by injecting information about multiple alignments (in the representation of inputs), the correlation is still very difficult to identify when resorting to shallow models. Enlarging slices might be assumed to facilitate the task of learning the actual I/O relation. Unfortunately, larger slices do not necessarily entail better predictions: whereas sliding-window enlargement may actually strengthen the correlation between input and output, the generalization capability of a predictor may be negatively affected by this change (for further discussion about sliding windows, see, for example, (Chandonia & Karplus, 1996)).

Discrepancy between training set and test set: One of the main postulates of machine learning is that training inputs must be highly representative of the whole input space; the higher the representativeness, the better the expected performance of a classifier or predictor. However, for secondary structure prediction, this assumption must be relaxed to some extent, because we are only interested in predicting proteins whose structure is different from that of any other known protein.

Labeling noise: Due to protein mobility, measurement and/or automatic labeling errors introduce some aleatory behavior superimposed on the correct data used for learning.

In the literature, various techniques have been applied to secondary structure prediction. Early prediction methods rely on the propensity of amino acids to belong to a given secondary structure (Chou & Fasman, 1974). A second generation of predictors exhibits better performance by exploiting protein databases as well as statistical information about amino acid subsequences. The methods in this category can be classified according to: (i) the underlying approach, including statistical information (Robson, 1976), graph theory (Mitchell et al., 1992), multivariate statistics (Kanehisa, 1988), linear discriminant analysis (King & Sternberg, 1996), K-Nearest Neighbors (Yi & Lander, 1993; Salamov & Solovyev, 1995), and ANNs (Qian & Sejnowski, 1988); (ii) the kind of information actually taken into account, including physico-chemical properties (Ptitsyn & Finkelstein, 1983) and sequence patterns (Taylor & Thornton, 1983); and (iii) the prediction strategies and architectures, including (Bohr et al., 1988), (Qian & Sejnowski, 1988), (Holley & Karplus, 1989), and (Chandonia & Karplus, 1995). Third-generation predictors usually exploit evolutionary information in the form of profiles obtained by multiple alignments. Let us recall some predictors based on ANNs: (Levin et al., 1993), (Rost & Sander, 1993a), (Rost & Sander, 1993b), (Solovyev & Salamov, 1994), (Riis & Krogh, 1996), (Frishman & Argos, 1997), and (Salamov & Solovyev, 1997). Other techniques have also been investigated, including bidirectional recurrent ANNs (Baldi et al., 1999), HMMs (Bystroff et al., 2000), Support Vector Machines (SVMs) (Ward et al., 2003), meta-predictors (Cuff & Barton, 1999), populations of hybrid experts (Armano et al., 1996), ANNs trained on heterogeneous input encodings (Cuff & Barton, 2000), ad-hoc combinations of different prediction strategies (Lin et al., 2005), and hybrid ANN/dynamic Bayesian network systems (Yao et al., 2008). For more information about the most acknowledged techniques in predicting secondary structures, consult Rost's review papers (Rost, 2000; Rost, 2001).
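To close the discussion of output handling, the MAP decoding and the reliability index $\chi$ introduced in Section 2.3 can be sketched as follows (our own illustration, not GAME code):

// Illustration of MAP decoding of an output triple <vH, vE, vC> and of the
// reliability index chi = 10 * (max - next), as discussed in Section 2.3.
public final class Decoder {
    static final char[] CLASSES = {'H', 'E', 'C'};

    /** Returns the HEC label with the highest output value (MAP criterion). */
    public static char map(double[] v) {
        int best = 0;
        for (int i = 1; i < v.length; i++) if (v[i] > v[best]) best = i;
        return CLASSES[best];
    }

    /** Reliability index: ten times the gap between the two largest outputs. */
    public static double chi(double[] v) {
        double[] sorted = v.clone();
        java.util.Arrays.sort(sorted);  // ascending: sorted[2] = max, sorted[1] = next
        return 10.0 * (sorted[2] - sorted[1]);
    }

    public static void main(String[] args) {
        double[] v = {0.7, 0.2, 0.1};                      // a confident helix prediction
        System.out.println(map(v) + ", chi = " + chi(v));  // prints: H, chi = 5.0
    }
}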

3 The SSP2 Architecture

In this section, we introduce the SSP2 architecture, which aims to capture all the information sources we deem relevant for implementing a successful secondary structure predictor. From a conceptual perspective, the architecture applies to some extent the fixed-point approach, whereas from a technological perspective, it revisits and improves the classical PHD architecture with new insights that can better explain why most successful predictors comply with it.

3.1 Information Sources for SSP

A plethora of methods for secondary structure prediction can be found in the literature, dealing with the problem from different perspectives. In our view, no matter which perspective is taken, any actual system concerned with finding a shallow model for predicting secondary structures should exploit the main sources of correlation related to the SSP problem, which are as follows:

sequence-to-sequence. The intra-sequence correlation is very difficult to exploit, as clearly shown by (Crooks & Brenner, 2004). The inter-sequence correlation depends on the existing evolutionary relationships (the greater the degree of homology, the higher the correlation). Multiple alignment is a fundamental technique aimed at highlighting this kind of correlation even in the absence of a clear homology relationship.

sequence-to-structure. Determining helpful rules or patterns that can make this kind of correlation explicit is very difficult. Exploiting this kind of information represents the primary goal of any technique or software architecture aimed at performing SSP. The corresponding I/O relation is typically estimated by resorting to strategies and techniques generically falling under the fields of machine learning or pattern recognition. In particular, ANNs have proven to be very effective in capturing this kind of information.

structure-to-structure. A strong intra-structure correlation holds among residues sharing the same secondary structure. This kind of correlation is always local, meaning that it depends on local interactions. Nevertheless, this principle of locality is not necessarily observable at the sequence level: whereas residues that belong to the same helix or to the same strand are, by definition, very close to each other at the sequence level as well, residues that belong to the same sheet but occur in different strands are typically not close at the sequence level. Despite these difficulties, we deem that this type of correlation should be taken into account while devising the output representation and the prediction algorithm. The inter-structure correlation depends on the existing evolutionary relationships (the greater the degree of homology, the higher the correlation). This kind of correlation is more conserved than the one occurring at the sequence level; in particular, two homologous proteins have very similar structures.

Each specific secondary structure predictor deals with the kinds of correlation reported above in specific ways. Let us discuss one of the most successful systems (i.e., PHD) from this perspective.

3.2 Revisiting the PHD Architecture

PHD is the archetype of a successful family of systems based on ensembles. Each component of the ensemble is a pipeline of two transformations (both implemented by an ANN): primary-to-secondary prediction (P2S) and secondary-to-secondary prediction (S2S). The overall transformation (i.e., P2S-S2S) is depicted in Figure 2.
In P2S, the ANN is trained with slices extracted through a sliding window of fixed length from an input profile, which in turn is obtained by encoding the multiple alignment that represents the target sequence.

Figure 2: PHD: a schematic view of a P2S-S2S transformation.[3]

It should be noted that the prediction problem is actually turned into a classification problem, due to the splitting of the target protein into fixed-length slices obtained using the sliding window. Thus, the central amino acid of a slice can be labeled in isolation, yielding a preliminary prediction obtained by annotating each amino acid of the target sequence with the results (position by position) of the classification. In S2S, the ANN is trained with fixed-length slices generated by the P2S module (again, through the use of a sliding window). The S2S transformation is also called refinement. In so doing, the problem is moved back from classification to prediction; information about the correlation that holds among amino acids belonging to the same secondary structure is taken into account, to some extent, while performing the S2S refinement.

Figure 3: PHD: Overall architecture

As multiple P2S-S2S transformations are involved in a prediction (to enforce diversity), a further level of combination is performed on the outputs of the available P2S-S2S pipelines. The actual prediction is issued by decoding the resulting (averaged) output profile with a maximum-a-posteriori probability (MAP) criterion. PHD adopts a jury decision after performing an average-by-residue on the output profiles. A schematic view of the overall architecture (with some details about the use of sliding windows) is shown in Figure 3.

[3] As already pointed out, when using PSSMs, the actual encoding is output by a PSI-BLAST search as a side effect of similarity search. Nevertheless, at least in principle, one can think of the encoding process as if it were performed on the resulting multiple alignment.
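The P2S-S2S-plus-jury organization can be summarized in code; the following is a hypothetical Java sketch (none of these names belong to PHD or GAME), where each stage stands for a trained ANN operating on per-residue profiles:

import java.util.List;

// Hypothetical sketch of the PHD scheme: an ensemble of P2S-S2S pipelines
// whose output profiles are averaged by residue before MAP decoding.
interface Stage {
    double[][] apply(double[][] profile);  // per-residue vectors in, HEC triples out
}

final class P2sS2sPipeline {
    private final Stage p2s, s2s;
    P2sS2sPipeline(Stage p2s, Stage s2s) { this.p2s = p2s; this.s2s = s2s; }

    double[][] predict(double[][] inputProfile) {
        return s2s.apply(p2s.apply(inputProfile));  // refinement follows prediction
    }
}

final class Jury {
    /** Average-by-residue over the ensemble's output profiles; the result is
        then decoded position by position with the MAP criterion. */
    static double[][] average(List<double[][]> outputs) {
        int m = outputs.get(0).length;
        double[][] avg = new double[m][3];
        for (double[][] out : outputs)
            for (int i = 0; i < m; i++)
                for (int k = 0; k < 3; k++)
                    avg[i][k] += out[i][k] / outputs.size();
        return avg;
    }
}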

In sum, the encoding process, the P2S transformation, and the S2S transformation can be considered successful attempts to deal with sequence-to-sequence, sequence-to-structure, and structure-to-structure correlation, respectively. Furthermore, at least in our view, the PHD architecture performs a two-level unfolding of a fixed-point strategy for secondary structure prediction. Two shortcomings arise from the PHD approach: (i) it relinquishes the possibility of feeding the S2S transformation with information coming from the input profile, and (ii) structure-to-structure correlation is not used while performing error backpropagation, because the output window considers only one position at a time (for outputs) in both the P2S and S2S transformations.

3.3 Introducing the SSP2 Architecture

Figure 4: Fixed-point view

Our proposal extends the PHD architecture and points to more effective ways of exploiting the relevant information sources. We deem that greater compliance with the fixed-point approach (Figure 4), together with the ability to inject structure-to-structure information during the training process, should yield better results. We devised an abstract architecture, called SSP2, that specifies a set of constraints to be fulfilled by any compliant system:

ensemble. A secondary structure predictor compliant with SSP2 must be composed of an ensemble of pipelines.

pipelines (fixed-point strategy). Each pipeline is expected to adopt a fixed-point strategy while making the number of unfolding steps explicit.

pipelines (output encoding). Each pipeline is expected to be trained also with information about structure-to-structure correlation, through suitable output encoding methods.

Figure 5 shows two specific kinds of pipelines: the first consists of a two-level unfolding and the second of a three-level unfolding (the encoding process and intermediate profiles are disregarded for the sake of simplicity). It should be noted that the two-level generic unfolding resembles the P2S-S2S pipeline reported in Figure 2 for PHD. The main difference is that the refiner is now also fed with inputs obtained by encoding the primary sequence. Alternative unfoldings are feasible, depending on the number of fixed-point steps and on the kinds of encoding/decoding methods applied along the unfolding. The possibility of making alternative implementation choices at each step thus gives rise to a great number of architectural alternatives. A thorough investigation of all systems compliant with the SSP2 abstract architecture is unfeasible. We concentrated our efforts on output representation, because it is less studied and is thus more promising for further improvements. To make experimental comparisons easier, a suitable tool that permits rapid prototyping in the field of SSP is required. The characteristics of this tool are summarized in the following section.

Figure 5: An SSP2 architecture with two and three levels of unfolding

4 GAME

GAME (Ledda et al., 2008) is a software architecture written in Java that supports rapid prototyping in building classifiers or predictors from a multiple-expert perspective. The tool provides a graphical environment that makes it easy to configure and connect automated experts (i.e., classifiers or predictors), so that non-experienced users can easily set up and run actual systems. Figure 6 shows a typical screenshot of the graphical user interface of GAME. The following features distinguish GAME from other widely used tools/frameworks (e.g., WEKA and MATLAB toolboxes):

Full support for modeling real-world problems. GAME allows modeling of the overall transformation required to solve a given problem, from raw data to a readable prediction. Data are loaded, coded with one of the available (or ad-hoc created) methods, processed, and then decoded into a usable format. Full support for improving the statistical significance of experimental results is also available, including K-fold cross validation.

Support for comparative experiments. Due to the modular architecture of GAME, its graphical interface allows changing the settings of any module at run-time and setting up batteries of experiments whose results are automatically logged for separate analysis.

Support for expert combination. Experts are defined as autonomous entities. They can be combined in multiple ways, making it easier to devise and implement actual systems characterized by complex software architectures. GAME supports devising various software architectures, thanks to its capability to combine together and/or refine already defined experts, giving rise to other experts that can be used in further combinations. It can supply different kinds of combination and meta-learning strategies, including averaging, stacking (Wolpert, 1992), bagging (Breiman, 1996), boosting (Schapire, 1999), ECOC (Allwein et al., 2001), and rotation forests (Rodriguez et al., 2006).

Just-in-time dataset construction. The dimension of training sets can be an issue for many real-life problems. This clearly arises not only in text and image processing but also in SSP, where the construction of sliding windows entails a great amount of data duplication. The possibility of defining smart, just-in-time dataset instances prevents a system from running out of memory (without any worsening in performance, thanks to a caching mechanism).

Support for final releases. Data are managed in their natural format; as a consequence, systems under construction are in fact already deployed and ready to use. Due to this ability and to the full portability guaranteed by the underlying programming language, the deployment of a prototype is straightforward. Furthermore, any system built with GAME can be serialized and reloaded with both XML and the Java-integrated serialization API.

Figure 6: Typical screenshot of the graphical user interface of GAME

4.1 Expert Interaction

A notable characteristic of GAME is the native support for handling the interaction among experts, which is carried out using has-a relationships. In particular, an expert may belong to one of the following categories (see the code sketch below):

ground expert: A ground expert is an independent expert, able to output its classification or prediction without resorting to any other expert. Among the ground experts, let us particularly recall (i) learners, (ii) wrappers, and (iii) ad-hoc experts. Learners, which represent the core of GAME, are concerned with the adoption of well-known machine learning techniques or strategies. Available supervised techniques include ANNs, SVMs, and Bayes classifiers. Principal component analysis is also available as an unsupervised technique. Wrappers allow the embedding of external classifiers or predictors, including available web services or external programs that one wants to use. Ad-hoc experts allow the implementation of specific (hand-crafted) behaviors.

refiner: Refiners are the technological solution adopted in GAME for implementing sequential combinations of experts (i.e., pipelines). A refiner is an expert that can embed another expert. A pipeline of experts can be easily created by repeatedly applying refinement. Once generated, a pipeline becomes a compound expert entrusted with processing the available data that flow along the pipeline. Each expert in the pipeline, while processing the information output by the previous expert, can optionally be supplied with additional information.

combiner: Combiners are the technological solution adopted in GAME for implementing parallel combinations of experts. A combiner is an expert that can embed other experts. An ensemble of experts can be easily created using a combiner. Once generated, an ensemble can be considered a compound expert in which the combiner collects and processes the output of the other experts. Different combination policies are available, including (weighted) averaging, (weighted) majority voting, stacking, and mixture of experts.

Figure 7: GAME experts: is-a and has-a view

Figure 7 reports the is-a and has-a views on GAME experts, and Figure 8 shows the internals of a learner, in which input representation, output representation, and the embedded classifier or predictor are highlighted. The output encoder module in a learning system has a twofold nature. During the learning phase, the output encoder has an active role, providing the encoded data to be used as learning samples from the structure annotations. In the prediction phase, the process is reversed: a specular decoder is needed to obtain the predicted annotations from the learner output according to some criterion; usually, the MAP criterion is adopted to select the most suitable annotation. Considering the strict interdependence between output encoder and decoder, they can be considered the same module.

Figure 8: Internals of a learner
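The three expert categories and their has-a relations might look as follows in code; this is a hypothetical Java sketch of the taxonomy, not GAME's actual API:

import java.util.List;

// Hypothetical sketch: ground experts are leaves, refiners wrap one expert
// (sequential combination), combiners wrap many (parallel combination), so
// that any system forms a tree of has-a relationships.
interface Expert {
    double[][] process(double[][] data);
}

final class GroundExpert implements Expert {
    public double[][] process(double[][] data) {
        return data;  // placeholder for a learner, wrapper, or ad-hoc expert
    }
}

final class Refiner implements Expert {
    private final Expert inner;  // has-a: exactly one embedded expert
    Refiner(Expert inner) { this.inner = inner; }

    public double[][] process(double[][] data) {
        return refine(inner.process(data));  // optionally merged with extra inputs
    }
    private double[][] refine(double[][] d) { return d; }  // placeholder refinement
}

final class Combiner implements Expert {
    private final List<Expert> members;  // has-a: several embedded experts
    Combiner(List<Expert> members) { this.members = members; }

    public double[][] process(double[][] data) {
        double[][] avg = null;  // simple averaging; voting or stacking are alternatives
        for (Expert member : members) {
            double[][] out = member.process(data);
            if (avg == null) avg = new double[out.length][out[0].length];
            for (int i = 0; i < out.length; i++)
                for (int k = 0; k < out[i].length; k++)
                    avg[i][k] += out[i][k] / members.size();
        }
        return avg;
    }
}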

Combiners, refiners, and ground experts make it possible to devise architectural solutions for any problem at hand. A specific architecture can always be represented by a corresponding tree, which highlights the existing has-a relationships.

5 GAMESSP2

GAMESSP2 is a realization of the SSP2 abstract architecture implemented with GAME. This section provides the details of the GAMESSP2 predictor, together with the results of the pipelined experts, compliant with the SSP2 architecture, that have been tested. The feasible variations are virtually infinite; thus, we concentrated our experiments on a subset deemed significant for performance assessment. In particular, we generated different pipelines by changing the number of unfolding steps (two or three), the input encoding (PSSM, SLB, or none), and the length of the output encoding windows.

5.1 Implementation

Implementing with GAME a system compliant with the SSP2 abstract architecture is straightforward. A ground expert is used for the first level of a pipeline, and refiners are used for the next levels. Variations on this reference structure are obtained by changing the values of the relevant parameters and/or the selected encoding methods. Parallel combination (where applied) is performed through a jury decision.

5.1.1 Testing Issues

Tests were performed on two main datasets. The former tests aimed to assess the performance of different SSP2 unfoldings; they were performed on the SD576 dataset (Yao et al., 2008), based on the SCOP database definitions. The same 7-fold assessment proposed in the cited paper has been used, to allow a fair comparison with the DBNN technique, which has proven to be very effective in the field of SSP. The latter tests were performed on the EVA common 6 dataset (EVAC6), consisting of 212 proteins. The training set used in these experiments, i.e., EVAtrain, is composed of proteins with a sequence identity lower than 25% with respect to any EVAC6 protein. Each system under testing was evaluated by $Q_3$, SOV, and MCCs, as defined in Section 2.1.

5.1.2 Presentation of Results

As in the EVA contest, the average-by-chain is used by default to present the results. As for MCCs, we adopt the more accurate average-by-residue in the SD576 tests.

5.1.3 Search and Alignment Issues

PSI-BLAST was selected as the search and alignment tool for its ability to perform rapid and efficient searches of large databases, as well as for its ability to output both PSSMs and raw frequency counts (the latter being used by the SLB encoding). Jones's guidelines, extracted from the original PSIpred paper and from an inspection of the current implementation of the system available online, have mostly been followed for the PSI-BLAST configuration. The inclusion threshold and the number of iterations are perhaps the most important parameters of PSI-BLAST that directly control its behavior. The inclusion threshold (expressed as an e-value, i.e., the significance of the hit versus a random one) determines the likelihood of a candidate sequence being a true homologue, thus directly affecting the decision of whether to include it in the current PSSM. After several calibration trials, PSI-BLAST is run on the uniref90 data repository (filtered with Jones's PFILT program (Jones, 1994) to remove transmembrane and simple-sequence regions), with the inclusion threshold set to $10^{-3}$ and the number of iterations set to 3. The default values proposed by the NCBI stand-alone version of blastpgp are used for the remaining parameters.

5.1.4 Training Technique and Parameter Setting

For the MLPs used to train each expert, a variant of the backpropagation algorithm has been devised and adopted, with the momentum set to 0.1; starting from its initial value, the learning rate is adjusted between iterations according to an inverse-proportion law. With the goal of improving the ability of the training algorithm to escape from local minima, the training set is randomly shuffled at each iteration, so that the learning algorithm is fed with the same set of proteins given in a different order. Furthermore, each protein provides only a subset of its inputs, according to a random choice controlled by a specific parameter n: a random value k is drawn from the range [0, n-1], and the inputs with index k, k+n, k+2n, and so on are then provided to the learning algorithm. To prevent the training process from stopping at a local oscillation of accuracy (evaluated on the validation set), the weights are recorded whenever a minimum on the validation set is encountered, but training continues until the error on the validation set exceeds a dynamic threshold that decreases as the iterations proceed.

5.2 Experimental Results

The actual GAMESSP2 system was devised after a preliminary phase devoted to experimenting with different kinds of pipelines. Alternative pipelines were tested on the SD576 dataset, and then four instances of the best-performing pipelines were generated to produce the final GAMESSP2 system. Benchmarks were then performed with GAMESSP2 on the EVAC6 dataset.

5.2.1 SD576 Experiments (aimed at identifying the best-performing pipelines)

Different pipelines (Figure 9) have been tested and compared. In this setting, the relevant parameters are the type of encoding (PSSM, SLB, or none)[5] and the length of the output window (from 1 to 11). To avoid misunderstandings about which expert a window length refers to, by convention the parameters j and k will be used hereinafter to denote the length of the output window at the first and the second level of a pipeline, respectively (the output window of the third level has length 1 in all experiments). As Figure 9 shows, two kinds of parameters are considered: the encoding method and the length of the output window. Experimental results are reported according to the selected encoding method(s), while varying the length of the output window(s). Four different kinds of pipelines have been tested:

two-level pipelines: a) ENC = <PSSM, none> or b) ENC = <SLB, none>;

three-level pipelines: a) ENC = <PSSM, none, none> or b) ENC = <PSSM, SLB, none>.

Tables 1 and 2 report the results obtained by running implementations of the above pipelines (two-level and three-level pipelines, respectively) on the SD576 dataset, while varying the length of the output windows. In each table, the best experimental results for $Q_3$ and SOV are highlighted in bold. It should be noted that the performance of the SLB encoding worsens as the output windows are enlarged.
For this reason, it was not used as a primary encoding method.

[5] Specifying the value none as the encoding method of a refiner means that the refiner uses only the inputs provided by the previous expert in the pipeline.
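The per-iteration shuffling and strided subsampling described in Section 5.1.4 can be sketched as follows (our own illustration; GAME's actual implementation may differ):

import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

// Illustration of the epoch scheme of Section 5.1.4: the protein order is
// reshuffled at every iteration, and each protein contributes only a strided
// subset of its residues, with stride n and a random offset k in [0, n-1].
public final class EpochSampler {
    private final Random rng = new Random();

    /** Shuffles the protein order in place at the start of each epoch. */
    <T> void shuffleProteins(List<T> trainingSet) {
        Collections.shuffle(trainingSet, rng);
    }

    /** Returns the residue indices of one protein selected for this epoch (n > 0). */
    List<Integer> sampleResidues(int proteinLength, int n) {
        int k = rng.nextInt(n);  // random offset k in [0, n-1]
        List<Integer> picked = new ArrayList<>();
        for (int i = k; i < proteinLength; i += n) picked.add(i);  // k, k+n, k+2n, ...
        return picked;
    }
}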