Limits of homology detection by pairwise sequence comparison


BIOINFORMATICS Vol. 17

Rainer Spang 1,2 and Martin Vingron 1

1 Deutsches Krebsforschungszentrum, Theoretische Bioinformatik, Im Neuenheimer Feld 280, Heidelberg, Germany

2 To whom correspondence should be addressed. Present address: Duke University, Institute of Statistics and Decision Sciences, Box Duke University, Durham, NC , USA.

Received on August 7, 2000; revised on December 18, 2000; accepted on December 21, 2000

ABSTRACT

Motivation: Noise in database searches resulting from random sequence similarities increases as the databases expand rapidly. The noise problems are not a technical shortcoming of the database search programs, but a logical consequence of the idea of homology searches. The effect can be observed in simulation experiments.

Results: We have investigated noise levels in pairwise alignment based database searches. The noise levels of 38 releases of the SwissProt database display perfect logarithmic growth with the total length of the databases. Clustering of real biological sequences reduces noise levels, but the effect is marginal.

Contact: rainer@stat.duke.edu; m.vingron@dkfz-heidelberg.de

INTRODUCTION

The current worldwide efforts in determining the DNA sequence of various organisms have led to vast amounts of both nucleotide and amino acid sequence data. Sequence databases already contain up to several hundred megabytes of pure sequence data, and they are expanding very rapidly. It is foreseeable that the size of databases will multiply in the near future.

Similarity between molecular sequences is most commonly expressed in terms of alignment scores. To capture possible local similarities, alignment algorithms seek local alignments, which associate only segments of the two sequences. The most widely used method for calculating optimal local alignments is the dynamic programming algorithm developed by Smith and Waterman (1981). For database searches, faster programs are available which approximate the optimal Smith-Waterman alignment. Among these are the classical programs BLAST (Altschul et al., 1990, 1997) and FASTA (Lipman and Pearson, 1985; Pearson and Lipman, 1988).

Although alignment algorithms are designed to detect conserved regions, they also produce alignments and scores when applied to sequences that are not related at all. In molecular database searches, most comparisons are comparisons between unrelated sequences, and random similarities are frequently observed that score higher than those arising from distant relationships. On average, chance alignment scores are smaller than those resulting from related sequences. However, since thousands of chance alignments are computed in a database search, the highest random score can exceed the level of scores that arise from distant evolutionary relationships. Therefore, the reliability of an alignment score depends on the size of the database. These random similarities are a major obstacle in molecular database searches. For simplicity we refer to them as noise.

Homology search is based on the detection of conserved segments in the sequences. For remotely related proteins, conservation is weak and normally restricted to short segments of the sequences. One often observes that there is no more than 30% identity on segments that are restricted to residues. Consider the following thought experiment: given a real sequence, we construct a second sequence by randomly replacing 70% of the residues of the first sequence by different residues. In this context of random mutations, one would not expect that the two sequences, which are 30% identical, have common structural or functional properties. However, experience shows that most genuine sequences with 30% identical positions share structural and functional properties.
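
The thought experiment above can be made concrete with a few lines of Python. This is only an illustrative sketch: the uniform 20-letter alphabet, the sequence length of 100 and the helper names `mutate` and `identity` are assumptions made here, not part of the paper.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def mutate(sequence, fraction, rng):
    """Replace a fixed fraction of positions by a *different* residue."""
    residues = list(sequence)
    n_changes = round(fraction * len(residues))
    # Choose distinct positions so that exactly n_changes residues differ.
    for pos in rng.sample(range(len(residues)), n_changes):
        alternatives = [a for a in AMINO_ACIDS if a != residues[pos]]
        residues[pos] = rng.choice(alternatives)
    return "".join(residues)


def identity(a, b):
    """Fraction of identical positions between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


rng = random.Random(0)
# Hypothetical "real" sequence; a uniform residue distribution is assumed.
original = "".join(rng.choice(AMINO_ACIDS) for _ in range(100))
mutated = mutate(original, 0.70, rng)
print(f"identity after mutation: {identity(original, mutated):.2f}")
```

The mutated sequence is 30% identical to the original by construction, yet nothing about the replacement process preserves structure or function.
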
The paradox raised by this thought experiment can be resolved statistically: the probability that a database of sequences contains two sequences with more than 30% identity is very small if the similarity results only from a chance event. Hence, for those pairs of sequences from the database with 30% identity or more, it is likely that this similarity is not by chance, but the result of a common evolutionary history of the two sequences. This common history then indicates common structural and functional features of the sequences. Our view of homology searches is that they are essentially based on this statistical rationale: if a similarity between two sequences is very unlikely to occur by chance, it is concluded that it has an evolutionary basis. However, if one performs more comparisons, the chance of finding a high scoring random alignment increases. Consequently, the reliability of a sequence similarity depends not only on the two sequences themselves, but also on the amount of data that was searched before the similarity was detected. Scores resulting from genuine evolutionary relationships obviously remain constant, while their credibility decreases as the database grows. Since homology searches are based on a statistical rationale, noise is an inherent problem.

© Oxford University Press 2001

RESULTS

The growth of noise is easily observed in simulation experiments. One thousand random sequences were generated. By random sequences we mean independent, identically distributed (i.i.d.) random sequences, where each position was drawn according to an average amino acid distribution. All sequences are 350 residues long, which is about the average length of a protein. These sequences were used to search the SwissProt database (Bairoch and Apweiler, 1999). Searches were done using the Smith-Waterman algorithm with gap costs of 15 for opening a gap and 3 for extending it. The PAM250 matrix was used as the score matrix (Dayhoff et al., 1978). For each of the thousand searches, the maximal score was sampled. The experiment was repeated for 38 past releases of the SwissProt database, starting with Rel. 1 and ending with Rel. 38.

We measured noise levels for each search. Noise levels were represented by the location parameters of extreme value distributions. All our samples obeyed extreme value distributions. This means that the samples were distributed as $\theta G + \xi$, where $G$ is a random variable satisfying $\mathrm{Prob}[G < t] = \exp(-\exp(-t))$. The scale $\theta$ of the distributions was almost constant, whereas the location $\xi$ depended on the size of the databases.
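
To illustrate what a sample of the form $\theta G + \xi$ looks like, the following sketch draws such scores by inverting $\mathrm{Prob}[G < t] = \exp(-\exp(-t))$ and recovers location and scale with a simple method-of-moments fit. The moment fit is a stand-in chosen here for brevity and is not claimed to be the estimation procedure used in the paper; the parameter values 80.0 and 5.0 and all function names are hypothetical.

```python
import math
import random

EULER_GAMMA = 0.5772156649015329  # Euler's constant


def sample_gumbel(rng):
    """Draw G with Prob[G < t] = exp(-exp(-t)) by inverse transform."""
    u = rng.random()
    while u == 0.0:  # avoid log(0); rng.random() lies in [0, 1)
        u = rng.random()
    return -math.log(-math.log(u))


def sample_scores(n, location, scale, rng):
    """Simulate n maximal scores distributed as scale * G + location."""
    return [scale * sample_gumbel(rng) + location for _ in range(n)]


def fit_by_moments(scores):
    """Estimate (location, scale) from the sample mean and variance.

    Uses Var = (pi * scale)^2 / 6 and Mean = location + EULER_GAMMA * scale,
    both standard properties of this extreme value distribution.
    """
    n = len(scores)
    mean = sum(scores) / n
    variance = sum((s - mean) ** 2 for s in scores) / (n - 1)
    scale = math.sqrt(6.0 * variance) / math.pi
    location = mean - EULER_GAMMA * scale
    return location, scale


rng = random.Random(1)
scores = sample_scores(5000, location=80.0, scale=5.0, rng=rng)
location_hat, scale_hat = fit_by_moments(scores)
print(f"estimated location {location_hat:.2f}, estimated scale {scale_hat:.2f}")
```

With a few thousand simulated scores the estimates land close to the values used for sampling, which is the sense in which a location parameter can serve as a summary of the noise level.
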
The location $\xi$ is related to the expected maximal random score $E$ by $E = \xi + c$, where the constant $c = \theta\gamma$ is the product of the scale $\theta$ and Euler's constant $\gamma$. This is in agreement with results from purely mathematical models (Arratia et al., 1986; Karlin and Altschul, 1990; Dembo et al., 1994b). The location parameter for each experiment was estimated using the direct estimation procedure (Waterman and Vingron, 1994).

In Figure 1, the estimated noise levels are plotted versus the logarithm of the total length of the database. By the total length of the database we mean the sum of the lengths of all sequences in the database. One can observe a significant increase of the noise from 78.4 score points in 1986 to 96.0 today. In this range of scores, we can find many pairs of remotely related proteins. On the right, the score levels of an arbitrarily chosen set of examples are marked. They illustrate how the noise catches up with real homologies.

[Figure 1: location parameter of the noise distribution (vertical axis) versus logarithm of the total number of residues in the database (horizontal axis), with individual releases labelled and, on the right, Example 1 (Cytochrome), Example 2 (Kinase), Example 3 (Death Domain), Example 4 (Fos) and Example 5 (Zinc finger).]

Fig. 1. The noise levels of the SwissProt releases (1-38) are plotted versus the logarithm of the total length of the database. On the right, the score levels of five pairs of distantly related sequences are shown. All pairs share a weakly conserved domain according to the annotations in the Pfam database (18). Example 1, cytochrome c: CYC_HUMAN and C55L_SYNY3. Example 2, kinases: ADK_HUMAN (Adenosine Kinase) and SCRK_ECOLI (Fructokinase). Example 3, proteins that contain a death domain: FASA_HUMAN (Apoptosis Mediating Surface Antigen FAS) and RAID_HUMAN (Caspase & RIP Adaptor with Death Domain). Example 4, bZIP transcription factors: FOS_CHICK (P55 C-FOS Proto-Oncogene Protein) and CREB_RAT (cAMP Response Element Binding Protein). Example 5, zinc fingers: PEXA_HUMAN (Peroxisome Assembly Protein PEX10) and ME18_HUMAN (DNA Binding Protein MEL-18).

Noise reached the score level of the two cytochromes in 1987 and that of the kinases one year later, and it then overtook the other three real homologies.

The second important observation is that the data display an almost perfect straight line. This implies that whenever we double the size of a database, random noise increases by a constant value. This observation is consistent with results which can be obtained from purely probabilistic models for sequence matching. Mathematically, alignment problems fit into the context of the Erdös-Rényi law (Erdös and Rényi, 1970). For alignments without gaps, it has been shown that $H_{nm}/\log(nm) \to \mathrm{const}$ with probability one, where $H_{nm}$ is the score obtained from two random sequences of length $n$ and $m$, respectively (Dembo et al., 1994a). There is no complete proof of the same result for alignments with gaps, but it is conjectured to hold in these settings as well (Waterman and Vingron, 1994). If one interprets the query as one sequence and the concatenation of all database entries as the other sequence, this theory also predicts the logarithmic growth that we have observed.

However, the absolute level of noise in simulations with real sequences differs from that predicted by the analytical theory (Spang and Vingron, 1998). The theoretical results are based on the assumption that the sequences are i.i.d. This is the standard model for unrelated sequences. However, biological sequences are different. An important difference is that biological sequences cluster strongly (Krause and Vingron, 1998). From this perspective it comes as a surprise that, in spite of the clustering, we still observe an almost perfect logarithmic growth of noise. The following simulation experiment shows that clustering slightly reduces noise levels, but the effect is marginal.

The Dayhoff model of sequence evolution (Dayhoff et al., 1978) allows for the generation of sequence pairs at a given PAM distance. For a given distance $t$, 100 triples of sequences $(Q_i, S_i^1, S_i^2)$ were generated. All sequences have length 350. A triple consists of a cluster $(S_i^1, S_i^2)$ of sequences that are $t$ PAM apart and a query $Q_i$, which is independent of both $S_i^1$ and $S_i^2$. The alignment scores $H^1(t)_i$ and $H^2(t)_i$ of $Q_i$ to $S_i^1$ and $S_i^2$ were sampled. The experiment was repeated for PAM distances of $t$ = 1, 5, 10, 15, 40, 60, 80, 200, 300, 400, 500. If $S^1$ and $S^2$ are identical ($t = 0$), we also have $H^1(0) = H^2(0)$. If they differ in only a few positions, $H^1(t)$ and $H^2(t)$ are not equal; however, they are strongly correlated. Clearly, this correlation decreases as $S^1$ and $S^2$ become more distant from each other. Figure 2 shows a plot of the PAM distance versus the correlation of scores. One can observe a fast decrease of the correlation with growing PAM distance. For illustration, an exponential curve is fitted to the data.

In a second step the samples $H^1(t)_i$ and $\max(H^1(t)_i, H^2(t)_i)$ are compared. $H^1(t)_i$ is a sample of alignment scores stemming from independent i.i.d. sequences of length 350 each.
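
The triple experiment can be mimicked on a small scale. The sketch below is a simplified stand-in rather than the authors' setup: a flat match/mismatch score replaces the PAM250 matrix, position-wise resampling with probability `p` replaces evolution under the Dayhoff model, a gap of length k is assumed to cost `gap_open + k * gap_extend`, and the sequences are much shorter than 350 residues so that the pure Python code runs quickly. All function names are hypothetical.

```python
import math
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def local_score(a, b, match=5, mismatch=-4, gap_open=15, gap_extend=3):
    """Best local alignment score with affine gaps (score only, no traceback)."""
    neg_inf = float("-inf")
    prev_h = [0] * (len(b) + 1)        # best scores ending at (i-1, j)
    prev_f = [neg_inf] * (len(b) + 1)  # ... ending in a gap that consumes a
    best = 0
    for i in range(1, len(a) + 1):
        cur_h = [0] * (len(b) + 1)
        cur_f = [neg_inf] * (len(b) + 1)
        e = neg_inf  # best score ending in a gap that consumes b
        for j in range(1, len(b) + 1):
            e = max(e - gap_extend, cur_h[j - 1] - gap_open - gap_extend)
            cur_f[j] = max(prev_f[j] - gap_extend,
                           prev_h[j] - gap_open - gap_extend)
            diag = prev_h[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            cur_h[j] = max(0, diag, e, cur_f[j])
            best = max(best, cur_h[j])
        prev_h, prev_f = cur_h, cur_f
    return best


def random_sequence(length, rng):
    return "".join(rng.choice(AMINO_ACIDS) for _ in range(length))


def perturb(sequence, p, rng):
    """Resample each position with probability p (crude proxy for PAM distance)."""
    return "".join(rng.choice(AMINO_ACIDS) if rng.random() < p else c
                   for c in sequence)


def score_pairs(p, n_triples, length, rng):
    """Scores of an independent query against both members of each cluster."""
    h1, h2 = [], []
    for _ in range(n_triples):
        query = random_sequence(length, rng)
        s1 = random_sequence(length, rng)
        s2 = perturb(s1, p, rng)
        h1.append(local_score(query, s1))
        h2.append(local_score(query, s2))
    return h1, h2


def pearson(xs, ys):
    """Pearson correlation coefficient; NaN if either sample is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    if sx == 0 or sy == 0:
        return float("nan")
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (sx * sy)


rng = random.Random(0)
for p in (0.0, 0.2, 1.0):
    h1, h2 = score_pairs(p, n_triples=20, length=50, rng=rng)
    print(f"p = {p:.1f}: correlation of scores = {pearson(h1, h2):.2f}")
```

For `p = 0.0` the two cluster members are identical, so both score samples coincide; larger values of `p` let the scores drift apart, which is the qualitative behaviour the experiment measures.
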
Because these sequences are independent and i.i.d., the sample $H^1(t)_i$ is extreme value distributed. Its location parameter $\xi_1$ is estimated using direct estimation. For very large PAM distances, $\max(H^1(t)_i, H^2(t)_i)$ can be interpreted as a sample of alignment scores between independent i.i.d. sequences of length 350 and 700. Hence, it is also extreme value distributed. Let $\xi_{\max}$ denote its estimated location parameter. If $S^1$ and $S^2$ are independent, the increase of search space yields a noise location $\xi_{\max}$ that is shifted by about 2.1 score units relative to $\xi_1$. For smaller PAM distances, this shift is reduced, due to the correlation of $H^1(t)$ and $H^2(t)$. In Figure 3, the PAM distance $t$ is plotted versus $\xi_{\max} - \xi_1$. The final shift value is already reached for evolutionary distances as small as 40 PAM. Both plots indicate that cluster effects play a minor role in the determination of noise levels in database searches, since moderately related sequences already induce the same amount of noise as uncorrelated sequences. This also explains why clustering does not bring the growth of noise to a stop.

[Figure 2: correlation coefficient (vertical axis) versus PAM distance (horizontal axis).]

Fig. 2. The correlation coefficients of the score samples $H^1(t)_i$ and $H^2(t)_i$ are plotted against the PAM distance of the sequences $S^1$ and $S^2$. An exponential curve of the form $t \mapsto \exp(-\beta t)$ is fitted to the data to visualize the rapid decrease of the correlation coefficients.

[Figure 3: location shift (vertical axis) versus PAM distance (horizontal axis).]

Fig. 3. The location shift between the samples $H^1(t)_i$ and $\max(H^1(t)_i, H^2(t)_i)$ is plotted versus the PAM distance of $S^1$ and $S^2$. The dashed line is of the form t α 1/(β t).

DISCUSSION

We have demonstrated, using 38 releases of the SwissProt database as an example, that noise levels in database searches grow at a logarithmic rate with the size of the database. While the clustering of real biological sequence data can theoretically reduce noise levels, we have shown by simulation experiments that this effect is of no practical importance.

Database search programs like BLAST and FASTA provide E-values. In its simplest form the E-value is just the p-value $p(t)$ of a score level $t$, multiplied by the number $D$ of sequences in the database (Altschul et al., 1994). Clearly, E-values are sensitive to the loss of credibility associated with expanding databases. Essentially the same results are obtained if one interprets a database search as a multiple hypothesis test and uses the Bonferroni correction. In order to have a significance level $\alpha$ for the database search, one needs significance levels of $\alpha/D$ for each comparison. In Spang and Vingron (1998), we suggest that this correction might be too conservative and we recommend the use of p-values based on the effective length of a database for more accurate adjustments.

The simulation results in this paper underline the importance of adjusting p-values for multiple comparisons. In fact, the correction in molecular biology should account not only for the database size, but for all queries made against a database over time. If we only correct for database size to restore a 0.05 probability of a false positive for a single search, then we still have 1 out of 20 searches leading to a wrong answer. This becomes important in the context of inter- or intra-genome comparisons.

Our analysis is purely based on random similarities. This approach is also the rationale of the E-values known from BLAST and FASTA. Its appeal is that there is no need to fit a model of related sequences, which would require the difficult choice of a representative sample of homologous sequences. However, this might also be seen as a shortcoming, since inference is restricted to probabilities of false positive hits.
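
The simplest form of the E-value and the Bonferroni adjustment described in this Discussion can be written down directly. The sketch below assumes that single-comparison scores follow the extreme value form $\theta G + \xi$ used in the Results; the numerical values of score, location, scale and database size are hypothetical and chosen only to show the effect of database growth.

```python
import math


def gumbel_p_value(score, location, scale):
    """Prob[S >= score] for S = scale * G + location,
    where Prob[G < t] = exp(-exp(-t))."""
    z = (score - location) / scale
    # 1 - exp(-exp(-z)), written with expm1 for accuracy at small p-values.
    return -math.expm1(-math.exp(-z))


def e_value(p_value, n_sequences):
    """Simplest form of the E-value: p-value times database size D."""
    return p_value * n_sequences


def bonferroni_level(alpha, n_sequences):
    """Per-comparison significance level needed for an overall level alpha."""
    return alpha / n_sequences


# Hypothetical per-comparison noise parameters and observed score.
p = gumbel_p_value(score=100.0, location=30.0, scale=5.0)
for d in (10_000, 1_000_000):
    significant = p <= bonferroni_level(0.05, d)
    print(f"D = {d}: E-value = {e_value(p, d):.3g}, significant: {significant}")
```

With these made-up numbers the same score passes the Bonferroni threshold for the smaller database and fails it for the larger one, which is the loss of credibility the text describes.
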
The problem of false negatives cannot be addressed by an analysis of random similarities alone. For a comparative evaluation of database search algorithms (Pearson, 1995; Brenner et al., 1998; Park et al., 1998), both sensitivity and selectivity are important. Consequently, random sequence similarity and the evaluation of algorithms for homology detection are different statistical problems, requiring different methodology.

What can be done about the effect of growing noise? Our simulation experiments are restricted to homology search by local alignment. While this is the most commonly used algorithm for database searching, there are alternative strategies. Iterative methods such as intermediate sequence searches (Park et al., 1997), Psi-BLAST (Altschul et al., 1997), SYSTERS (Krause and Vingron, 1998) or SAM-T98 (Karplus et al., 1998) start with an initial standard database search. The findings of the first round of search are then used to construct a protein family specific scoring scheme for a second search. This updating strategy is iterated. A crucial difference from non-iterative search methods is that the scoring scheme changes when databases expand. Clearly, the growth of noise due to random sequence similarity is also a problem for these methods. However, it may be compensated by improved scoring schemes in larger databases. Iterative search methods have been shown to be successful (Park et al., 1998). As databases expand, noise grows and scoring schemes improve.

One might also consider building smaller databases that contain only some representative sequences for each protein family. Noise levels in these databases would be lower. On the other hand, real signals would be weaker too. These tradeoffs have not yet been discussed in the literature. Such a discussion cannot be based on the methods used in this paper, since it must address both sensitivity and selectivity issues.

It is a consequence of the Borel-Cantelli Lemma that infinite i.i.d. random sequences contain a perfect match to any given word of finite length with probability 1. The Erdös-Rényi law (Erdös and Rényi, 1970) tells us how long we need to wait for this perfect match. The results in Dembo et al. (1994a) generalize this result to the case of approximate matches represented by high scoring local alignments without gaps. Improving the scoring scheme by alignments with gaps (Smith and Waterman, 1981), profiles (Gribskov et al., 1987; Luthy et al., 1994), templates (Taylor, 1986), hidden Markov models (Krogh et al., 1994; Eddy et al., 1995) or jumping alignments (Spang et al., 2000) means changing the measure of similarity. While this strategy often improves database search results, it clearly cannot rule out that there are unrelated sequences in the database that score higher than remote members of the family. The problem of random sequence similarities is a more general one. On the one hand, sequence conservation is weak for distantly related sequences, and on the other hand, large datasets are likely to contain sequences that in fact display a stronger similarity to a given sequence although they are not related.

ACKNOWLEDGEMENTS

We thank two anonymous referees for very helpful comments on an earlier version of this paper.

REFERENCES

Altschul,S.F., Boguski,M.S., Gish,W. and Wootton,J.C. (1994) Issues in searching molecular sequence databases. Nature Genet., 6,

Altschul,S.F., Gish,W., Miller,W., Myers,E.W. and Lipman,D.J. (1990) Basic local alignment search tool. J. Mol. Biol., 215,

Altschul,S.F., Madden,T.L., Schaffer,A.A., Zhang,J., Zhang,Z., Miller,W. and Lipman,D.J. (1997) Gapped BLAST and Psi-BLAST: a new generation of protein database search programs. Nucleic Acids Res., 25,

Arratia,R., Gordon,L. and Waterman,M.S. (1986) An extreme value theory for sequence matching. Ann. Stat., 14,

Bairoch,A. and Apweiler,R. (1999) The SwissProt protein sequence data bank and its supplement TREMBL in Nucleic Acids Res., 27,

Brenner,S.E., Chothia,C. and Hubbard,T.J.P. (1998) Assessing sequence comparison methods with reliable structurally identified distant evolutionary relationships. Proc. Natl Acad. Sci. USA, 95,

Dayhoff,M., Schwartz,R. and Orcutt,B. (1978) A model of evolutionary change in protein. Atlas of Protein Sequences and Structure, 5.

Dembo,A., Karlin,S. and Zeitouni,O. (1994a) Critical phenomena for sequence matching with scoring. Ann. Prob., 22,

Dembo,A., Karlin,S. and Zeitouni,O. (1994b) Limit distribution of maximal non-aligned two-sequence segmental score. Ann. Prob., 22,

Eddy,S., Mitchison,G. and Durbin,R. (1995) Maximum discrimination hidden Markov models of sequence consensus. J. Comp. Biol., 2,

Erdös,P. and Rényi,A. (1970) On a law of large numbers. J. Anal. Math., 22,

Gribskov,M., McLachlan,A. and Eisenberg,D. (1987) Profile analysis: detection of distantly related proteins. Proc. Natl Acad. Sci. USA, 84,

Karlin,S. and Altschul,S.F. (1990) Methods for assessing the statistical significance of molecular sequence features by using general scoring schemes. Proc. Natl Acad. Sci. USA, 87,

Karplus,K., Barrett,C. and Hughey,R. (1998) Hidden Markov models for detecting remote protein homologies. Bioinformatics, 14,

Krause,A. and Vingron,M. (1998) A set-theoretic approach to database searching and clustering. Bioinformatics, 14,

Krogh,A., Brown,M., Mian,I., Sjølander,K. and Haussler,D. (1994) Hidden Markov models in computational biology. Applications to protein modelling. J. Mol. Biol., 235,

Lipman,D.J. and Pearson,W.R. (1985) Rapid and sensitive protein similarity searches. Science, 227,

Luthy,R., Xenarios,I. and Bucher,P. (1994) Improving the sensitivity of the sequence profile method. Protein Sci., 3,

Park,J., Teichmann,S.A., Hubbard,T. and Chothia,C. (1997) Intermediate sequences increase the detection of homology between sequences. J. Mol. Biol., 273,

Park,J., Karplus,K., Barrett,C., Hughey,R., Haussler,D., Hubbard,T. and Chothia,C. (1998) Sequence comparisons using multiple sequences detect three times as many remote homologues as pairwise methods. J. Mol. Biol., 284,

Pearson,W.R. (1995) Comparison of methods for searching protein databases. Protein Sci., 4,

Pearson,W.R. and Lipman,D.J. (1988) Improved tools for biological sequence comparison. Proc. Natl Acad. Sci. USA, 85,

Smith,T.F. and Waterman,M.S. (1981) The identification of common molecular subsequences. J. Mol. Biol., 147,

Spang,R. and Vingron,M. (1998) Statistics of large scale sequence searching. Bioinformatics, 14,

Spang,R., Rehmsmeier,M. and Stoye,S. (2000) Sequence database search using jumping alignments. In Proceedings of the Eighth International Conference on Intelligent Systems for Molecular Biology, pp

Taylor,W. (1986) Identification of protein sequence homology by consensus template alignment. J. Mol. Biol., 188,

Waterman,M.S. and Vingron,M. (1994) Sequence comparison significance and Poisson approximation. Stat. Sci., 9,
