Limits of homology detection by pairwise sequence comparison


BIOINFORMATICS Vol. 17

Rainer Spang 1,2 and Martin Vingron 1

1 Deutsches Krebsforschungszentrum, Theoretische Bioinformatik, Im Neuenheimer Feld 280, Heidelberg, Germany

Received on August 7, 2000; revised on December 18, 2000; accepted on December 21, 2000

ABSTRACT

Motivation: Noise in database searches resulting from random sequence similarities increases as the databases expand rapidly. The noise problems are not a technical shortcoming of the database search programs, but a logical consequence of the idea of homology searches. The effect can be observed in simulation experiments.

Results: We have investigated noise levels in database searches based on pairwise alignment. The noise levels of 38 releases of the SwissProt database display almost perfect logarithmic growth with the total length of the databases. Clustering of real biological sequences reduces noise levels, but the effect is marginal.

Contact: rainer@stat.duke.edu; m.vingron@dkfz-heidelberg.de

INTRODUCTION

The current worldwide efforts in determining the DNA sequence of various organisms have led to vast amounts of both nucleotide and amino acid sequence data. Sequence databases already contain up to several hundred megabytes of pure sequence data, and they are expanding very rapidly. It is foreseeable that the size of databases will multiply in the near future.

Similarity between molecular sequences is most commonly expressed in terms of alignment scores. Because similarities may be only local, alignment algorithms seek local alignments, which associate only segments of the two sequences. The currently most widely used method for calculating optimal local alignments is a dynamic programming algorithm developed by Smith and Waterman (1981). For the purpose of database searches, faster programs are available which approximate the optimal Smith-Waterman alignment.
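The local alignment recursion of Smith and Waterman mentioned above can be sketched in a few lines. This is a minimal illustration only: the match/mismatch values, the linear gap penalty and the function name `smith_waterman_score` are assumptions chosen for brevity, not the scoring scheme used in this paper.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Return the best local alignment score of strings a and b.

    Assumed scoring (illustrative): +match for identical residues,
    mismatch otherwise, and a linear penalty `gap` per gapped position.
    """
    prev = [0] * (len(b) + 1)  # previous row of the dynamic programming matrix
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            curr[j] = max(
                0,                  # start a new local alignment here
                prev[j - 1] + sub,  # align a[i-1] with b[j-1]
                prev[j] + gap,      # gap in b
                curr[j - 1] + gap,  # gap in a
            )
            best = max(best, curr[j])
        prev = curr
    return best


if __name__ == "__main__":
    # The shared segment ACGT is found although the flanks are unrelated.
    print(smith_waterman_score("GGGACGTCCC", "TTTACGTAAA"))
```

The zero in the maximum is what makes the alignment local: a segment pair is never penalised for poorly matching flanking regions, which is also why unrelated sequences still receive a positive score.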
Among these faster programs are the classical BLAST (Altschul et al., 1990, 1997) and FASTA (Lipman and Pearson, 1985; Pearson and Lipman, 1988).

2 To whom correspondence should be addressed. Present address: Duke University, Institute of Statistics and Decision Sciences, Durham, NC, USA.

Although alignment algorithms are designed to detect conserved regions, they also produce alignments and scores when applied to sequences that are not related at all. In molecular database searches, most comparisons are comparisons between unrelated sequences, and random similarities are frequently observed that score higher than those arising from distant relationships. On average, chance alignment scores are smaller than those resulting from related sequences. However, since thousands of chance alignments are computed in a database search, the highest random score can exceed the level of scores that arise from distant evolutionary relationships. Therefore, the reliability of an alignment score depends on the size of the database. These random similarities are a major obstacle in molecular database searches. For simplicity we refer to them as noise.

Homology search is based on the detection of conserved segments in the sequences. For remotely related proteins, conservation is weak and normally restricted to short segments of the sequences. One often observes no more than 30% identity, and only on segments of limited length. Consider the following thought experiment: given a real sequence, we construct a second sequence by randomly replacing 70% of the residues of the first sequence by different residues. In this context of random mutations, one would not expect the two sequences, which are 30% identical, to have common structural or functional properties. However, experience shows that most genuine sequences with 30% identical positions share structural and functional properties.
This paradox can be resolved statistically: the probability that a database of sequences contains two sequences with more than 30% identity is very small if the similarity results only from a chance event. Hence, for those pairs of sequences from the database with 30% identity or more, it is likely that the similarity is not due to chance, but is the result of a common evolutionary history of the two sequences. This common history then indicates common structural and functional features of the sequences.

Our view of homology searches is that they are essentially based on this statistical rationale: if a similarity between two sequences is very unlikely to occur by chance, it is concluded that it has an evolutionary basis. However, if one performs more comparisons, the chance of finding a high scoring random alignment increases. Consequently, the reliability of a sequence similarity depends not only on the two sequences themselves, but also on the amount of data that was searched before the similarity was detected. Scores resulting from genuine evolutionary relationships obviously remain constant, while their credibility decreases as the database grows. Since homology searches are based on a statistical rationale, noise is an inherent problem.

© Oxford University Press 2001

RESULTS

The growth of noise is easily observed in simulation experiments. One thousand random sequences were generated. By random sequences we mean independent and identically distributed (i.i.d.) random sequences, where each position was drawn according to an average amino acid distribution. All sequences are 350 residues long, which is about the average length of a protein. These sequences were used to search the SwissProt database (Bairoch and Apweiler, 1999). Searches were done using the Smith-Waterman algorithm with gap costs of 15 for opening a gap and 3 for extending it. The PAM250 matrix was used as the score matrix (Dayhoff et al., 1978). For each of the thousand searches, the maximal score was sampled. The experiment was repeated for 38 past releases of the SwissProt database, starting with Rel. 1 and ending with Rel. 38.

We measured noise levels for each search. Noise levels were represented by the location parameters of extreme value distributions. All our samples obeyed extreme value distributions. This means that the samples were distributed as $\theta G + \xi$, where $G$ is a random variable satisfying $\mathrm{Prob}[G < t] = \exp(-\exp(-t))$. The scale $\theta$ of the distributions was almost constant, whereas the location $\xi$ depended on the size of the databases.
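The role of the location parameter can be made concrete with a short simulation. The sketch below relies on two standard properties of this distribution rather than on anything measured in the paper: the mean of $\theta G + \xi$ is $\xi + \theta\gamma$ with Euler's constant $\gamma$, and the maximum of $n$ independent draws is again of the same form with location $\xi + \theta \ln n$. The values of `xi` and `theta`, and all function names, are illustrative assumptions.

```python
import math
import random

EULER_GAMMA = 0.5772156649015329  # Euler's constant


def gumbel_cdf(t, xi, theta):
    """Prob[theta * G + xi < t] where Prob[G < t] = exp(-exp(-t))."""
    return math.exp(-math.exp(-(t - xi) / theta))


def sample_score(rng, xi, theta):
    """Draw one value of theta * G + xi by inverting the distribution function."""
    u = rng.random()
    while u == 0.0:  # avoid log(0); random() returns values in [0, 1)
        u = rng.random()
    return xi - theta * math.log(-math.log(u))


def location_of_maximum(xi, theta, n):
    """Location parameter of the maximum of n independent draws."""
    return xi + theta * math.log(n)


def mean_of_maxima(rng, xi, theta, n, trials):
    """Empirical mean of the maximum of n draws, averaged over `trials` repeats."""
    total = 0.0
    for _ in range(trials):
        total += max(sample_score(rng, xi, theta) for _ in range(n))
    return total / trials


if __name__ == "__main__":
    rng = random.Random(0)
    xi, theta = 80.0, 3.0  # illustrative values, not estimates from the paper
    for n in (1, 2, 4, 8):
        predicted = location_of_maximum(xi, theta, n) + theta * EULER_GAMMA
        print(n, round(mean_of_maxima(rng, xi, theta, n, 2000), 2), round(predicted, 2))
```

Under these assumptions each doubling of the number of independent comparisons raises the location by the same amount, $\theta \ln 2$, which is the kind of logarithmic dependence on search space that the experiment is designed to measure.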
The location $\xi$ is related to the expected maximal random score $E$ by $E = \xi + c$, where the constant $c = \theta\gamma$ is the product of the scale $\theta$ and Euler's constant $\gamma$. This is in agreement with results from purely mathematical models (Arratia et al., 1986; Karlin and Altschul, 1990; Dembo et al., 1994b). The location parameter for each experiment was estimated using the direct estimation procedure (Waterman and Vingron, 1994).

In Figure 1, the estimated noise levels are plotted versus the logarithm of the total length of the database. By the total length of the database we mean the sum of the lengths of all sequences in the database. One can observe a significant increase of the noise from 78.4 score points in 1986 to 96.0 today. In this range of scores, we can find many pairs of remotely related proteins. On the right, the score levels of an arbitrarily chosen set of examples are marked. They illustrate how the noise catches up with real homologies.

[Figure 1. Vertical axis: location parameter of the noise distribution. Horizontal axis: logarithm of the total number of residues in the database. Data points are labelled by SwissProt release. Marked on the right: Example 1 (Cytochrome), Example 2 (Kinase), Example 3 (Death Domain), Example 4 (Fos), Example 5 (Zinc finger).]

Fig. 1. The noise levels of the SwissProt releases (1 to 38) are plotted versus the logarithm of the total length of the database. On the right, the score levels of five pairs of distantly related sequences are shown. All pairs share a weakly conserved domain according to the annotations in the Pfam database (18). Example 1, cytochrome c: CYC_HUMAN and C55L_SYNY3. Example 2, kinases: ADK_HUMAN (Adenosine Kinase) and SCRK_ECOLI (Fructokinase). Example 3, proteins that contain a death domain: FASA_HUMAN (Apoptosis Mediating Surface Antigen FAS) and RAID_HUMAN (Caspase and RIP Adaptor with Death Domain). Example 4, bZIP transcription factors: FOS_CHICK (P55-c-FOS Proto-Oncogene Protein) and CREB_RAT (cAMP Response Element Binding Protein). Example 5, zinc fingers: PEXA_HUMAN (Peroxisome Assembly Protein PEX10) and ME18_HUMAN (DNA Binding Protein MEL-18).

Noise reached the score level of the two cytochromes in 1987 and that of the kinases one year later; it has since overtaken the other three real homologies.

The second important observation is that the data lie on an almost perfect straight line. This implies that whenever we double the size of a database, random noise increases by a constant value. This observation is consistent with results that can be obtained from purely probabilistic models for sequence matching. Mathematically, alignment problems fit into the context of the Erdös-Rényi law (Erdös and Rényi, 1970). For alignments without gaps, it has been shown that $H_{nm} / \log(nm) \to \mathrm{const}$ with probability one, where $H_{nm}$ is the score obtained from two random sequences of length $n$ and $m$, respectively (Dembo et al., 1994a). There is no complete proof of the same result for alignments with gaps, but it is conjectured to hold in that setting as well (Waterman and Vingron, 1994). If one interprets the query as one sequence and the concatenation of all database entries as the other sequence, this theory also predicts the logarithmic growth that we have observed. However, the absolute level of noise in simulations with real sequences differs from the level predicted by the analytical theory (Spang and Vingron, 1998).

The theoretical results are based on the assumption that the sequences are i.i.d. This is the standard model for unrelated sequences. However, biological sequences are different. An important difference is that biological sequences cluster strongly (Krause and Vingron, 1998). From this perspective it comes as a surprise that, in spite of the clustering, we still observe an almost perfect logarithmic growth of noise. The following simulation experiment shows that clustering slightly reduces noise levels, but the effect is marginal.

The Dayhoff model of sequence evolution (Dayhoff et al., 1978) allows for the generation of sequence pairs at a given PAM distance. For a given distance $t$, 100 triples of sequences $(Q_i, S_i^1, S_i^2)$ were generated. All sequences have length 350. A triple consists of a cluster $(S_i^1, S_i^2)$ of sequences that are $t$ PAM apart and a query $Q_i$, which is independent of both $S_i^1$ and $S_i^2$. The alignment scores $H^1(t)_i$ and $H^2(t)_i$ of $Q_i$ to $S_i^1$ and $S_i^2$ were sampled. The experiment was repeated for PAM distances of t = 1, 5, 10, 15, 40, 60, 80, 200, 300, 400, 500. If $S^1$ and $S^2$ are identical ($t = 0$), we also have $H^1(0) = H^2(0)$. If they differ in only a few positions, $H^1(t)$ and $H^2(t)$ are generally not equal; however, they are strongly correlated. Clearly, this correlation decreases as $S^1$ and $S^2$ become more distant from each other. Figure 2 shows a plot of the PAM distance versus the correlation of scores. One can observe a fast decrease of the correlation with growing PAM distance. For illustration, an exponential curve is fitted to the data.

In a second step the samples $H^1(t)_i$ and $\max(H^1(t)_i, H^2(t)_i)$ are compared. $H^1(t)_i$ is a sample of alignment scores stemming from independent i.i.d. sequences of length 350 each.
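The design of the triple experiment can be imitated on a small scale. The sketch below is only a stand-in under loudly stated assumptions: instead of the Dayhoff model it resamples each residue of $S^1$ uniformly with probability `p`, it uses a simple match/mismatch score with a linear gap penalty instead of PAM250 with gap costs 15 and 3, and it uses shorter sequences and fewer triples so that it runs quickly. All names (`mutate`, `score_correlation`, and so on) are hypothetical.

```python
import random

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"  # the 20 amino acids, drawn uniformly here


def local_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score (Smith-Waterman recursion, linear gaps)."""
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            curr[j] = max(0, prev[j - 1] + sub, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


def random_seq(rng, length):
    """An i.i.d. random sequence over ALPHABET."""
    return "".join(rng.choice(ALPHABET) for _ in range(length))


def mutate(rng, seq, p):
    """Resample each position with probability p (stand-in for a PAM distance)."""
    return "".join(rng.choice(ALPHABET) if rng.random() < p else c for c in seq)


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long samples."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5


def score_correlation(rng, p, triples=60, length=60):
    """Correlation of the scores of a query Q against S1 and its mutated copy S2."""
    h1, h2 = [], []
    for _ in range(triples):
        q = random_seq(rng, length)   # query, independent of the cluster
        s1 = random_seq(rng, length)  # first cluster member
        s2 = mutate(rng, s1, p)       # second cluster member
        h1.append(local_score(q, s1))
        h2.append(local_score(q, s2))
    return pearson(h1, h2)


if __name__ == "__main__":
    for p in (0.0, 0.2, 0.5, 1.0):
        print(p, round(score_correlation(random.Random(1), p), 3))
```

With `p = 0` the two cluster members are identical and the correlation is one; as `p` grows the two scores decouple. How fast this happens under the real Dayhoff model is what Figure 2 reports, and this toy model makes no claim to reproduce those numbers.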
Consequently, the sample $H^1(t)_i$ is extreme value distributed. Its location parameter $\xi_1$ is estimated using direct estimation. For very large PAM distances, $\max(H^1(t)_i, H^2(t)_i)$ can be interpreted as a sample of alignment scores between independent i.i.d. sequences of length 350 and 700. Hence, it is also extreme value distributed. Let $\xi_{\max}$ denote its estimated location parameter. If $S^1$ and $S^2$ are independent, the increase of search space yields a noise location $\xi_{\max}$ that is shifted by about 2.1 score units relative to $\xi_1$. For smaller PAM distances, this shift is reduced, due to the correlation of $H^1(t)$ and $H^2(t)$. In Figure 3, the PAM distance $t$ is plotted versus $\xi_{\max} - \xi_1$. The final shift value is already reached for evolutionary distances as small as 40 PAM. Both plots indicate that cluster effects play a minor role in the determination of noise levels in database searches, since moderately related sequences already induce the same amount of noise as uncorrelated sequences. This also explains why clustering does not bring the growth of noise to a stop.

[Figure 2. Vertical axis: correlation coefficient. Horizontal axis: PAM.]

Fig. 2. The correlation coefficients of the score samples $H^1(t)_i$ and $H^2(t)_i$ are plotted against the PAM distance of the sequences $S^1$ and $S^2$. An exponential curve of the form $t \mapsto \exp(-\beta t)$ is fitted to the data to visualize the rapid decrease of the correlation coefficients.

[Figure 3. Vertical axis: location shift. Horizontal axis: PAM.]

Fig. 3. The location shift between the samples $H^1(t)_i$ and $\max(H^1(t)_i, H^2(t)_i)$ is plotted versus the PAM distance of $S^1$ and $S^2$. The dashed line is of the form $t \mapsto \alpha - 1/(\beta t)$.

DISCUSSION

We have demonstrated, using 38 releases of the SwissProt database as an example, that noise levels in database searches grow at a logarithmic rate with the size of the database. While the clustering of real biological sequence data can theoretically reduce noise levels, we have shown by simulation experiments that this effect is of no practical importance.

Database search programs like BLAST and FASTA provide E-values. In its simplest form the E-value is just the p-value $p(t)$ of a score level $t$, multiplied by the number of sequences $D$ in the database (Altschul et al., 1994). Clearly, E-values are sensitive to the loss of credibility associated with expanding databases. Essentially the same results are obtained if one interprets a database search as a multiple hypothesis test and uses the Bonferroni correction. In order to have a significance level $\alpha$ for the database search, one needs significance levels of $\alpha/D$ for each comparison. In Spang and Vingron (1998), we suggest that this correction might be too conservative and we recommend the use of p-values based on the effective length of a database for more accurate adjustments.

The simulation results in this paper underline the importance of adjusting p-values for multiple comparisons. In fact, significance levels in molecular biology should be corrected not only for the database size, but for all queries made against a database over time. If we only correct for database size to restore a 0.05 probability of a false positive for a single search, then we still have 1 out of 20 searches leading to a wrong answer. This becomes important in the context of inter- or intra-genome comparisons.

Our analysis is based purely on random similarities. This approach is also the rationale of the E-values known from BLAST and FASTA. Its appeal is that there is no need to fit a model of related sequences, which would require the difficult choice of a representative sample of homologous sequences. However, this might also be seen as a shortcoming, since inference is restricted to probabilities of false positive hits.
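The arithmetic behind these corrections is simple enough to write down. The following sketch restates the simplest form of the E-value and the Bonferroni level described above; the function names and the example database size of 80000 sequences are assumptions made for illustration, and real search programs compute E-values in more refined ways.

```python
def e_value(p_value, num_sequences):
    """Simplest form of the E-value: the p-value times the database size D."""
    return p_value * num_sequences


def per_comparison_level(alpha, num_sequences):
    """Bonferroni: level needed per comparison for an overall level alpha."""
    return alpha / num_sequences


def expected_false_searches(alpha, num_searches):
    """Expected number of searches with a false positive at search-wise level alpha."""
    return alpha * num_searches


def prob_any_false_search(alpha, num_searches):
    """Probability that at least one of several independent searches errs."""
    return 1.0 - (1.0 - alpha) ** num_searches


if __name__ == "__main__":
    D = 80000  # hypothetical number of database sequences
    print(e_value(1e-6, D))               # a p-value of 1e-6 gives E of about 0.08
    print(per_comparison_level(0.05, D))  # about 6.25e-07 per comparison
    print(expected_false_searches(0.05, 20))
    print(round(prob_any_false_search(0.05, 20), 3))
```

The last two lines correspond to the remark that a search-wise level of 0.05 still leaves, on average, one misleading search in twenty; correcting across queries would mean dividing the level once more by the number of searches.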
The problem of false negatives cannot be addressed by an analysis based on random similarities alone. For a comparative evaluation of database search algorithms (Pearson, 1995; Brenner et al., 1998; Park et al., 1998), both sensitivity and selectivity are important. Consequently, random sequence similarity and the evaluation of algorithms for homology detection are different statistical problems, requiring different methodology.

What can be done about the effect of growing noise? Our simulation experiments are restricted to homology search by local alignment. While this is the most commonly used algorithm for database searching, there are alternative strategies. Iterative methods such as intermediate sequence searches (Park et al., 1997), PSI-BLAST (Altschul et al., 1997), SYSTERS (Krause and Vingron, 1998) or SAM-T98 (Karplus et al., 1998) start with an initial standard database search. The findings of the first round of search are then used to construct a protein family specific scoring scheme for a second search. This updating strategy is iterated. A crucial difference from non-iterative search methods is that the scoring scheme changes when databases expand. Clearly, the growth of noise due to random sequence similarity is also a problem for these methods. However, it may be compensated by improved scoring schemes in larger databases. Iterative search methods have been shown to be successful (Park et al., 1998). As databases expand, noise grows and scoring schemes improve.

One might also consider building smaller databases that contain only some representative sequences for each protein family. In these smaller databases, noise levels would be lower. On the other hand, real signals would be weaker too. These tradeoffs have not yet been discussed in the literature. Such a discussion cannot be carried out with the methods used in this paper, since it must address both sensitivity and selectivity.

It is a consequence of the Borel-Cantelli Lemma that infinite i.i.d. random sequences contain a perfect match to any given word of finite length with probability 1. The Erdös-Rényi law (Erdös and Rényi, 1970) tells us how long we need to wait for this perfect match. The results in Dembo et al. (1994a) generalize this result to the case of approximate matches represented by high scoring local alignments without gaps. Improving the scoring scheme by alignments with gaps (Smith and Waterman, 1981), profiles (Gribskov et al., 1987; Luthy et al., 1994), templates (Taylor, 1986), hidden Markov models (Krogh et al., 1994; Eddy et al., 1995) or jumping alignments (Spang et al., 2000) means changing the measure of similarity. While this strategy often improves database search results, it clearly cannot rule out that there are unrelated sequences in the database that score higher than remote members of the family. The problem of random sequence similarities is a more general one. On the one hand, sequence conservation is weak for distantly related sequences, and on the other hand, large datasets are likely to contain sequences that display a stronger similarity to a given sequence although they are not related to it.

ACKNOWLEDGEMENTS

We thank two anonymous referees for very helpful comments on an earlier version of this paper.

REFERENCES

Altschul,S.F., Boguski,M.S., Gish,W. and Wootton,J.C. (1994) Issues in searching molecular sequence databases. Nature Genet., 6.
Altschul,S.F., Gish,W., Miller,W., Myers,E.W. and Lipman,D.J. (1990) Basic local alignment search tool. J. Mol. Biol., 215.
Altschul,S.F., Madden,T.L., Schaffer,A.A., Zhang,J., Zhang,Z., Miller,W. and Lipman,D.J. (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res., 25.

Arratia,R., Gordon,L. and Waterman,M.S. (1986) An extreme value theory for sequence matching. Ann. Stat., 14.
Bairoch,A. and Apweiler,R. (1999) The SwissProt protein sequence data bank and its supplement TrEMBL in 1999. Nucleic Acids Res., 27.
Brenner,S.E., Chothia,C. and Hubbard,T.J.P. (1998) Assessing sequence comparison methods with reliable structurally identified distant evolutionary relationships. Proc. Natl Acad. Sci. USA, 95.
Dayhoff,M., Schwartz,R. and Orcutt,B. (1978) A model of evolutionary change in protein. Atlas of Protein Sequences and Structure, 5.
Dembo,A., Karlin,S. and Zeitouni,O. (1994a) Critical phenomena for sequence matching with scoring. Ann. Prob., 22.
Dembo,A., Karlin,S. and Zeitouni,O. (1994b) Limit distribution of maximal non-aligned two-sequence segmental score. Ann. Prob., 22.
Eddy,S., Mitchison,G. and Durbin,R. (1995) Maximum discrimination hidden Markov models of sequence consensus. J. Comp. Biol., 2.
Erdös,P. and Rényi,A. (1970) On a law of large numbers. J. Anal. Math., 22.
Gribskov,M., McLachlan,A. and Eisenberg,D. (1987) Profile analysis: detection of distantly related proteins. Proc. Natl Acad. Sci. USA, 84.
Karlin,S. and Altschul,S.F. (1990) Methods for assessing the statistical significance of molecular sequence features by using general scoring schemes. Proc. Natl Acad. Sci. USA, 87.
Karplus,K., Barrett,C. and Hughey,R. (1998) Hidden Markov models for detecting remote protein homologies. Bioinformatics, 14.
Krause,A. and Vingron,M. (1998) A set-theoretic approach to database searching and clustering. Bioinformatics, 14.
Krogh,A., Brown,M., Mian,I., Sjølander,K. and Haussler,D. (1994) Hidden Markov models in computational biology. Applications to protein modelling. J. Mol. Biol., 235.
Lipman,D.J. and Pearson,W.R. (1985) Rapid and sensitive protein similarity searches. Science, 227.
Luthy,R., Xenarios,I. and Bucher,P. (1994) Improving the sensitivity of the sequence profile method. Protein Sci., 3.
Park,J., Teichmann,S.A., Hubbard,T. and Chothia,C. (1997) Intermediate sequences increase the detection of homology between sequences. J. Mol. Biol., 273.
Park,J., Karplus,K., Barrett,C., Hughey,R., Haussler,D., Hubbard,T. and Chothia,C. (1998) Sequence comparisons using multiple sequences detect three times as many remote homologues as pairwise methods. J. Mol. Biol., 284.
Pearson,W.R. (1995) Comparison of methods for searching protein databases. Protein Sci., 4.
Pearson,W.R. and Lipman,D.J. (1988) Improved tools for biological sequence comparison. Proc. Natl Acad. Sci. USA, 85.
Smith,T.F. and Waterman,M.S. (1981) The identification of common molecular subsequences. J. Mol. Biol., 147.
Spang,R. and Vingron,M. (1998) Statistics of large scale sequence searching. Bioinformatics, 14.
Spang,R., Rehmsmeier,M. and Stoye,S. (2000) Sequence database search using jumping alignments. In Proceedings of the Eighth International Conference on Intelligent Systems for Molecular Biology.
Taylor,W. (1986) Identification of protein sequence homology by consensus template alignment. J. Mol. Biol., 188.
Waterman,M.S. and Vingron,M. (1994) Sequence comparison significance and Poisson approximation. Stat. Sci., 9.


More information
