Computational Molecular Biology. Lecture Notes. by A.P. Gultyaev

Size: px

Start display at page:

Download "Computational Molecular Biology. Lecture Notes. by A.P. Gultyaev"

Edwin Cox
6 years ago
Views:

1 Computational Molecular Biology Lecture Notes by A.P. Gultyaev Leiden Institute of Applied Computer Science (LIACS) Leiden University January

2 Contents Introduction Sequence databases Nucleotide sequence databases Protein sequence databases Sequence alignment Pairwise alignment Sequence database similarity search Multiple sequence alignment Sequence motifs, patterns and profile analysis Gene finding and genome annotation Search by signal Search by content Probabilistic models Similarity-based search Genome annotation strategies Protein structure prediction Homology modeling Fold recognition Ab initio protein structure prediction Combinations of approaches Predictions of coiled coil domains and transmembrane segments Inverse folding and protein design RNA structure prediction Thermodynamics of RNA folding and RNA structure predictions Comparative RNA analysis Detecting conserved structures in related RNA sequences RNA motif search Recommended reading Exercises

3 Introduction The course Computational Molecular Biology aims to make participants familiar with a number of widely used bioinformatic approaches. The purpose of the course is not to make a guided tour along the available resources in the Internet such as databases, programs etc. but rather to give an idea about the main principles behind efficient algorithms for genomic data analysis. Of course, the text below is just some (lecture) notes that may assist in understanding the material given in lectures and should not be considered as independent comprehensive description. This material is accompanied by exercises with tasks to be accomplished using a number of bioinformatic resources, in order to get hands-on experience. Basic Molecular Biology Introduction: Biopolymers (DNA, RNA, proteins) are the main components of life on Earth. DNA and RNA molecules are the polynucleotide chains containing four nucleotides, protein polypeptide chains contain 20 residue types (amino acid residues). The main "dogma" of Molecular Biology states that DNA is a carrier of genetic information, which is transcribed into messenger RNA (mrna) molecules that are translated to proteins responsible for functioning and reproduction of organisms. Analysis of monomer sequences of biopolymers is of great importance for understanding all living systems, from viruses to humans. Computational analysis of sequence data involves large databases and complex algorithms. The sequence analysis algorithms are developed for sequence comparisons, recognition of functional sequence patterns and processing of experimental sequencing data. Sequence analysis is the core of bioinformatics. Genes carrying genetic information in the majority of species (except some viruses) are the regions of double-helical DNA genomes containing two complementary antiparallel polynucleotide strands that have a so-called 5'->3' polarity (following the nomenclature of atoms in monomer units). Each monomer unit (nucleotide) contains phosphate, sugar (desoxyribose) and one of nucleotide bases: adenine (A), thymine(t), guanine (G) or cytosine (C). The complementarity is determined by Watson-Crick base-pairing in AT and GC pairs. Thus a double-helical DNA molecule is described by the sequence of one of its strands, e.g. TAGCGCAGGG... (in this case the complementary strand in this fragment is...ccctgcgcta). DNA replication mechanism ensures the transfer of genetic information upon reproduction, sequence changes occurring due to replication errors (mutations) may lead to disorders or evolutionary development. Transcription of DNA into RNA is carried out by RNA polymerase complex which uses one of DNA strands as a template. The resulting RNA sequence is determined by Watson-Crick complementarity rules and thus is the copy of complementary DNA strand sequence, except that thymine is substituted by another nucleotide base, uracil (U). Start and termination points of transcription are important elements of gene sequence analysis. Sequence signals that determine transcription starts, are called promoters. Apart from a protein-coding region, which is translated to amino acid sequence, an mrna molecule contains two untranslated regions "before" (upstream) and "after" (downstream) of the coding sequence. These regions are called 5' untranslated region (5'UTR) and 3'UTR, respectively. Translation of a nucleotide sequence in an amino acid sequence occurs according to the genetic code, with amino acids encoded by triplets of nucleotides (codons). Genetic code is redundant, as the majority of amino acids can be encoded by different codons, frequently with the last 3rd position being so-called "wobble" one. Start-codons are ATG sequences (or AUG in RNA, encoding for methionine amino acid), stop-codons are TAA, TAG and TGA. The translated region between a startand stop-codon is called open reading frame (ORF). Due to the triplet coding, for any DNA sequence 6 reading frames are possible: 3 frames in each of the complementary strands. In prokaryotic organisms (those lacking nucleus in their cells), an mrna molecule may contain an operon consisting of several ORFs coding for several proteins. In eukaryotes, genes have different complexity: mrna molecules with ORFs are not transcribed directly, but are produced by processing of precursor pre-mrna molecules. In the processing (RNA splicing), non-coding pre-mrna regions (introns) are removed, and the rest (exons) are ligated, leading to mature mrna molecules. A number of eukaryotic genes are characterised by alternative splicing, leading to various exon combinations in mrnas and, therefore, different encoded proteins. Biopolymer sequences contain multiple sequence patterns that regulate their functions. Furthermore, these functions depend on 3D structures of biopolymer molecules. Prediction and analysis of the 3D structures on the basis of onedimensional sequences is also an essential part of bioinformatic sequence analysis. 3

4 1. Sequence databases The databases of protein amino acid sequences have appeared before nucleotide databases. The Atlas of Protein Sequences and Structures was published in The development of efficient DNA sequencing techniques resulted in the explosion of the information on nucleotide sequences. In the early 1980 s three major nucleotide sequence databases have been created: European EMBL nucleotide sequence database, American GenBank and Japanese DDBJ. In 1988 an agreement of a common format has been achieved. Nowadays, the three databases exchange all sequences. However, any update of a sequence entry is performed by a database that owns the entry from the moment of direct submission, to prevent conflicts of updates. The main elementary unit of sequence databases is a flatfile representing a single sequence entry. Although the database management systems (DBMS) of modern sequence databases may invoke relational database management systems (DBMS), flatfile format remains a convenient format of exchange between GenBank, EMBL and DDBJ databases. Some small (mainly syntax) differences between flatfiles of the three databases do exist, for instance, linetype prefixes at every line of EMBL files. For an example of sequence entry flatfile see Sample GenBank Record, The main three parts of the GenBank entry are the header (description of the record), features (annotations on the record) and the nucletide sequence itself. These main parts contain various datafields. The header contains the description of sequence including data such as names of organism, gene name, sequence length, keywords, literature reference data. A very important part of the header is identifiers that are used for searching the entries. Two identifiers are used: Accession and GI numbers. An accession number is unique for a sequence record. It does not change, even if information in the record is changed (e.g. sequence correction), but version number is assigned (U ; U ). GI-numbers (GenInfo) identifiers are unique for every sequence: if a sequence changes, a new GI-number is assigned. Features describe the biological information relevant to the sequence (annotation), such as encoded proteins, RNA sequences, various signals such as protein binding sites, mrna, exons and introns etc. Many of the datafields in GenBank sequence entry provide links to the relevant entries of other databases such as literature (e.g. PubMed) or protein databases. These are so-called hard links (between entries from different databases that share the same datafields) of the ENTREZ retrieval system. ENTREZ is maintained by American National Center for Biotechnology Information (NCBI) and provides unified access to various databases and resources. In addition to the access to the primary database of nucleotide sequences (to which data is submitted by scientific community), ENTREZ provides the access to various curated databases such as RefSeq (non-redundant reference sequences), ENTREZ Gene (gene-centered information) etc. ENTREZ has an unified query system for all databases, with useful options such as searching in specific datafields or using Boolean operators. One of the oldest and very popular protein sequence databases is SWISS-PROT. It was created in 1987 by collaboration of University of Geneva and EMBL. The database is accessible via both Entrez (NCBI) and the EMBL - European Bioinformatics Institute (EMBL-EBI) systems. It can be called a secondary (curated) database: it adds value to what is already present in a primary database (nucleotide sequence database). A very small proportion of amino acid sequences is directly submitted to SWISS-PROT. The data in SWISS-PROT sequence file may be divided into the core data and annotation. The core contains a sequence (single letter coding), bibliography and taxonomic information. The annotation includes function(s) of the protein, domains, specific sites etc. In addition to SWISS-PROT, the protein entries in the Entrez retrieval system have been compiled from several databases, and also include translations from annotated coding regions in GenBank and RefSeq. ENTREZ Protein database entries provide hard links to relevant entries from other databases. Furthermore, ENTREZ system provides the links to the protein sequences similar to the one contained in the accessed entry (a concept of neighboring entries in ENTREZ). It should be noted that some protein-coding sequences (open reading frames, ORFs) may remain unnoticed (unannotated). A popular resource of ENTREZ is the ORF Finder with relatively simple algorithm searching for potential ORFs in a given nucleotide sequence in all 6 reading frames (including complementary strand). A similar tool is provided by EMBL-EBI: EMBOSS Transeq ( 4

5 2. Sequence alignment In general, alignment of sequences can be defined as mapping between the residues of two or more sequences, that is, finding the residues that are related to each other due to phylogenetic or functional reasons. One of the main goals of sequence alignment is to determine whether sequences display sufficient similarity to conclude about their homology. However, similarity is not equivalent of homology. Similarity (of sequences) is a quantity that might be measured in e.g. percent identity. Homology means a common evolutionary history. In principle, all alignment methods try to follow the molecular mechanisms of sequence evolution, namely substitutions, deletions and insertions of monomers: AGTCA AGTCA AGT-CA!!! AGACA AG-CA AGTACA (substitution) (deletion) (insertion) An alignment that spans complete sequences is called global alignment, an alignment of sequence fragments - local alignment. Depending on a biological problem, either global or local alignment is more efficient to detect a similarity between sequences. For instance, global alignments are advisable in case of relatively closely related sequences. Local alignments are a powerful tool to localize the most important regions of similarity, e.g. functional motifs, that may be flanked by variable regions that are less important for functioning. Obviously, local alignments should be applied in cases of mosaic arrangement of protein domains or for comparisons between spliced mrnas and genomic sequences. Alignment tasks can be subdivided into pairwise alignment (two sequences), multiple alignment (more than two) and database similarity search (searching for the sequences in the database that are similar to a given sequence) Pairwise alignment For two sequences, many different alignments are possible (for instance, about different alignments for two sequences of length 300). Obviously, an alignment with relatively high number of matching residues in two sequences and relatively small number of deletions/insertions is the most likely to have biological meaning, because it assumes smaller number of evolutionary events. The problem to find such an alignment is equivalent to finding the best path in a dot matrix, where two sequences are plotted at the coordinates of a two-dimensional graph and dots indicate matching residues. Qualitatively, the best global alignment would be a path visiting as many dots as possible with a preference for diagonal moves as compared to horizontal or vertical moves (because the latter represent deletions/insertions in the sequences). Dot matrices are also very demonstrative for visualizing alignments, especially after filtering local regions of low similarity, for instance, leaving only diagonals of consecutive similar residues. Sequence 2 Sequence 2 A G C T A G G A G A G C T A G G A G A... A. G.... G. C. C. G.... G. Seq.1 G.... G. A... A. G.... G.. A... A. G.... G. The example shown above suggests two alternative alignments: AGCTAGGAG AGCTAGGAG-- seq.1 or AGCGGAGAG AGC--GGAGAG seq.2 (filtering out similarities of less than 3 consecutive residues) The second alignment has more matching residues than the first one, but it contains deletions. Obviously, the selection of the best alignment depends on scoring system for matches, mismatches and deletions/insertions. Thus, finding the optimal alignment requires: (1) a definition of scores for matches, mismatches and deletions/insertions; (2) an algorithm to find the best score. 5

6 2.1.1 Substitution matrices A simple match/mismatch scoring (such as 1 for a match, 0 for mismatch) is not the most effective, especially in proteins, where conservative substitutions of amino acids with similar properties are relatively frequent. For instance, substitutions of functionally important polar residues by polar ones or mutual substitutions of hydrophobic residues, e.g. Arg Lys or Val Ile. Therefore, the idea of substitution matrix (Dayhoff et al., 1978) has been introduced. Such a matrix (dimensions with diagonal symmetry) defines different scores for residue matches and all possible substitutions as well. The score values can be estimated from the previously aligned sequences with trusted alignments. An efficient way to introduce such scores is a log-odds calculation: a score proportional to the logarithm of the ratio of observed substitution frequency (target frequency) to the background frequency which is determined by chance. Thus, the score for a substitution of two residues a and b is S(a,b) ~ log [ p ab observed / p ab background ] = log [ p ab observed / f a f b ], where p ab observed is the probability (frequency) of a/b substitution observed in used alignments, and f a and f b are the amino acid occurencies. The logarithm is used for calculations because adding such scores for positions in any new alignment would be equivalent to estimating a likelihood of the hypothesis that the alignment reflects sequence homology and has a non-random pattern (this likelihood is equal to the product of likelihoods of non-random aligning of all residue pairs, if the pairs are statistically independent). The first widely used substitution matrices were based on the point-accepted-mutation (PAM) evolutionary model. In this model, a unit of protein divergence (1 PAM) was introduced, corresponding to a change of 1% of amino acid sequence. It is very important to note that 100 PAM change does not lead to a complete change of the sequence, because some residues may change several times while others will not be substituted at all. Alignments of very closely related sequences (trusted alignments in any scoring system) were used to estimate the target frequencies corresponding to 1 PAM. These values can be extrapolated to more distant sequences, e.g. PAM250 (the matrix originally published by M. Dayhoff et al., 1978). A different strategy was used for so-called BLOSUM matrices. They were derived from alignments contained in the database BLOCKS of local multiple alignments. Here the frequencies were computed directly from alignments. In contrast to PAM model, relatively distant sequences were considered, to avoid statistical biases due to overrepresentation of some sequences. Thus, BLOSUM62 matrix has been computed from the sequences having maximum 62% identity (sequences with higher identities were merged into single strings). The performance of BLOSUM62 turned out to be very good, and this is a standard matrix for many protein alignment programs. The BLOSUM62 scoring matrix: A 4 R -1 5 N D C Q E G H I L K M F P S T W Y V A R N D C Q E G H I L K M F P S T W Y V 6

7 For DNA, the 4 4 matrices are relatively more simple, for instance, BLASTN program is using a scoring system with the score +2 for a match, and -1 for a mismatch. Sometimes the differences between more frequent purine-purine or pyrimidine-pyrimidine transitions (A G, C T ) as compared to transversions (A C, A T, G T etc.) are taken into account. For instance, a matrix used in BLASTZ, a program for comparisons of large genomic sequences: A C G T A C G T Gap penalties Locations of insertions/deletions in the alignments are called gaps. In attempt to minimize the number of gaps, some negative scores (penalties) are introduced. The penalty values are subtracted from the scores calculated in non-gapped regions of alignments. The most common formula for a gap penalty is S(gap) = G + L. n, where G is the gap-opening penalty (also called gap existence cost), L is the gap-extension penalty (gap extension cost) and n is the number of residues in the gap (the gap length). There is no solid theory here and the choice of values is highly empirical. In the combination with BLOSUM62 substitution matrix, the values of G around and L around 1-2 turned out to perform rather well Algorithms for finding the optimal alignment The problem of finding the optimal alignment of two sequences (given a defined scoring system) has been solved using a dynamic programming algorithm. Dynamic programming approach is used in various complex optimization problems when solution of a problem can be reduced to a subproblem solution with some recursive calculation for the problem solution on the basis of the subproblem. In case of the alignment problem, formulated as finding the best path in a matrix (using scores for matches/mismatches and gap penalties), a very important observation is that any partial subpath of the optimal one is also the optimal path for the alignment of two subsequences. The main recursive definition for alignment of two (sub)sequences of the length m and n residues is as follows: the optimal score S(m,n) is the best value out of three scores: S(m-1, n-1) + (score of aligning residues m and n), S(m-1, n) + (gap penalty), S(m, n-1) + (gap penalty). Using this definition is possible to both build a matrix of the best scores for all possible subalignments (called dynamic programming matrix) and to backtrack the optimal alignment for the full sequences. The dynamic programming matrix is filled starting from the smallest subsequences using bottom-up approach. After completing the matrix, it is possible to restore the sequence of calculations that has led to the final score. 7

8 An example (from S. Eddy, 2004): two DNA sequences X = TTCATA and Y = TGCTCGTA. The scores: +5 for a match, -2 for a mismatch, -6 for each insertion or deletion. The dynamic programming matrix is filled as follows: 0 T G C T C G T A (Y) [alignment to initialising 0] T [ S(1,1)=5 (T-T match) ] T X C other values A correspond to gaps T A S(2,2) = max (5-2; -1-6; -1-6) = 3 S(2,3) = max (-1-2; 3-6; -7-6) = -3 S(2,4) = max (-7+5; -3-6; -13-6) = -2 etc. S(3,3) = max (3+5; -3-6; -3-6) = 8 S(3,4) = max (-3-2; 8-6; -2-6) = 2 S(3,5) = max (-2+5; 2-6; -8-6) = 3 etc. 0 T G C T C G T A T T C A T A complete dynamic programming matrix: 0 T G C T C G T A T T C A T A The final best score S(6,8)=11 has been computed (backtracking) as follows: S(6,8) S(5,7) S(4,6) S(3,5) S(2,4) S(1,3) S(1,2) S(1,1) (+5) (+5) (-2) (+5) (+5) (-6) (-6) The backtracking allows one to reconstruct the optimal alignment: T--TCATA sequence X TGCTCGTA sequence Y Such an algorithm, given a scoring system, guarantees to find the optimal global alignment of two sequences. The algorithm is often called Needleman-Wunsch algorithm by names of the authors (1970). An extension of this strategy, the Smith-Waterman algorithm (1981) has been developed to find the optimal local alignment. A locally optimal alignment is defined as the one that cannot be improved either by extending or shortening the alignment regions. Both global and local optimal alignments can be computed, for instance, by a tool align ( algorithms needle and water. 8

9 Estimates of statistical significance for alignments Given a scoring system and a (dynamic programming) algorithm for optimal alignment, any two sequences can be aligned. However, there is no guarantee that this alignment is a consequence of homology between the sequences. Apparently, this is likely to be true if the obtained alignment would be very unlikely to be determined by chance alone. Therefore, an estimate of statistical significance is very important. For global alignments, there is no mathematical theory to estimate a score that can be expected by chance. Significance of a global alignment can be estimated by comparison with statistics yielded by alignments of randomised (permuted) sequences. For instance, in order to estimate a significance for the alignment of two sequences X and Y of lengths m and n residues, respectively, one can generate 100 sequences of m residues and 100 sequences of the length n by reshuffling of residues. It is important that these sequences have the same sizes and residue composition as original ones, because the statistics depend on these parameters. Thus, 100 random alignments can be produced and their scores can be compared with the score of the real alignment. Still, a quantitative estimate of significance is not very straightforward, because the distribution of scores in random alignment is not normal. Qualitatively, it is possible, for instance, to estimate the probability to get a given alignment by chance (P-value) as less than 0.01 if 100 random alignments yielded the scores that are less than the alignment of interest. For ungapped local alignments the statistical estimates can be directly calculated. For two random sequences of lengths m and n (sufficiently large), the probability to find at least one gap-free alignment (called high-scoring segment pair, HSP) with a score at least S (P-value) is equal to P(S) = 1- exp [ - Kmn exp ( - λs) ], where K and λ depend on scoring rules and residue frequencies. The formula is derived from a so-called extreme value distribution, which is valid for the scores of random HSPs. The expected number HSPs with score at least S (E-value) is E(S) = Kmn exp ( - λs). For gapped alignments, a similar theory can be developed, but some large-scale estimates for randomised sequences are necessary to calculate the parameters used in the analytical formulas. The optimal alignment of two sequences (in terms of scoring system) is not always the optimal one in terms of biological significance. Frequently, the outputs of standard alignment programs require further manual improvement. For instance, it can be done on the basis of known features in the sequences such as codons or structural motifs. 9

10 2.2. Sequence database similarity search One of the very frequent problems is to search in the database of sequences for the entries that share some similarity with a given sequence (query). A general approach here is to align the query sequence to each subject sequence in the database, producing a list of so-called hits: sequences with relatively high similarity to the query. For a database search, a straightforward application of the optimal alignment methods (dynamic programming) is not efficient: this would be too slow. Some approximations are necessary to speed up the database scanning. One of the essential approximations is to break the sequences into short words (stretches of residues) and to look for the words similar in two sequences (an idea is that a true alignment of two sequences should contain at least one common word). The first widely used program of this kind was FASTA (Pearson & Lipman, 1988). The main strategy of the algorithm was first to identify some diagonal regions of high similarity according to a simple scoring for words, with subsequent extension and joining of these initiating diagonals using more accurate computation of scores. The main steps of FASTA are as follows (variations of the described strategy are possible): (Pearson & Lipman, 1988) (A) A search for words, for instance, with at least 4 consecutive matches. Then any diagonal region in the complementarity matrix can be scored using some formula that takes into account the number of words and the distances between them. The best (say, 10) diagonal regions are selected at this step. (B) The best 10 regions are rescored using a substitution matrix, allowing conservative substitutions and shorter runs of identities to contribute to the similarity score. The subregions with maximal scores for each of the best diagonals are identified ( initial regions ). (C) The algorithm joins some of the compatible initial regions, computing an optimal combination. (D) Within some band centered around the highest scoring intitial region, the alignment is recalculated using a dynamic programming algorithm. 10

11 Currently, the most popular program for database similarity search is BLAST (Basic Local Alignment Search Tool). BLAST implements a strategy to find statistically reliable local alignments. The main features of this strategy are as follows: (Altschul S.F. et al., 1990) - Instead of looking for words with exact matches, the algorithm searches for words in a subject sequence that have a score at least T (threshold) in comparison with a word from the query. Thus the word size W could be high without sacrificing sensitivity. Using the parameter T, BLAST minimizes the time spent on sequence regions whose similarity with the query has little chance to achieve a score at least S. In the standard BLAST for nucleic acids, W=11. More fast algorithm is used in the megablast with the default of 28 nucleotides. The standard value for protein-protein BLAST is 3. - When a word hit is found, BLAST extends this local alignment, thus yielding the maximal score possible for the region. The optimal extension is calculated by a dynamic programming algorithm. If the alignment extension decreases the achieved highest score by a value exceeding a "significance decay" parameter, the highest score and the corresponding alignment are returned to the user. - An important feature of BLAST is using mathematically tractable statistical estimates in order to derive the optimal parameters. - The BLAST strategy admits numerous variations, in particular, in the ways to allow gaps in the extensions of hits. In particular, a "two-hit" approach of the initial step of "ungapped BLAST", followed by a gapped extension, turned out to be rather efficient. In ungapped "two-hit" strategy, initial search for two words within some distance on a diagonal of the sequence space is carried out. If ungapped extension of these two words along a diagonal scores better than certain threshold, a further improvement by a gapped extension is searched for. An example below shows one of BLAST hits obtained with the sequence of human interferon-gamma (Accession NP_000610) as a query. (For simplicity, here part of the output information is removed.) The output provides the links to the subject sequence entries in the database and the statistical estimates. In the protein sequence search, a consensus sequence with identical residues and conservative substitutions (together defined as positives ) is also shown BLAST hit example (part of the output) interferon gamma precursor [Gallus gallus] Sequence ID: ref NP_ Length: 164 Score Expect Identities Positives Gaps 75.1 bits(183) 4e-14 46/131(35%) 72/131(54%) 10/131(7%) Query 30 EAENLKKYFNAGHSDVADNGTLFLGILKNWKEESDRKIMQSQIVSFYFKLFKNFKDDQS LK FN+ HSDVAD G + + LKNW E ++++I+ SQIVS Y ++ +N + Sbjct 33 DIDKLKADFNSSHSDVADGGPIIVEKLKNWTERNEKRIILSQIVSMYLEMLENTDKSKPH 92 Query IQKSVETIKEDMNVKFFNSNKKKRDDFEKLTNYSVTDLNVQRKAIHELIQVMAELSP 145 I + + T+K ++ KK D L + DL +QRKA +EL ++ +L Sbjct 93 IKHISEELYTLKNNL-----PDGVKKVKDIMDLAKLPMNDLRIQRKAANELFSILQKLVD 147 Query 146 AAKTGKRKRSQ KRKRSQ Sbjct 148 PP-SFKRKRSQ

12 Different BLAST programs are developed for various tasks. In addition to straightforward protein/protein and nucleotide/ nucleotide alignments, BLAST provides the options to compare possible proteins encoded by nucleotide sequences. Such options are very useful, for instance, for finding unannotated genes in the database (using a protein query, to find nucleotide sequences that might encode similar proteins). The main BLAST programs are as follows: - blastn: nucleotide query - nucleotide database - blastp: protein query - protein database - blastx: nucleotide query translated - protein database - tblastn: protein query - nucleotide database translated - tblastx: nucleotide query translated - nucleotide database translated It is possible to limit the search by some part of the database: for instance, blasting human genome, EST database etc. Also, a special option to align two sequences is developed. Statistical estimates are extremely important for the database similarity search, especially due to the enormous sizes of modern databases. Every BLAST hit is accompanied by such estimates, the E-value being the most demonstrative, showing a number of alignments with the scores better or equal to the given one, expected by chance alone (significant alignments should have rather low value). The alignment scores themselves cannot be directly used for significance evaluation; a so-called "bit score" is more informative because it is not dependent on scoring system. For a local pairwise alignment, its E-value and bit score S' are linked by an expression E = m n 2 -S', where m and n are the lengths of (sufficiently large) sequences (or a sequence and a database). Essentially, BLAST searches may be optimized by user for different levels of expected sequence similaty. For instance, highly similar nucleotide sequences are recommended to search using a so-called megablast option. In megablast, relatively large word sizes W are used (default: W=28). More dissimilar sequences may be searched by the discontinuous megablast algorithm (e.g. W=11 or W=12). The blastn option is intended for the less similar sequences, with W=11 (default) or even W=7. Diminishing word size makes the algorithm slower, but allows to retrieve more distant similarities. Furthermore, an option to automatically adjust parameters for short input sequences may be used as well. BLAST is a very widely used program: (hundreds of) thousands of queries per day. It is accessible via ENTREZ system. If a query is taken from the nucleotide sequence database, submitting of its accession number into the BLAST window is sufficient. Otherwise sequences could be submitted in a so-called Fasta format, a standard in many bioinformatic programs. In Fasta format, the first header line starts with right angle bracket sign ">" followed by some identifiers, the subsequent lines should contain only sequence itself. Fasta format examples: >xxx protein name MTINFEKLT... >xxx nucleic acid sequence name agctaggtcaatc... Fasta format files may also include multiple sequences, written on sequential lines, each sequence being started after its header line. Such files could be used in tasks using more than one sequence. 12

13 Progressive multiple sequence alignment 2.3. Multiple sequence alignment Multiple alignment of many sequences is significantly more complicated task than the pairwise alignment. A direct application of accurate dynamic programming procedure is very demanding computationally and is limited to relatively small numbers of relatively short sequences. The commonly used approach is progressive multiple sequence alignment, based on pairwise alignment of the most similar sequences with gradual addition of the more distant ones. The most widely used is CLUSTAL series of programs. The progressive alignment procedure is divided into three main parts: 1. All possible pairwise alignments between sequences in a given set. These pairwise alignments are used to calculate a distance matrix (the pairs of sequences with high similarity scores are assumed to be characterised by short evolutionary distances). 2. A guide tree is calculated from the distance matrix. The branching order of the guide tree, similar to evolutionary trees, follows the extent of relatedness of sequences. 3. The sequences are progressively aligned according to the tree branching order, starting from the closely related sequences. At every step the pairwise alignment is used to align two (clusters of) sequences belonging to some node in the tree. Alignment of sequence clusters differs from simple pairwise sequence alignment only by calculation of substitution scores from many residues rather than from two (see also below). Within an already built cluster, the positions of sequences are not changed, so that the gaps appearing during alignment to another cluster are introduced simultaneously in all cluster members. Various modifications are possible at every part of the algorithm, for instance, different ways to convert similarity scores to distances, clustering techniques to build a tree, approaches to construct consensus sequences etc. Guide tree. The first versions of CLUSTAL used a procedure of UPGMA (unweighted pair group method with arithmetic averages) to build the guide tree. This procedure is very simple from both conceptual and computational points of view. In this method, the first cluster is formed between the most related sequences (say, seq1 and seq2). A distance between this cluster and any other sequence seqx is calculated as arithmetic average d(seqx, new cluster) = [ d(seqx, seq1) + d(seqx, seq2) ] / 2. Now a new distance matrix can be calculated and the closest sequences (or clusters) selected. The procedure is repeated, so as at every step the distance between any two clusters is calculated as arithmetic average of all pairs of sequences from these clusters. In CLUSTAL W and subsequent packages, the neighbour-joining method (NJ) has been implemented. This method estimates a divergence along each branch of the tree. It builds unrooted trees by a star-decomposition procedure. At every step a decision to join two sequences (or clusters) is done using a modified distance matrix that also takes into account a divergence of these two objects from others. The root of the tree is found by a mid-point method: finding a point where the means of branch lengths on two sides of the root are equal. Sequence weights have been introduced in CLUSTAL W in order to compensate an unequal representation of some motifs in the dataset: for instance, the presence of a subgroup of very similar sequences while another subgroup has few representatives. The weights were introduced to up-weight the divergent sequences and down-weight those that are very similar to other sequences. The weight of a sequence was calculated by adding the values equal to lengths of branches in the tree between the sequence and the root, divided by numbers of sequences sharing them, for instance, as shown below: 0.08 seq.1 W(1)= seq.2 W(2)= seq.3 W(3)= seq.4 W(4)=0.226 seq.5 W(5)= W(1) = (0.2/2) + (0.1/4) = W(2) = (0.3/2) + (0.1/4) = W(5) =

14 The weighting is used for scoring the matching of residues in the alignment process. For a given position in the alignment, the score is weighted average of values M from a substitution matrix. If clusters of sequences are considered, sequences from one cluster are compared to sequences from another. For instance, assume aligning a cluster of 4 sequences and a cluster of 2. The score at any position of the alignment is calculated as weighted average of 4x2=8 matrix values: seq.1 PEEKSAVTAL Score = M(T,V)*W(1)*W(5) seq.2 GEEKAAVLAL + M(T,I)*W(1)*W(6) seq.3 PADKTNVKAA + M(L,V)*W(2)*W(5) seq.4 AADKTNVKAA + M(L,I)*W(2)*W(6) + M(K,V)*W(3)*W(5)! + M(K,I)*W(3)*W(6) seq.5 EGEWQLVLHV + M(K,V)*W(4)*W(5) seq.6 AAEKTKIRSA + M(K,I)*W(4)*W(6) In order to optimize the performance of multiple alignment procedure for sequences of various degrees of similarity, some flexible definitions of scoring are very useful, such as position-specific gap penalties and variable substitution matrices. For instance, gap penalties at some position may be decreased if the previous alignment steps have already identified a gap at this position. The choice of the matrix may be for instance BLOSUM80 for % similarities, BLOSUM 62 for 60-80%, BLOSUM45 for 30-60% and BLOSUM30 for less than 30%. The choice of a suitable matrix is done automatically at each step of the multiple alignment. Multiple alignment quality may be esimated in terms of sequence similarity derived from the alignment. Different scoring systems, so-called objective functions, can be introduced. For instance, for each position (column) of the alignment a quality score may be calculated as e.g. sum-of-pairs similarity score (according to a substitution matrix) and the total alignment score is computed as the sum of all columns. One of the serious problems in straightforward progressive multiple alignment is the local maximum problem: due to a greedy nature of the alignment strategy, there is no guarantee that the global optimal solution will be found. Any misaligned region produced early in the alignment process cannot be corrected later, when new information from other sequences is added. In particular, an incorrect branching order of the initial tree may significantly contribute to the problem Alternatives to the progressive alignment strategy Some modifications of progressive alignment strategy can improve the algorithm performance. For instance, iterative procedures may refine and improve the initial alignments and the tree: the estimates of the scores in the intermediate alignments allow the algorithm to make alignments and the tree mutually consistent, thus improving the accuracy of the final result. For instance, a guide tree can be build using so-called k-mer (or k-tuple) pairwise distances (based on common subsequences of the length k in two sequences) and used for progressive alignment. In turn, this progressive alignment can be used to calculate pairwise distances yielding another tree, which could yield a better alignment. It is possible to generate alternative alignments that can be estimated by an objective function and improved using some algorithm for alignment modification. For instance, the algorithm of T-Coffee package (Notredame et al., 2000) begins with a library of alignments: for each pair of sequences in the dataset, both best global alignment and the best 10 local alignments are computed. In the second step, a multiple sequence alignment is assembled, with a criterion to be maximally consistent with the library alignments. In recent years several algorithms have been developed in order to handle large alignments. Progressive alignments that are based on all pairwise distances become impossible in case of large numbers of sequences (time and memory requirements grow quadratically). It is possible to make guide tree construction faster by using a selection of seed sequences and calculation distances of all sequences to seeds only. This strategy is used in Clustal Omega, which is able to construct accurate alignments for large numbers of sequences. Clustal Omega can use both k-mer distances and those based on pairwise alignments (so-called Kimura distances). Clustal Omega also exploits clustering of sequences to construct subclusters that allow the calculation of profiles that describe alignments within subclusters by probabilities of substitutions, deletions and insertions. The profiles can be aligned to each other (see also next chapter), and can be converted into so-called hidden Markov Models (HMMs), like e.g. in Clustal Omega. Alignments of HMMs can be efficiently computed, yielding the full alignment. Finally, it is important to mention that there is no precise definition of the optimal alignment and there is no guarantee that the best solution is found in any of the programs available. Therefore, an inspection of multiple alignments is a must. Usually multiple alignment packages include special editing algorithms to make the changes manually. 14

15 3. Sequence motifs, patterns and profile analysis Sequence motif is a conserved stretch of residues that confer a specific function and/or structure to a biopolymer. Inferring a function from (new) sequence is often based upon comparative sequence analysis with identification of (partial) sequence similarities. Instead of aligning a sequence against many entries of a database (to find a functional similarity), a library of functionally important motifs can be defined and used for similarity search. For instance, a consensus sequence can be determined as the one that has all the absolutely conserved residues and arbitrary ones at positions where some variability is observed: KSQETPAR KSQETQAR KVPEVQGR K..E...R protein 1 fragment protein 2 fragment protein 3 fragment consensus Some consensus sequences may have degenerate positions. Such motifs can be defined using motif descriptors like generalized profiles: C-x(4,6)-[FYH]-x(5,10)-C-x(0,2)-C-x(2,3)-C C---RKNQ--Y-RHYWSENLFQ-C---FN---C---SL---C (profile or pattern) (sequence) Sometimes a pattern is rather weak : [DNSTAGC]-[GSTAPIMVQH]-x(2)-G-[DE]-S-G-[GS]-[SAPHV]-[LIVMFYWH]-[LIVMFYSTANQH] (example of a pattern from the PROSITE database that will match any sequence which contains D, N, S, T, A, G or C followed by G, S, T, A, P, I, M, V, Q or H followed by 2 residues followed by G etc.) Positions with ambiguities can be also indicated with curly brackets: e.g. {AP} means any amino acid except A and P. The information contained in the alignment of sequences constituting a pattern (consensus) can be converted into a scoring based on a position-specific scoring matrix (PSSM). For any target sequence, PSSM (also called profile) specifies the scores of finding each of the 20 amino acid residues at particular positions of a target sequence. For proteins, PSSM has dimensions 21xL, where L is the motif size, to include possible gaps. Of course, similar approach can be applied for nucleic acids, in searching for protein binding sites in DNA, specific RNA motifs etc. PSSM are also used in gene-finding algorithms (see chapter 4). The approach has been first formulated by Gribskov et al. (1987). In this work, the PSSM profiles were calculated on the basis of some alignments called probe. A modification of dynamic programming algorithm has been also designed to compare any sequence to the profile (compared to the standard alignment program, the major difference is simply in counting the scores from the appropriate PSSM values instead of those from a residue-residue substitution matrix). It has been shown that this approach is very effective in both determining whether a single sequence satisfies a motif and retrieving such sequences from a database. Many modern algorithms exploit similar approach. Frequently the scores in the PSSM are taken as proportional to logarithms of frequencies of finding particular residues at particular positions, normalized by background frequencies: position weight matrices (PWM). 15

16 Example of a profile generated by Gribskov s approach for a 49-residue pattern: positions E L V K A S S sequences ( probe ) G L V E P G S V S V A L N N L P V T P S Y Profile positions A C D E F G H I K L M N P Q R S T V W Y Δ (gap penalties) Based on the Gribskov approach and related algorithms, various programs for construction of profiles and for searching a profile library with a sequence query have been developed. Also, various collections of protein profiles are stored in the specialized databases such as PROSITE, Pfam, PRINTS. Being curated databases, these collections have slightly different focuses. PROSITE was one of the first databases of biologically significant patterns and is still one of the most widely used databases. In PROSITE, each entry is a concise description of the protein family or domain as well as a summary of the reasons leading to the development of the pattern or profile. A very tight relationship exists with the SWISS-PROT protein sequence data bank. Updating of PROSITE and of annotations of the relevant SWISS-PROT entries are often done in parallel. Many programs are developed by different groups to search PROSITE. The InterPro database is an integrated documentation resource with access to several databases of patterns and a combined search program InterProScan. 16

17 PSI-BLAST: a new generation of protein database search programs An approach based on iterated construction of profiles and their use for searching in the protein sequence database resulted in a very sensitive algorithm to search for weak but biologically significant similarities. The algorithm, called Position-Specific Iterated BLAST (PSI-BLAST), is based on repeated cycles of combining the hits obtained by a BLAST search into multiple alignment with the query as an anchor, followed by construction of corresponding PSSM and using this PSSM as a query. As a result, a PSI-BLAST search may retrieve some distantly related sequences as compared to the standard BLAST, because the search becomes directed to find a (presumably) functionally important motif. A simplified description of the approach: (Altshul et al., 1997) Collection of all database sequence segments aligned to the query with E-value below a threshold (default 0.01). Multiple alignment M. Alignment columns involving gaps inserted into the query are ignored, so M has the same length as the query. The matrix scores for a given alignment column could be calculated, using sequence weighting to avoid outvoting a small number of divergent sequences by a large set of closely related ones. Position-specific score matrix (PSSM) is constructed. PSSM is submitted to BLAST as a query (only minor modifications to the code as compared with single sequence querying). Multiple alignment M etc. Some related programs have been developed, for instance, IMPALA (Integrating Matrix Profiles And Local Alignments): a software package for searching a database of PSSM (PSI-BLAST-generated) using a protein sequence as query. Furthermore, the search in the database of profiles is included in current standard BLAST protein-protein program: the detection of known motif (if found) is always displayed before the list of similarity hits. 17

18 4. Gene finding and genome annotation Identification of gene positions in a genome is a complex task. The pipelines for analysis (annotation) of genomic sequences include multiple algorithms, based on various features of coding regions. Prokaryotes and eukaryotes present different problems for gene finding. Mechanisms of translation in these organisms are different. Prokaryotic mrnas may contain several genes, while eukaryotic mrnas are nearly always mono-cistronic. Eukaryotic genes contain introns while prokaryotic ones don t. Transcription of eukaryotic mrnas depends on specific transcription factors, which recognize different motifs in gene promoters. The algorithms for identifying of protein coding regions in nucleic acid sequences can be classified into three main categories: - search by signal (search for a specific pattern or consensus); - search by content (constraints upon sequence due to the coding); - similarity-based search (similarities to known genes and/or RNA transcripts). Finding genes coding for RNA molecules that do not code for proteins (non-coding RNAs) is based on the identification of specific motifs and/or RNA structures Search by signal In prokaryotes, very few promoter sequences (so-called "-10 regions") have exact consensus sequence. A matrix evaluation is possible to search for potential sites. The position-specific score matrices (PSSMs) or position weight matrices (PWMs) [see above] can be based on collections of known promoters. For instance, for the consensus promoter sequence in E.coli the following PWM can be constructed : A C G T (Stormo, 2000) For the consensus sequence TATAAT the score is the sum of numbers shown in bold, 85. However, a choice of a threshold in scoring of sequences according to such a matrix is not a trivial issue. Many promoters can significantly deviate from the consensus, while this consensus may occur outside promoter regions as well. Because of multicistronic operons, a search for translation initiation signals in prokaryotes is more important search than that for promoter localization (in contrast to eukaryotes). In addition to the start-codon ATG, a ribosome binding site (RBS, Shine-Dalgarno site) should be searched for. It is located at 8-9 nucleotides upstream of the start codon, below is the sequence-logo representation of the consensus derived from a dataset of E.coli initiation regions: (Shultzaberger et al., 2001) In eukaryotes there is no Shine-Dalgarno sequence. A scanning model of ribosome is usually valid, so if the 5 end of mrna is known, the first good ATG in the 3 -direction is normally a start codon. The consensus requirements are more simple than in prokaryotes, a so called Kozak consensus of a strong start is AccATGg. For identification of eukaryotic promoters, important signals are the binding sites of various transcription factors. The PSSMs describing these sites are collected in specific databases. Frequently these matrices are used for search in combination with other features (see also below). Consensus nucleotides near exon-intron and intron-exon junctions can be used for searching for positions of introns in eukaryotic genes. However, the matrices based on these consensa are not very specific. Search for signals can fail in many cases, because these signals are not always present or may occur in other context. Thus both false-positive and false-negative results can be expected. Therefore, many algorithms implement the search for content. 18

19 4.2. Search by content Search by content is based on the fact that coding for a protein puts constraints on the nucleotide sequence. The most obvious constraint: a coding sequence should not have a stop-codon within it. In a random sequence, the probability of any codon being followed by an open reading frame (ORF) of over 50 codons is less than 10% and that of over 100 codons less than 1 %. Thus, since most proteins are more than 100 amino acids, searching for long ORFs is an effective way of finding genes in prokaryotes: AUG... (long ORF)... UAG In genomes with high G-C content the reliability of searching for long ORFs decreases because the probability of random encountering a stop-codon (in any frame) is decreased. In eukaryotes, ORF searches are not reliable due to intron presence and relatively short exon lengths. Searching for ORFs can be made more specific by taking into account amino acid occurrence and codon preferences. It turns out that proteins with different functions and from all species share a typical average amino acid composition. This can be used to rate ORFs on their likelihood of being translated. Also, some codons are more frequent than others, and this may be used as well. Some other biases are observed in sequences, and this can be also used to measure the probability of a given sequence to be protein-encoding. Among the most important features, collectively called coding measures, are: - ORF measure (ORF length); - amino acid frequencies; - codon frequencies; - frequencies of bases at certain codon positions (composition measure); - hexamer measure (e.g. the frequency of CAGCAG is relatively high in exons). One of the efficient ways to combine coding measures to define a single value called a discriminant. A discriminant analysis can be done by categorizing samples (in this case, sequences) into two classes and finding the best discriminant that separates the classes. It is very important to note that many coding measures are organism - specific. Therefore many computer programs are specifically designed (trained) to predict protein-coding regions in particular species and are not applicable for other organisms. Modern programs frequently combine the search by signal and search by content into a single algorithm. For instance, search for promoters may score the nucleotide sequences by both the scores defined by some transcription-factor PSSM and the values derived from certain promoter-specific measures (e.g. frequencies of some oligonucleotides). In particular, vertebrate promoter regions frequently have higher contents of CpG nucleotides (CpG islands) than nonpromoter regions. The search for promoters can be significantly enhanced by comparative analysis (see also below) in various species and gene expression data Probabilistic models Basic idea of a probabilistic approach for gene finding is to find the sequence annotation (locations of coding and noncoding sequences) that has the highest probability. The goal is to define a functional role for each nucleotide in the sequence, so as for a DNA sequence S = {n1, n2,..., nl}, (ni is a nucleotide at position i ) the functional role of each nucleotide could be indicated by a functional sequence A = {a1, a2,..., al}, where each ai may take some values depending on a functional role, e.g. 0 if ni is a part of non-coding region, value 1 if ni is a part of a gene etc. In other words, the aim of gene-finding is to determine the most likely functional sequence A for given DNA sequence S. In a Hidden Markov Model (HMM) algorithm, statistical patterns specific for coding/non-coding sequences may be converted into parameters of Markov chain models (a Markov chain represents a series of observations wherein a probability of a given observation is dependent on previous observations). A biologically relevant nucleotide sequence may be modeled in this way, because a probability of observing particular nucleotide is dependent on nucleotides observed at other positions. For instance, within a coding region a probability of observing nucleotides G or A after two nucleotides TA in a coding frame is zero, because this would lead to the stop-codon UAG or UAA (in a non-coding region, there is no such dependence). Many other parameters, e.g. specific nucleotide frequencies, may be introduced into a model. In the framework of such a model, the algorithm is able to compute the probability that any particular functional sequence A underlies a given sequence S. Finding the sequence A corresponding to the highest probability and estimating reliability of the result actually means finding functional regions. 19

20 4.4. Similarity-based search The (significant) similarity of a region in a genome to a known gene in some other genome is one of the most reliable predictors. For instance, the hits of BLASTX (translated nucleotide query vs. protein database) may already predict some genes contained in the query. Another approach is to compare genomic sequences with the databases of expressed sequence tags (EST), such as dbest database of NCBI. EST s are short ( bp) single reads from 3 ends of mrna. Although these sequences are frequently incomplete and contain errors, a match between a genomic sequence and an EST can be used for gene annotation, especially when the EST spans several exons. Even crossspecies EST comparisons (e.g. human vs. mouse EST) can be used for gene searching. Databases of RNA-Seq reads, obtained on genomes of various species, can also be efficiently used for alignments of these transcripts to genomic sequences leading to gene finding Genome annotation strategies The programs for gene finding are generally classified as similarity-based and ab initio. Ab initio programs use both rulebased algorithms (searching for specific signals and/or contents, usually with some scoring) and probabilistic models. There is no ideal universal approach. Annotation strategies combine gene-finding algorithms with a processing of transcriptomics data, yielding elaborated genome annotation pipelines. Of course, the approaches for prokaryotic and eukaryotic genomes differ considerably. Prokaryotic genome annotation pipelines use the predictions of open reading frames by ab initio methods and alignments (in particular, by tblastn algorithm) of the putative ORFs to the previously known proteins. Comparative analysis of prokaryotic ORFs is very important, as the number of available prokaryotic genomes is exceeding a typical number of genes in a genome. Eukaryotic genome annotation pipelines (e.g. NCBI eukaryotic genome annotation pipeline, ENSEMBL genebuild), attempt to integrate maximum amount of relevant information on mrna transcripts, obtained in experiments and available in various databases and genome browsers. The transcripts are aligned to the annotated genome, what yields the positions of expressed exons in the genes. The annotation pipelines may also include gene prediction programs that combine similarity-based information with models yielded by ab initio algorithms. A number of additional procedures can be carried out such as data filtering, protein domain annotation, cross-referencing to external databases etc. It should be noted that large number of transcription products of human genome and other eukariotic genomes do not have protein-coding potential and may be classified into different categories. A number of known so-called non-coding RNAs (ncrnas), such as micrornas and long non-coding RNAs is growing. 20

21 5. Protein structure prediction The development of algorithms for protein structure prediction from sequence has relatively long history. Various secondary (2D) structure prediction algorithms consider the residues in a polypeptide chain to be in three (helix, strand, coil), four (helix, strand, coil, turn) or even eight states and try to predict the location of these states. The majority of methods for protein structure prediction are based on empirical (statistical) approaches. They try to extrapolate the statistics, revealed in the known structures, to other proteins. The most simple observation: - Ala, Gln, Leu and Met are commonly found in α helices; - Pro, Gly, Tyr and Ser usually are not; - Pro a helix-breaker (bulky ring prevents the formation of n/n+4 H-bonds) The algorithm of Chou and Fasman (1978) was a widely used approach in 1970 s-1980 s. It was based on the calculation of a moving average of values that indicated the probability (or propensity) of a residue type to adopt one of three structural states, α-helix, β-sheet and turn conformation. The probabilities were simply the frequencies of a given residue type to be observed in a particular secondary structure, normalized by the frequency expected by chance. Some additional heuristic rules were added that attempted to determine the exact ends of secondary structure elements. In the further developments of these ideas, the algorithm of Garnier et al. (1978) introduced an estimate of the effect that residues within the region eight residues N-terminal to eight residues C-terminal of a given position have on the structure of that position. For each of 20 amino acids, a profile (17 residues long) was defined that quantifies the contribution this residue makes towards the probabilities of other residues to be in one of four states, α-helix, β-sheet, turn and coil. Thus, for every residue the values from +8 positions are added (four values from each), so as four probability profiles are produced, and at any position the highest profile value predicts the structure. Further developments of prediction methods (until the early 1990s) were based on implementing various rules for pattern recognition, in particular, the hydrophobicity patterns, periodicity of helices, different ways for calculating propensities in windows of variable sizes (3-51 residues). However, it seemed that prediction accuracy of such approaches stalled at levels slightly above 60% (percentage of residues predicted correctly in one the three states: helix, strand, and other). Next generation of methods incorporated multiple alignments into predictions. They incorporated a concept that homologous proteins should have similar structures. For example, all naturally evolved proteins with more than 35% pairwise identical residues over more than 100 aligned residues have similar structures (Rost, 1999). Predictions were further improved by application of neural networks, trained on known structures. An efficient algorithm has been developed that combined multiple alignments with neural networks for secondary structure predictions (PSIPRED, Jones, 1999). At the first step, the algorithm uses iterations of PSI-BLAST so as to construct a profile (PSSM) corresponding to the query protein. This profile is used as the input to neural network, and the final output is a 3-state prediction (helix/strand/coil). Multiple algorithms for protein 3D structure prediction are developed. The modern approaches for protein tertiary structure prediction can be divided into three general strategies: (1) Comparative (homology) modeling. If there is a clear sequence homology between the target and one or more known structures, an algorithm tries to obtain the most accurate structural model for the target, consistent with the known set. (2) Fold recognition. Algorithms that try to recognize a known fold in a domain within the target protein. (3) Ab initio methods. Modeling of structures using energy calculations, considering both secondary and tertiary structures. There are overlaps between the methodologies. 21

22 5.1. Homology modeling Homology modeling approach assumes that a sequence similarity between a target protein and at least one related protein with known structure (the template) implies the 3D-similarity as well, and therefore allows one to extrapolate template structures to the target sequence. Main steps in comparative protein structure modeling 1. Identify related structures. 2. Select templates. 3. Align target with templates. 4. Build a model for the target (using information from template structures). 5. Evaluate the model. 6. If model is not satisfactory, repeat the steps 2-5 or 3-5. (Fiser et al., 2001) Identification of structures related to the target sequence is usually done by searching the database of known protein structures (PDB) using the target sequence as the query. For instance, a specific algorithm for template search in PDB is PDB-BLAST that not only builds a multiple alignment using target as a query, but also constructs similar multiple alignments using all found potential templates as queries (here each alignment is called sequence profile, and it can be converted into a matrix). The templates are found by comparing the target sequence profile with each of the template sequence profiles (e.g. by dynamic programming method). This allows to capture essential sequence motifs for a fold to be predicted. Selection of optimal templates is guided by several criteria, such as overall sequence similarity to the target and the quality of the experimentally determined structure. It is not necessary to select only one template: alignments of the target to different templates may be used for model building. These alignments may be refined after template selection. Various algorithms are used for model building, e.g. based on "rigid bodies" or spatial restraint satisfaction (based on core conserved structures obtained from aligned template structures). There are systems for (web-based) automated homology modeling that are able to predict the structures for one or many sequences without human intervention. For instance, SWISS-MODEL ( one of the first servers for protein structure predictions, initiated in 1993 and accessible via the ExPASy web server Fold recognition In case of relatively low sequence similarity of a target protein to the known structures, an attempt may be done to recognize a known fold within the target by a search for an optimal sequence-to-structure compatibility. Mostly this is done by threading algorithms: a target sequence is threaded through templates from the structure database and alternative sequence-structure alignments are scored according to some measure of compatibility between the target sequence with the template structures. The scoring is done using threading potentials, for instance so-called knowledge-based or mean-force potentials. These potentials do not consider physical nature of interactions in proteins and are derived from the statistics of the databases of known structures. Knowledge-based (database-derived) mean force potentials incorporate all forces (electrostatic, van der Waals etc) acting between atoms as well as the influence of the environment (solvent). For the interaction between two residues (a,b) with a sequence separation k and distance r between specified types of atoms (e.g. Cβ Cβ, Cβ N etc.) a general definition of the potential is ( inverse Boltzmann equation ) E ab k (r) = - RT ln [ f ab k (r) ], where the occurrence frequency f ab k (r) is obtained from a database of known structures. A definition of the reference state is very important. A convenient choice for the reference state is Ek (r) = - RT ln [ fk (r) ], where fk (r) is an average value over all residue types. Thus the potential for the specific interaction of residues is ΔE ab k (r) = E ab k (r) - Ek (r) = - RT ln [ f ab k (r) / fk (r) ] A solvation potential for an amino acid residue a is defined as: ΔE a solv (r) = - RT ln ( f a (r) / f(r)), where r is the degree of residue burial, f a (r) is the frequency of occurrence of residue a with burial r, f(r) is the frequency of occurrence of an arbitrary residue with burial r. 22

23 Obviously, threading algorithms are more complex than sequence-sequence alignments and require some approximations. For instance, in ungapped threading, the query sequence is mounted over an equally long part of template fold. The total alignment score is easily computed as sum of pairwise potentials for all query residues. Ungapped threading is mostly used not for real predictions, but rather for testing and adjusting the energy functions. Treatment of gaps in sequence-structure alignments presents a problem because a score that should be computed for a given residue in a query sequence, assumed to be aligned to a residue in a template structure, would depend not only on the type of these two residues (as in sequence-sequence alignments), but on the gaps that may be introduced at other alignment positions. Here usually a so-called frozen approximation is used. In frozen approximation, an appropriate comparison matrix of the size N M (N residues in the template and M in the query) is calculated by replacing the amino acids in the template structure with amino acids from the target sequence one at a time. The rest of the structure is kept intact, and it is assumed that the field created by the native protein will also favour the correct replacement. Although it is a very crude approximation, it is rather efficient, despite the fact that it does not solve the full threading problem. (Sippl, 1993) The subsequent steps are equivalent to sequence-sequence alignment: the scores in the comparison matrix may be used for calculating dynamic programming matrix leading to the final alignment. Often in fold recognition methods several scores are produced, related to different aspects of the sequence-structure alignment, some of the most important: initial sequence profile alignment score, number of aligned residues, length of target sequence, length of template sequence, pairwise energy sum, solvation energy sum. Each of these scores separately may be not sufficient, but they can be used to calculate some determinant score. 23

BLAST. compared with database sequences Sequences with many matches to high- scoring words are used for final alignments

BLAST. compared with database sequences Sequences with many matches to high- scoring words are used for final alignments BLAST 100 times faster than dynamic programming. Good for database searches. Derive a list of words of length w from query (e.g., 3 for protein, 11 for DNA) High-scoring words are compared with database