Computational Molecular Biology. Lecture Notes. by A.P. Gultyaev


Computational Molecular Biology Lecture Notes by A.P. Gultyaev Leiden Institute of Applied Computer Science (LIACS) Leiden University January

Contents

Introduction
Sequence databases
Nucleotide sequence databases
Protein sequence databases
Sequence alignment
Pairwise alignment
Sequence database similarity search
Multiple sequence alignment
Sequence motifs, patterns and profile analysis
Gene finding and genome annotation
Search by signal
Search by content
Probabilistic models
Similarity-based search
Genome annotation strategies
Protein structure prediction
Homology modeling
Fold recognition
Ab initio protein structure prediction
Combinations of approaches
Predictions of coiled coil domains and transmembrane segments
Inverse folding and protein design
RNA structure prediction
Thermodynamics of RNA folding and RNA structure predictions
Comparative RNA analysis
Detecting conserved structures in related RNA sequences
RNA motif search
Recommended reading
Exercises

Introduction

The course Computational Molecular Biology aims to familiarize participants with a number of widely used bioinformatic approaches. The purpose of the course is not to give a guided tour of the resources available on the Internet, such as databases and programs, but rather to convey the main principles behind efficient algorithms for genomic data analysis. The text below is only a set of lecture notes that may assist in understanding the material given in the lectures; it should not be considered an independent, comprehensive description. The material is accompanied by exercises with tasks to be carried out using a number of bioinformatic resources, in order to gain hands-on experience.

Basic Molecular Biology Introduction:

Biopolymers (DNA, RNA, proteins) are the main components of life on Earth. DNA and RNA molecules are polynucleotide chains built from four nucleotides; protein polypeptide chains contain 20 residue types (amino acid residues). The central "dogma" of Molecular Biology states that DNA is the carrier of genetic information, which is transcribed into messenger RNA (mRNA) molecules that are translated into proteins responsible for the functioning and reproduction of organisms.

Analysis of the monomer sequences of biopolymers is of great importance for understanding all living systems, from viruses to humans. Computational analysis of sequence data involves large databases and complex algorithms. Sequence analysis algorithms are developed for sequence comparison, recognition of functional sequence patterns and processing of experimental sequencing data. Sequence analysis is the core of bioinformatics.

In the majority of species (except some viruses), genes carrying genetic information are regions of double-helical DNA genomes. Such a genome contains two complementary antiparallel polynucleotide strands, each with a so-called 5'->3' polarity (following the nomenclature of atoms in the monomer units).
Each monomer unit (nucleotide) contains a phosphate, a sugar (deoxyribose) and one of the nucleotide bases: adenine (A), thymine (T), guanine (G) or cytosine (C). The complementarity is determined by Watson-Crick base-pairing in AT and GC pairs. Thus a double-helical DNA molecule is described by the sequence of one of its strands, e.g. TAGCGCAGGG... (in this case the complementary strand, also read 5'->3', is ...CCCTGCGCTA). The DNA replication mechanism ensures the transfer of genetic information upon reproduction; sequence changes that occur due to replication errors (mutations) may lead to disorders or to evolutionary development.

Transcription of DNA into RNA is carried out by the RNA polymerase complex, which uses one of the DNA strands as a template. The resulting RNA sequence is determined by the Watson-Crick complementarity rules and is therefore a copy of the DNA strand complementary to the template, except that thymine is replaced by another nucleotide base, uracil (U). Start and termination points of transcription are important elements of gene sequence analysis. Sequence signals that determine transcription starts are called promoters. Apart from a protein-coding region, which is translated into an amino acid sequence, an mRNA molecule contains two untranslated regions "before" (upstream of) and "after" (downstream of) the coding sequence. These regions are called the 5' untranslated region (5'UTR) and the 3'UTR, respectively.

Translation of a nucleotide sequence into an amino acid sequence occurs according to the genetic code, with amino acids encoded by triplets of nucleotides (codons). The genetic code is redundant: the majority of amino acids can be encoded by several different codons, which frequently differ only in the third, so-called "wobble", position. Start codons are ATG sequences (AUG in RNA, encoding the amino acid methionine); stop codons are TAA, TAG and TGA. The translated region between a start codon and a stop codon is called an open reading frame (ORF).
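The complementarity, transcription and translation rules described above can be illustrated with a short Python sketch. It is only an illustration under stated assumptions: the function names are invented for these notes, the codon table is the standard genetic code written in the usual compact T, C, A, G ordering, and the input is assumed to be an upper-case coding-strand DNA sequence without ambiguous symbols.

```python
from itertools import product

# Watson-Crick pairs: A-T and G-C.
COMPLEMENT = str.maketrans("ACGT", "TGCA")

# Standard genetic code; codons enumerated with bases in the order T, C, A, G.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def reverse_complement(dna):
    """Return the complementary strand, read in its own 5'->3' direction."""
    return dna.translate(COMPLEMENT)[::-1]


def transcribe(coding_strand):
    """mRNA has the sequence of the coding strand with U instead of T."""
    return coding_strand.replace("T", "U")


def translate(dna):
    """Translate codon by codon from position 0 until a stop codon ('*')."""
    protein = []
    for i in range(0, len(dna) - 2, 3):
        amino_acid = CODON_TABLE[dna[i:i + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)
```

For the fragment used in the text, `reverse_complement("TAGCGCAGGG")` gives the complementary strand quoted above.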
Due to the triplet coding, six reading frames are possible for any DNA sequence: three frames in each of the complementary strands. In prokaryotic organisms (those lacking a nucleus in their cells), an mRNA molecule may contain an operon consisting of several ORFs coding for several proteins. In eukaryotes, genes are more complex: mRNA molecules with ORFs are not transcribed directly, but are produced by processing of precursor pre-mRNA molecules. In this processing (RNA splicing), non-coding pre-mRNA regions (introns) are removed and the remaining regions (exons) are ligated, leading to mature mRNA molecules. A number of eukaryotic genes are characterised by alternative splicing, which leads to various exon combinations in mRNAs and, therefore, to different encoded proteins.

Biopolymer sequences contain multiple sequence patterns that regulate their functions. Furthermore, these functions depend on the 3D structures of the biopolymer molecules. Prediction and analysis of 3D structures on the basis of one-dimensional sequences is therefore also an essential part of bioinformatic sequence analysis.

1. Sequence databases

Databases of protein amino acid sequences appeared before nucleotide databases. The Atlas of Protein Sequences and Structures was published in [...]. The development of efficient DNA sequencing techniques resulted in an explosion of information on nucleotide sequences. In the early 1980s three major nucleotide sequence databases were created: the European EMBL nucleotide sequence database, the American GenBank and the Japanese DDBJ. In 1988 an agreement on a common format was reached. Nowadays, the three databases exchange all sequences. However, to prevent conflicting updates, an update of a sequence entry is performed only by the database that owns the entry, that is, the one that received the direct submission.

The main elementary unit of a sequence database is a flatfile representing a single sequence entry. Although modern sequence databases may be built on relational database management systems (DBMS), the flatfile remains a convenient format of exchange between the GenBank, EMBL and DDBJ databases. Some small (mainly syntactic) differences between the flatfiles of the three databases do exist, for instance, the line-type prefixes at every line of EMBL files. For an example of a sequence entry flatfile, see the Sample GenBank Record.

The three main parts of a GenBank entry are the header (description of the record), the features (annotations on the record) and the nucleotide sequence itself. These main parts contain various datafields. The header contains the description of the sequence, including data such as the name of the organism, gene name, sequence length, keywords and literature references. A very important part of the header is the identifiers that are used for searching the entries. Two identifiers are used: Accession and GI numbers. An accession number is unique for a sequence record. It does not change even if the information in the record is changed (e.g. by a sequence correction); instead, a new version number is assigned (U[...]; U[...]).
GI numbers (GenInfo identifiers) are unique for every sequence: if a sequence changes, a new GI number is assigned. Features describe the biological information relevant to the sequence (annotation), such as encoded proteins, RNA sequences and various signals such as protein binding sites, mRNA, exons and introns.

Many of the datafields in a GenBank sequence entry provide links to the relevant entries of other databases, such as literature (e.g. PubMed) or protein databases. These are the so-called hard links (between entries from different databases that share the same datafields) of the ENTREZ retrieval system. ENTREZ is maintained by the American National Center for Biotechnology Information (NCBI) and provides unified access to various databases and resources. In addition to the primary database of nucleotide sequences (to which data are submitted by the scientific community), ENTREZ provides access to various curated databases such as RefSeq (non-redundant reference sequences) and ENTREZ Gene (gene-centered information). ENTREZ has a unified query system for all databases, with useful options such as searching in specific datafields or using Boolean operators.

One of the oldest and most popular protein sequence databases is SWISS-PROT. It was created in 1987 in a collaboration between the University of Geneva and EMBL. The database is accessible via both the Entrez (NCBI) and the EMBL - European Bioinformatics Institute (EMBL-EBI) systems. It can be called a secondary (curated) database: it adds value to what is already present in a primary database (a nucleotide sequence database). Only a very small proportion of amino acid sequences is submitted directly to SWISS-PROT. The data in a SWISS-PROT sequence file may be divided into the core data and the annotation. The core contains the sequence (in single-letter code), bibliography and taxonomic information. The annotation includes the function(s) of the protein, domains, specific sites etc.
In addition to SWISS-PROT, the protein entries in the Entrez retrieval system have been compiled from several other databases, and also include translations of annotated coding regions in GenBank and RefSeq. ENTREZ Protein database entries provide hard links to relevant entries from other databases. Furthermore, the ENTREZ system provides links to protein sequences similar to the one contained in the accessed entry (the concept of neighboring entries in ENTREZ).

It should be noted that some protein-coding sequences (open reading frames, ORFs) may remain unnoticed (unannotated). A popular ENTREZ resource is the ORF Finder, which uses a relatively simple algorithm to search for potential ORFs in a given nucleotide sequence in all six reading frames (including the complementary strand). A similar tool is provided by EMBL-EBI: EMBOSS Transeq.
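The idea of scanning all six reading frames for potential ORFs can be sketched in a few lines of Python. This is not the algorithm of the ORF Finder or of EMBOSS Transeq, only a minimal illustration under simple assumptions: an ORF is taken to run from the first ATG in a frame to the next in-frame stop codon, and the names `orfs_in_frame`, `six_frame_orfs` and `min_codons` are hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def orfs_in_frame(seq, frame):
    """Yield ORFs (ATG ... stop codon, inclusive) found in one reading frame."""
    start = None
    for i in range(frame, len(seq) - 2, 3):
        codon = seq[i:i + 3]
        if start is None and codon == "ATG":
            start = i                      # first start codon opens an ORF
        elif start is not None and codon in STOP_CODONS:
            yield seq[start:i + 3]         # in-frame stop codon closes it
            start = None


def six_frame_orfs(dna, min_codons=2):
    """Return (strand, frame, ORF sequence) for 3 frames on each strand."""
    dna = dna.upper()
    strands = {"+": dna, "-": dna.translate(COMPLEMENT)[::-1]}
    found = []
    for strand, seq in strands.items():
        for frame in range(3):
            for orf in orfs_in_frame(seq, frame):
                if len(orf) // 3 >= min_codons:
                    found.append((strand, frame, orf))
    return found
```

Real ORF finders add options that are left out here, such as alternative start codons, other genetic codes and reporting of coordinates on the original strand.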

2. Sequence alignment

In general, alignment of sequences can be defined as a mapping between the residues of two or more sequences, that is, finding the residues that are related to each other for phylogenetic or functional reasons. One of the main goals of sequence alignment is to determine whether sequences display sufficient similarity to infer their homology. However, similarity is not equivalent to homology. Similarity (of sequences) is a quantity that can be measured, e.g. as percent identity. Homology means a common evolutionary history.

In principle, all alignment methods try to follow the molecular mechanisms of sequence evolution, namely substitutions, deletions and insertions of monomers:

    AGTCA             AGTCA          AGT-CA
    AGACA             AG-CA          AGTACA
    (substitution)    (deletion)     (insertion)

An alignment that spans complete sequences is called a global alignment; an alignment of sequence fragments is called a local alignment. Depending on the biological problem, either global or local alignment is more efficient in detecting a similarity between sequences. For instance, global alignments are advisable for relatively closely related sequences. Local alignments are a powerful tool to localize the most important regions of similarity, e.g. functional motifs, which may be flanked by variable regions that are less important for functioning. Local alignments should also be applied in cases of mosaic arrangement of protein domains and for comparisons between spliced mRNAs and genomic sequences.

Alignment tasks can be subdivided into pairwise alignment (two sequences), multiple alignment (more than two) and database similarity search (searching a database for sequences that are similar to a given sequence).

2.1. Pairwise alignment

For two sequences, many different alignments are possible (for instance, about [...] different alignments for two sequences of length 300).
An alignment with a relatively high number of matching residues in the two sequences and a relatively small number of deletions/insertions is the most likely to have biological meaning, because it assumes a smaller number of evolutionary events. The problem of finding such an alignment is equivalent to finding the best path in a dot matrix, where the two sequences are plotted along the coordinates of a two-dimensional graph and dots indicate matching residues. Qualitatively, the best global alignment is a path that visits as many dots as possible, with a preference for diagonal moves over horizontal or vertical moves (because the latter represent deletions/insertions in the sequences). Dot matrices are also very useful for visualizing alignments, especially after filtering out local regions of low similarity, for instance by leaving only diagonals of consecutive matching residues.

Dot matrix of all matching residues (Sequence 1 vertical, Sequence 2 horizontal):

       A G C T A G G A G
    A  .       .     .
    G    .       . .   .
    C      .
    G    .       . .   .
    G    .       . .   .
    A  .       .     .
    G    .       . .   .
    A  .       .     .
    G    .       . .   .

The same matrix after filtering out similarities of less than 3 consecutive residues:

       A G C T A G G A G
    A  .
    G    .
    C      .
    G            .
    G              .
    A                .
    G              .   .
    A                .
    G                  .

The example shown above suggests two alternative alignments:

    AGCTAGGAG          AGCTAGGAG--     seq.2
                 or
    AGCGGAGAG          AGC--GGAGAG     seq.1

The second alignment has more matching residues than the first one, but it contains deletions. The selection of the best alignment therefore depends on the scoring system for matches, mismatches and deletions/insertions. Thus, finding the optimal alignment requires: (1) a definition of scores for matches, mismatches and deletions/insertions; (2) an algorithm to find the best score.
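The dot matrix and the filtering step described above are easy to reproduce. The following Python sketch is one possible way to do it, not a reference implementation: the function names and the `min_run` parameter are introduced here for illustration, and the filter keeps only diagonal runs of at least `min_run` consecutive matches.

```python
def dot_matrix(seq1, seq2):
    """dots[i][j] is True when residue i of seq1 matches residue j of seq2."""
    return [[a == b for b in seq2] for a in seq1]


def filter_diagonals(dots, min_run=3):
    """Keep only dots that belong to a diagonal run of at least min_run matches."""
    n, m = len(dots), len(dots[0])
    kept = [[False] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            # Process a run only once, from the cell where the run starts.
            run_starts_here = dots[i][j] and not (i > 0 and j > 0 and dots[i - 1][j - 1])
            if not run_starts_here:
                continue
            length = 0
            while i + length < n and j + length < m and dots[i + length][j + length]:
                length += 1
            if length >= min_run:
                for k in range(length):
                    kept[i + k][j + k] = True
    return kept


def render(dots, seq1, seq2):
    """Plain-text picture: seq2 along the top, seq1 down the side."""
    lines = ["  " + " ".join(seq2)]
    for residue, row in zip(seq1, dots):
        cells = " ".join("." if dot else " " for dot in row)
        lines.append((residue + " " + cells).rstrip())
    return "\n".join(lines)
```

With `seq1 = "AGCGGAGAG"` and `seq2 = "AGCTAGGAG"`, `print(render(filter_diagonals(dot_matrix(seq1, seq2)), seq1, seq2))` leaves two runs on the main diagonal and one run of four matches shifted by two positions, which correspond to the two alternative alignments discussed above.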

2.1.1 Substitution matrices

A simple match/mismatch scoring (such as 1 for a match, 0 for a mismatch) is not the most effective, especially for proteins, where conservative substitutions between amino acids with similar properties are relatively frequent. Examples are substitutions of functionally important polar residues by other polar ones, or mutual substitutions of hydrophobic residues, e.g. Arg <-> Lys or Val <-> Ile. Therefore, the idea of a substitution matrix (Dayhoff et al., 1978) has been introduced. Such a matrix (dimensions 20 × 20, with diagonal symmetry) defines different scores for residue matches and for all possible substitutions. The score values can be estimated from previously aligned sequences with trusted alignments.

An efficient way to introduce such scores is a log-odds calculation: the score is proportional to the logarithm of the ratio of the observed substitution frequency (target frequency) to the background frequency expected by chance. Thus, the score for a substitution between two residues a and b is

$$ S(a,b) \propto \log \frac{p_{ab}^{\mathrm{observed}}}{p_{ab}^{\mathrm{background}}} = \log \frac{p_{ab}^{\mathrm{observed}}}{f_a f_b} $$

where $p_{ab}^{\mathrm{observed}}$ is the probability (frequency) of the a/b substitution observed in the alignments used, and $f_a$ and $f_b$ are the frequencies of occurrence of the amino acids a and b. The logarithm is used because adding such scores over the positions of any new alignment is then equivalent to estimating the likelihood of the hypothesis that the alignment reflects sequence homology and has a non-random pattern (this likelihood is equal to the product of the likelihoods of non-random aligning of all residue pairs, if the pairs are statistically independent).

The first widely used substitution matrices were based on the point-accepted-mutation (PAM) evolutionary model. In this model, a unit of protein divergence (1 PAM) was introduced, corresponding to a change of 1% of the amino acid sequence.
It is very important to note that a change of 100 PAM does not lead to a complete change of the sequence, because some residues may change several times while others are not substituted at all. Alignments of very closely related sequences (trusted alignments in any scoring system) were used to estimate the target frequencies corresponding to 1 PAM. These values can be extrapolated to more distant sequences, e.g. PAM250 (the matrix originally published by M. Dayhoff et al., 1978).

A different strategy was used for the so-called BLOSUM matrices. They were derived from alignments contained in BLOCKS, a database of local multiple alignments. Here the frequencies were computed directly from the alignments. In contrast to the PAM model, relatively distant sequences were considered, and statistical biases due to overrepresentation of some sequences were avoided. Thus, the BLOSUM62 matrix has been computed from sequences having at most 62% identity (sequences with higher identities were merged and counted as a single sequence). The performance of BLOSUM62 turned out to be very good, and it is the standard matrix in many protein alignment programs.

[Table: the BLOSUM62 scoring matrix, a lower-triangular 20 × 20 table with rows and columns in the order A R N D C Q E G H I L K M F P S T W Y V. Only the first entries are preserved: A/A = 4, R/A = -1, R/R = 5.]
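The log-odds calculation described in this section can be tried on a toy data set. The sketch below is only an illustration of the formula S(a,b) ~ log [p_ab / (f_a f_b)]; it is not the procedure used to build PAM or BLOSUM matrices. The function name, the use of base-2 logarithms and the list of aligned residue pairs are assumptions made for the example.

```python
import math
from collections import Counter


def log_odds_scores(aligned_pairs):
    """Log-odds score (in bits) for every residue pair seen in trusted alignments.

    aligned_pairs: iterable of (a, b) residues found in the same alignment column.
    """
    counts = Counter()
    for a, b in aligned_pairs:
        counts[(a, b)] += 1
        counts[(b, a)] += 1          # substitutions are treated as symmetric
    total = sum(counts.values())
    p_observed = {pair: c / total for pair, c in counts.items()}

    # Background frequency of each residue = marginal of the pair frequencies.
    f = Counter()
    for (a, _), p in p_observed.items():
        f[a] += p

    return {
        (a, b): math.log2(p / (f[a] * f[b]))
        for (a, b), p in p_observed.items()
    }


# Hypothetical counts: L and K are mostly conserved, rarely exchanged.
toy_pairs = [("L", "L")] * 4 + [("K", "K")] * 4 + [("L", "K")] * 2
toy_scores = log_odds_scores(toy_pairs)
```

In this toy example identities are observed more often than expected by chance and receive a positive score, while the L/K exchange is observed less often than expected and receives a negative score.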

For DNA, the 4 × 4 matrices are relatively simple; for instance, the BLASTN program uses a scoring system with a score of +2 for a match and -1 for a mismatch. Sometimes the difference between the more frequent purine-purine or pyrimidine-pyrimidine transitions (A <-> G, C <-> T) and the transversions (A <-> C, A <-> T, G <-> T etc.) is taken into account, as in the matrix used in BLASTZ, a program for comparisons of large genomic sequences.

[Table: the BLASTZ scoring matrix, a 4 × 4 table with rows and columns A, C, G, T; the score values are not preserved.]

2.1.2 Gap penalties

Locations of insertions/deletions in alignments are called gaps. In an attempt to minimize the number of gaps, negative scores (penalties) are introduced for them. The penalty values are subtracted from the scores calculated in the non-gapped regions of the alignment. The most common formula for a gap penalty is

$$ S(\mathrm{gap}) = G + L \cdot n $$

where G is the gap-opening penalty (also called gap existence cost), L is the gap-extension penalty (gap extension cost) and n is the number of residues in the gap (the gap length). There is no solid theory here, and the choice of values is highly empirical. In combination with the BLOSUM62 substitution matrix, values of G around [...] and L around 1-2 turned out to perform rather well.

2.1.3 Algorithms for finding the optimal alignment

The problem of finding the optimal alignment of two sequences (given a defined scoring system) has been solved using a dynamic programming algorithm. The dynamic programming approach is used in various complex optimization problems in which the solution of a problem can be obtained recursively from the solutions of smaller subproblems. In the case of the alignment problem, formulated as finding the best path in a matrix (using scores for matches/mismatches and gap penalties), a very important observation is that any partial subpath of the optimal path is itself the optimal path for the alignment of the corresponding two subsequences.
The main recursive definition for the alignment of two (sub)sequences x and y of lengths m and n residues is as follows: the optimal score S(m,n) is the best value out of three scores,

$$ S(m,n) = \max \{\, S(m-1,n-1) + s(x_m, y_n),\;\; S(m-1,n) + \gamma,\;\; S(m,n-1) + \gamma \,\} $$

where $s(x_m, y_n)$ is the score of aligning residues m and n, and $\gamma$ is the gap penalty, entered as a negative score. Using this definition it is possible both to build a matrix of the best scores for all possible subalignments (called the dynamic programming matrix) and to backtrack the optimal alignment of the full sequences. The dynamic programming matrix is filled starting from the smallest subsequences, using a bottom-up approach. After completing the matrix, it is possible to restore the sequence of calculations that has led to the final score.

An example (from S. Eddy, 2004): two DNA sequences X = TTCATA and Y = TGCTCGTA. The scores: +5 for a match, -2 for a mismatch, -6 for each insertion or deletion.

The dynamic programming matrix is filled as follows. The first row and the first column correspond to alignment to the initialising 0, that is, to gaps only, so each further cell in them adds one gap penalty (-6, -12, ...). The first cell computed from the recursion is S(1,1) = 5 (T-T match). The next values are, for example:

    S(2,2) = max (5-2; -1-6; -1-6) = 3
    S(2,3) = max (-1-2; 3-6; -7-6) = -3
    S(2,4) = max (-7+5; -3-6; -13-6) = -2    etc.
    S(3,3) = max (3+5; -3-6; -3-6) = 8
    S(3,4) = max (-3-2; 8-6; -2-6) = 2
    S(3,5) = max (-2+5; 2-6; -8-6) = 3    etc.

The complete dynamic programming matrix (rows: X, columns: Y):

             0    T    G    C    T    C    G    T    A
        0    0   -6  -12  -18  -24  -30  -36  -42  -48
        T   -6    5   -1   -7  -13  -19  -25  -31  -37
        T  -12   -1    3   -3   -2   -8  -14  -20  -26
        C  -18   -7   -3    8    2    3   -3   -9  -15
        A  -24  -13   -9    2    6    0    1   -5   -4
        T  -30  -19  -15   -4    7    4   -2    6    0
        A  -36  -25  -21  -10    1    5    2    0   11

The final best score S(6,8) = 11 has been computed as follows (backtracking):

    S(6,8) <- S(5,7) <- S(4,6) <- S(3,5) <- S(2,4) <- S(1,3) <- S(1,2) <- S(1,1)
         (+5)      (+5)      (-2)      (+5)      (+5)      (-6)      (-6)

The backtracking allows one to reconstruct the optimal alignment:

    T--TCATA    sequence X
    TGCTCGTA    sequence Y

Given a scoring system, such an algorithm is guaranteed to find the optimal global alignment of two sequences. The algorithm is often called the Needleman-Wunsch algorithm, after its authors (1970). An extension of this strategy, the Smith-Waterman algorithm (1981), has been developed to find the optimal local alignment. A locally optimal alignment is defined as one that cannot be improved either by extending or by shortening the aligned regions. Both global and local optimal alignments can be computed, for instance, by the tool align, using the algorithms needle and water.
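The worked example above can be checked with a compact implementation of the global alignment recursion. The Python sketch below uses the same simple scoring (one score per match, mismatch and gapped position, without separate gap-opening and gap-extension penalties); the function name and parameter names are chosen for this illustration, and ties in the traceback are resolved in favour of the diagonal move, which is one of several possible conventions.

```python
def needleman_wunsch(x, y, match=5, mismatch=-2, gap=-6):
    """Global alignment of x and y; returns (score, aligned x, aligned y)."""
    rows, cols = len(x) + 1, len(y) + 1

    # Fill the dynamic programming matrix bottom-up.
    S = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        S[i][0] = i * gap                  # x aligned to gaps only
    for j in range(1, cols):
        S[0][j] = j * gap                  # y aligned to gaps only
    for i in range(1, rows):
        for j in range(1, cols):
            pair = match if x[i - 1] == y[j - 1] else mismatch
            S[i][j] = max(S[i - 1][j - 1] + pair,   # align residues i and j
                          S[i - 1][j] + gap,        # gap in y
                          S[i][j - 1] + gap)        # gap in x

    # Backtrack from the last cell to recover one optimal alignment.
    aligned_x, aligned_y = [], []
    i, j = len(x), len(y)
    while i > 0 or j > 0:
        pair = match if i > 0 and j > 0 and x[i - 1] == y[j - 1] else mismatch
        if i > 0 and j > 0 and S[i][j] == S[i - 1][j - 1] + pair:
            aligned_x.append(x[i - 1]); aligned_y.append(y[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and S[i][j] == S[i - 1][j] + gap:
            aligned_x.append(x[i - 1]); aligned_y.append("-")
            i -= 1
        else:
            aligned_x.append("-"); aligned_y.append(y[j - 1])
            j -= 1
    return S[-1][-1], "".join(reversed(aligned_x)), "".join(reversed(aligned_y))
```

Calling `needleman_wunsch("TTCATA", "TGCTCGTA")` reproduces the score 11 and the alignment T--TCATA / TGCTCGTA of the example. Both time and memory grow as the product of the two sequence lengths, which is why this exact approach is not used directly for scanning whole databases.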

2.1.4 Estimates of statistical significance for alignments

Given a scoring system and a (dynamic programming) algorithm for optimal alignment, any two sequences can be aligned. However, there is no guarantee that the alignment is a consequence of homology between the sequences. Homology is likely if the obtained alignment would be very unlikely to arise by chance alone. Therefore, an estimate of statistical significance is very important.

For global alignments, there is no mathematical theory to estimate the score that can be expected by chance. The significance of a global alignment can be estimated by comparison with the statistics of alignments of randomised (permuted) sequences. For instance, in order to estimate the significance of the alignment of two sequences X and Y of lengths m and n residues, respectively, one can generate 100 sequences of m residues and 100 sequences of n residues by reshuffling the residues. It is important that these sequences have the same sizes and residue composition as the original ones, because the statistics depend on these parameters. Thus, 100 random alignments can be produced and their scores can be compared with the score of the real alignment. Still, a quantitative estimate of significance is not straightforward, because the distribution of scores of random alignments is not normal. Qualitatively, it is possible, for instance, to estimate the probability of getting a given alignment by chance (P-value) as less than 0.01 if all 100 random alignments yielded scores lower than that of the alignment of interest.

For ungapped local alignments the statistical estimates can be calculated directly. For two random sequences of (sufficiently large) lengths m and n, the probability of finding at least one gap-free alignment (called a high-scoring segment pair, HSP) with a score of at least S (P-value) is equal to

$$ P(S) = 1 - \exp\left[ -K m n \, e^{-\lambda S} \right] $$

where K and λ depend on the scoring rules and residue frequencies.
The formula is derived from the so-called extreme value distribution, which is valid for the scores of random HSPs. The expected number of HSPs with a score of at least S (E-value) is

$$ E(S) = K m n \, e^{-\lambda S} $$

For gapped alignments, a similar theory can be developed, but large-scale estimates for randomised sequences are necessary to calculate the parameters used in the analytical formulas.

The optimal alignment of two sequences (in terms of the scoring system) is not always the optimal one in terms of biological significance. Frequently, the outputs of standard alignment programs require further manual improvement. For instance, this can be done on the basis of known features in the sequences, such as codons or structural motifs.
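The two formulas for ungapped local alignments are simple enough to evaluate directly. The Python sketch below does just that; the function names are introduced here, and the values of K and λ in the usage example are purely illustrative assumptions, since the real parameters depend on the scoring system and the residue frequencies.

```python
import math


def expected_hsps(score, m, n, K, lam):
    """E-value: expected number of random HSPs with a score of at least `score`."""
    return K * m * n * math.exp(-lam * score)


def p_value(score, m, n, K, lam):
    """P-value: probability of at least one such HSP, P = 1 - exp(-E)."""
    # -expm1(-E) equals 1 - exp(-E) but stays accurate for very small E.
    return -math.expm1(-expected_hsps(score, m, n, K, lam))


# Illustrative numbers only: a query of 250 residues against 1,000,000
# residues, with assumed parameters K = 0.13 and lambda = 0.32.
example_E = expected_hsps(60, 250, 1_000_000, K=0.13, lam=0.32)
example_P = p_value(60, 250, 1_000_000, K=0.13, lam=0.32)
```

For small E the two quantities are nearly equal, which is why the E-value alone is usually sufficient to judge a hit.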

2.2. Sequence database similarity search

A very frequent problem is to search a database of sequences for entries that share some similarity with a given sequence (the query). A general approach is to align the query sequence to each subject sequence in the database, producing a list of so-called hits: sequences with relatively high similarity to the query. For a database search, a straightforward application of the optimal alignment methods (dynamic programming) is not efficient, because it would be too slow. Some approximations are necessary to speed up the database scan. One of the essential approximations is to break the sequences into short words (stretches of residues) and to look for words that are similar in the two sequences (the idea being that a true alignment of two sequences should contain at least one common word).

The first widely used program of this kind was FASTA (Pearson & Lipman, 1988). The main strategy of the algorithm is first to identify diagonal regions of high similarity according to a simple scoring for words, and then to extend and join these initial diagonals using a more accurate computation of scores. The main steps of FASTA are as follows (variations of the described strategy are possible):

(A) A search for words, for instance, with at least 4 consecutive matches. Any diagonal region in the dot matrix can then be scored using a formula that takes into account the number of words and the distances between them. The best (say, 10) diagonal regions are selected at this step.

(B) The best 10 regions are rescored using a substitution matrix, allowing conservative substitutions and shorter runs of identities to contribute to the similarity score. The subregions with maximal scores in each of the best diagonals are identified ("initial regions").

(C) The algorithm joins some of the compatible initial regions, computing an optimal combination.

(D) Within a band centered around the highest-scoring initial region, the alignment is recalculated using a dynamic programming algorithm.
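Step (A) of the strategy above can be sketched as a word lookup followed by counting of word hits per diagonal. The Python code below is a simplified illustration, not the FASTA implementation: the diagonal score is taken to be just the number of shared words (the real program also takes the distances between words into account), and the function names and the parameters `k` and `top` are introduced for this example.

```python
from collections import Counter, defaultdict


def word_hits_by_diagonal(query, subject, k=4):
    """Count identical words of length k on every diagonal (subject pos - query pos)."""
    # Index all words of the query by their start positions.
    index = defaultdict(list)
    for i in range(len(query) - k + 1):
        index[query[i:i + k]].append(i)

    # Scan the subject once; every shared word is a hit on diagonal j - i.
    diagonals = Counter()
    for j in range(len(subject) - k + 1):
        for i in index.get(subject[j:j + k], ()):
            diagonals[j - i] += 1
    return diagonals


def best_diagonals(query, subject, k=4, top=10):
    """Return the `top` diagonals with the most word hits as (diagonal, hits)."""
    return word_hits_by_diagonal(query, subject, k).most_common(top)
```

Because the query words are looked up in a table, the subject sequence is scanned in a single pass, which is what makes this step so much faster than filling a complete dynamic programming matrix.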

Currently, the most popular program for database similarity search is BLAST (Basic Local Alignment Search Tool). BLAST implements a strategy to find statistically reliable local alignments. The main features of this strategy are as follows (Altschul S.F. et al., 1990):

- Instead of looking for words with exact matches, the algorithm searches for words in a subject sequence that have a score of at least T (threshold) in comparison with a word from the query. Thus the word size W can be high without sacrificing sensitivity. Using the parameter T, BLAST minimizes the time spent on sequence regions whose similarity with the query has little chance of achieving a score of at least S. In the standard BLAST for nucleic acids, W=11. A faster algorithm is used in megablast, with a default of 28 nucleotides. The standard value for protein-protein BLAST is 3.

- When a word hit is found, BLAST extends this local alignment, thus yielding the maximal score possible for the region. The optimal extension is calculated by a dynamic programming algorithm. If the alignment extension decreases the highest score achieved so far by a value exceeding a "significance decay" parameter, the extension stops, and the highest score and the corresponding alignment are returned to the user.

- An important feature of BLAST is the use of mathematically tractable statistical estimates in order to derive the optimal parameters.

- The BLAST strategy admits numerous variations, in particular in the ways gaps are allowed in the extensions of hits. A "two-hit" approach in the initial "ungapped BLAST" step, followed by a gapped extension, turned out to be rather efficient. In the ungapped "two-hit" strategy, an initial search is carried out for two words within some distance of each other on a diagonal of the sequence space. If the ungapped extension of these two words along the diagonal scores better than a certain threshold, a further improvement by a gapped extension is searched for.
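The extension step, with its stop criterion based on the drop below the best score reached so far, can be illustrated for the ungapped case. The Python sketch below is not BLAST code: it assumes a simple nucleotide scoring of +2 for a match and -1 for a mismatch, and the function names and the `drop_off` parameter (playing the role of the "significance decay" parameter) are chosen for this example.

```python
def score_pair(a, b, match=2, mismatch=-1):
    """Simple nucleotide scoring: +2 for a match, -1 for a mismatch."""
    return match if a == b else mismatch


def extend(query, subject, q, s, step, drop_off):
    """Extend along a diagonal from (q, s); return (best gain, residues kept)."""
    best = total = 0
    best_len = length = 0
    while 0 <= q < len(query) and 0 <= s < len(subject):
        total += score_pair(query[q], subject[s])
        length += 1
        if total > best:
            best, best_len = total, length     # new maximum: remember its extent
        elif best - total > drop_off:
            break                              # fell too far below the maximum
        q += step
        s += step
    return best, best_len


def ungapped_hsp(query, subject, q_start, s_start, word_size, drop_off=5):
    """Extend a word hit in both directions; return (score, query part, subject part)."""
    seed = sum(score_pair(query[q_start + k], subject[s_start + k])
               for k in range(word_size))
    right, r_len = extend(query, subject, q_start + word_size,
                          s_start + word_size, +1, drop_off)
    left, l_len = extend(query, subject, q_start - 1, s_start - 1, -1, drop_off)
    q_from, q_to = q_start - l_len, q_start + word_size + r_len
    s_from, s_to = s_start - l_len, s_start + word_size + r_len
    return seed + left + right, query[q_from:q_to], subject[s_from:s_to]
```

The extension is trimmed back to the position where the maximal score was reached, so trailing mismatches that only lowered the score are not part of the reported segment pair.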
The example below shows one of the BLAST hits obtained with the sequence of human interferon-gamma (Accession NP_000610) as a query. (For simplicity, part of the output information is removed.) The output provides links to the subject sequence entries in the database and the statistical estimates. In a protein sequence search, a consensus line with identical residues and conservative substitutions (together defined as "positives") is also shown.

BLAST hit example (part of the output):

    interferon gamma precursor [Gallus gallus]
    Sequence ID: ref NP_[...]    Length: 164

    Score            Expect   Identities     Positives      Gaps
    75.1 bits(183)   4e-14    46/131(35%)    72/131(54%)    10/131(7%)

    Query 30   EAENLKKYFNAGHSDVADNGTLFLGILKNWKEESDRKIMQSQIVSFYFKLFKNFKDDQS
               LK FN+ HSDVAD G + + LKNW E ++++I+ SQIVS Y ++ +N +
    Sbjct 33   DIDKLKADFNSSHSDVADGGPIIVEKLKNWTERNEKRIILSQIVSMYLEMLENTDKSKPH 92

    Query      IQKSVETIKEDMNVKFFNSNKKKRDDFEKLTNYSVTDLNVQRKAIHELIQVMAELSP 145
               I + + T+K ++ KK D L + DL +QRKA +EL ++ +L
    Sbjct 93   IKHISEELYTLKNNL-----PDGVKKVKDIMDLAKLPMNDLRIQRKAANELFSILQKLVD 147

    Query 146  AAKTGKRKRSQ
               KRKRSQ
    Sbjct 148  PP-SFKRKRSQ

Different BLAST programs have been developed for various tasks. In addition to straightforward protein/protein and nucleotide/nucleotide alignments, BLAST provides options to compare the possible proteins encoded by nucleotide sequences. Such options are very useful, for instance, for finding unannotated genes in the database (using a protein query to find nucleotide sequences that might encode similar proteins). The main BLAST programs are as follows:

- blastn: nucleotide query - nucleotide database
- blastp: protein query - protein database
- blastx: nucleotide query translated - protein database
- tblastn: protein query - nucleotide database translated
- tblastx: nucleotide query translated - nucleotide database translated

It is possible to limit the search to some part of the database: for instance, blasting against the human genome, an EST database etc. A special option to align two sequences is also available.

Statistical estimates are extremely important for database similarity search, especially due to the enormous sizes of modern databases. Every BLAST hit is accompanied by such estimates. The E-value is the most informative one: it gives the number of alignments with scores better than or equal to the given one that are expected by chance alone (significant alignments should have a rather low E-value). The alignment scores themselves cannot be directly used for significance evaluation; a so-called "bit score" is more informative because it does not depend on the scoring system. For a local pairwise alignment, its E-value and bit score S' are linked by the expression

$$ E = m n \, 2^{-S'} $$

where m and n are the lengths of the (sufficiently large) sequences (or of a sequence and a database).

BLAST searches may be optimized by the user for different levels of expected sequence similarity. For instance, highly similar nucleotide sequences are best searched for using the so-called megablast option. In megablast, relatively large word sizes W are used (default: W=28).
More dissimilar sequences may be searched for with the discontinuous megablast algorithm (e.g. W=11 or W=12). The blastn option is intended for less similar sequences, with W=11 (default) or even W=7. Decreasing the word size makes the algorithm slower, but allows more distant similarities to be retrieved. Furthermore, an option to automatically adjust the parameters for short input sequences may be used.

BLAST is a very widely used program: (hundreds of) thousands of queries per day. It is accessible via the ENTREZ system. If a query is taken from the nucleotide sequence database, submitting its accession number in the BLAST window is sufficient. Otherwise, sequences can be submitted in the so-called Fasta format, a standard in many bioinformatic programs. In Fasta format, the first (header) line starts with the right angle bracket sign ">" followed by some identifiers; the subsequent lines should contain only the sequence itself.

Fasta format examples:

    >xxx protein name
    MTINFEKLT...

    >xxx nucleic acid sequence name
    agctaggtcaatc...

Fasta format files may also include multiple sequences written one after another, each sequence starting after its own header line. Such files can be used in tasks that involve more than one sequence.
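Reading the Fasta format described above takes only a few lines. The following Python sketch is a minimal reader written for these notes (the name `parse_fasta` is not taken from any library); it assumes that header lines start with ">" and that everything up to the next header belongs to the sequence.

```python
def parse_fasta(text):
    """Return a list of (header, sequence) pairs from Fasta-formatted text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue                          # ignore blank lines
        if line.startswith(">"):
            if header is not None:            # close the previous record
                records.append((header, "".join(chunks)))
            header, chunks = line[1:].strip(), []
        elif header is not None:
            chunks.append(line)               # sequence may span several lines
    if header is not None:
        records.append((header, "".join(chunks)))
    return records
```

A file with several sequences is simply the concatenation of such records, so the same function can be used to read the input for pairwise or multiple alignment tasks.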

2.3. Multiple sequence alignment

Progressive multiple sequence alignment

Multiple alignment of many sequences is a significantly more complicated task than pairwise alignment. A direct application of the accurate dynamic programming procedure is computationally very demanding and is limited to relatively small numbers of relatively short sequences. The commonly used approach is progressive multiple sequence alignment, based on pairwise alignment of the most similar sequences with gradual addition of the more distant ones. The most widely used is the CLUSTAL series of programs.

The progressive alignment procedure is divided into three main parts:
1. All possible pairwise alignments between the sequences in a given set are computed. These pairwise alignments are used to calculate a distance matrix (pairs of sequences with high similarity scores are assumed to be characterised by short evolutionary distances).
2. A guide tree is calculated from the distance matrix. The branching order of the guide tree, similar to evolutionary trees, follows the extent of relatedness of the sequences.
3. The sequences are progressively aligned according to the tree branching order, starting from the closely related sequences. At every step a pairwise alignment is used to align two (clusters of) sequences belonging to some node in the tree. Alignment of sequence clusters differs from simple pairwise sequence alignment only in that substitution scores are calculated from many residues rather than from two (see also below). Within an already built cluster, the positions of the sequences are not changed, so that the gaps appearing during alignment to another cluster are introduced simultaneously in all cluster members.

Various modifications are possible in every part of the algorithm, for instance different ways to convert similarity scores to distances, different clustering techniques to build the tree, different approaches to construct consensus sequences etc.

Guide tree.
The first versions of CLUSTAL used the UPGMA procedure (unweighted pair group method with arithmetic averages) to build the guide tree. This procedure is very simple from both the conceptual and the computational point of view. In this method, the first cluster is formed between the most closely related sequences (say, seq1 and seq2). The distance between this cluster and any other sequence seqx is calculated as the arithmetic average

d(seqx, new cluster) = [ d(seqx, seq1) + d(seqx, seq2) ] / 2.

Now a new distance matrix can be calculated and the closest sequences (or clusters) selected. The procedure is repeated, so that at every step the distance between any two clusters is calculated as the arithmetic average over all pairs of sequences from these clusters.

In CLUSTAL W and subsequent packages, the neighbour-joining method (NJ) has been implemented. This method estimates the divergence along each branch of the tree. It builds unrooted trees by a star-decomposition procedure. At every step the decision to join two sequences (or clusters) is made using a modified distance matrix that also takes into account the divergence of these two objects from the others. The root of the tree is found by a mid-point method: finding the point where the means of the branch lengths on the two sides of the root are equal.

Sequence weights were introduced in CLUSTAL W in order to compensate for unequal representation of some motifs in the dataset: for instance, the presence of a subgroup of very similar sequences while another subgroup has few representatives. The weights up-weight the divergent sequences and down-weight those that are very similar to other sequences. The weight of a sequence is calculated by adding the lengths of the branches in the tree between the sequence and the root, each divided by the number of sequences sharing that branch, as shown in the example below.

[Figure: guide tree for five sequences (seq.1 to seq.5) with branch lengths and the resulting weights W(1) to W(5). Most numerical values were lost in transcription; the surviving fragments are a branch length of 0.08, the weight W(4) = 0.226, and the partial sums (0.2/2) + (0.1/4) for W(1) and (0.3/2) + (0.1/4) for W(2).]
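The UPGMA procedure described earlier in this section can be sketched in a few lines. This is only an illustration under simple assumptions: the function name `upgma`, the nested-tuple representation of the tree and the toy distances are hypothetical, and branch lengths are not computed.

```python
from itertools import combinations


def upgma(names, dist):
    """Build a guide tree (nested tuples) by UPGMA.

    dist maps frozenset({a, b}) -> distance between sequences a and b.
    """
    # Each cluster is a pair: (subtree, list of member sequence names)
    clusters = [(name, [name]) for name in names]

    def average_distance(c1, c2):
        # Arithmetic average over all pairs of sequences from the two clusters
        pairs = [(a, b) for a in c1[1] for b in c2[1]]
        return sum(dist[frozenset(p)] for p in pairs) / len(pairs)

    while len(clusters) > 1:
        # Select the two closest clusters
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda ij: average_distance(clusters[ij[0]], clusters[ij[1]]),
        )
        merged = ((clusters[i][0], clusters[j][0]), clusters[i][1] + clusters[j][1])
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return clusters[0][0]


# Toy distance matrix (hypothetical values)
toy = {
    frozenset("AB"): 2, frozenset("AC"): 6, frozenset("AD"): 10,
    frozenset("BC"): 6, frozenset("BD"): 10, frozenset("CD"): 10,
}
tree = upgma(["A", "B", "C", "D"], toy)
```

With these toy distances A and B are joined first, then C is added to the (A, B) cluster, and D joins last.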

The weighting is used for scoring the matching of residues in the alignment process. For a given position in the alignment, the score is a weighted average of values M from a substitution matrix. If clusters of sequences are considered, the sequences from one cluster are compared to the sequences from the other. For instance, assume a cluster of 4 sequences is aligned to a cluster of 2. The score at any position of the alignment is calculated as a weighted average of 4x2=8 matrix values:

seq.1 PEEKSAVTAL
seq.2 GEEKAAVLAL
seq.3 PADKTNVKAA
seq.4 AADKTNVKAA

seq.5 EGEWQLVLHV
seq.6 AAEKTKIRSA

Score = M(T,V)*W(1)*W(5) + M(T,I)*W(1)*W(6)
      + M(L,V)*W(2)*W(5) + M(L,I)*W(2)*W(6)
      + M(K,V)*W(3)*W(5) + M(K,I)*W(3)*W(6)
      + M(K,V)*W(4)*W(5) + M(K,I)*W(4)*W(6)

In order to optimize the performance of the multiple alignment procedure for sequences of various degrees of similarity, some flexible definitions of scoring are very useful, such as position-specific gap penalties and variable substitution matrices. For instance, the gap penalty at some position may be decreased if the previous alignment steps have already identified a gap at this position. The choice of the matrix may be, for instance, BLOSUM80 for 80-100% similarities, BLOSUM62 for 60-80%, BLOSUM45 for 30-60% and BLOSUM30 for less than 30%. The choice of a suitable matrix is made automatically at each step of the multiple alignment.

Multiple alignment quality may be estimated in terms of the sequence similarity derived from the alignment. Different scoring systems, so-called objective functions, can be introduced. For instance, for each position (column) of the alignment a quality score may be calculated as, e.g., the sum-of-pairs similarity score (according to a substitution matrix), and the total alignment score is computed as the sum over all columns.
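The sum-of-pairs objective function mentioned above can be sketched as follows. The scoring function `toy_score` is a hypothetical stand-in for a real substitution matrix with gap penalties, chosen only to keep the example self-contained.

```python
from itertools import combinations


def toy_score(a, b):
    """Hypothetical scoring: +1 match, -1 mismatch, -2 residue vs gap, 0 gap vs gap."""
    if a == "-" and b == "-":
        return 0
    if a == "-" or b == "-":
        return -2
    return 1 if a == b else -1


def sum_of_pairs(alignment, score=toy_score):
    """Total alignment score: sum over columns of all pairwise scores in the column."""
    total = 0
    for column in zip(*alignment):  # iterate over alignment columns
        total += sum(score(a, b) for a, b in combinations(column, 2))
    return total


alignment = ["ACG-", "ACGT", "AGGT"]
sp = sum_of_pairs(alignment)
```

For this toy alignment the four columns contribute 3, -1, 3 and -3, giving a total of 2. In practice the pairwise scores would come from a matrix such as BLOSUM62.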
One of the serious problems in straightforward progressive multiple alignment is the local maximum problem: due to the greedy nature of the alignment strategy, there is no guarantee that the globally optimal solution will be found. Any misaligned region produced early in the alignment process cannot be corrected later, when new information from other sequences is added. In particular, an incorrect branching order of the initial tree may significantly contribute to the problem.

Alternatives to the progressive alignment strategy

Some modifications of the progressive alignment strategy can improve the algorithm performance. For instance, iterative procedures may refine and improve the initial alignments and the tree: the estimates of the scores in the intermediate alignments allow the algorithm to make the alignments and the tree mutually consistent, thus improving the accuracy of the final result. For instance, a guide tree can be built using so-called k-mer (or k-tuple) pairwise distances (based on the common subsequences of length k in two sequences) and used for progressive alignment. In turn, this progressive alignment can be used to calculate pairwise distances yielding another tree, which could yield a better alignment.

It is possible to generate alternative alignments that can be evaluated by an objective function and improved using some algorithm for alignment modification. For instance, the algorithm of the T-Coffee package (Notredame et al., 2000) begins with a library of alignments: for each pair of sequences in the dataset, both the best global alignment and the 10 best local alignments are computed. In the second step, a multiple sequence alignment is assembled, with the criterion of being maximally consistent with the library alignments.

In recent years several algorithms have been developed in order to handle large alignments.
Progressive alignments that are based on all pairwise distances become impractical for large numbers of sequences (time and memory requirements grow quadratically). It is possible to make guide tree construction faster by selecting a set of seed sequences and calculating the distances of all sequences to the seeds only. This strategy is used in Clustal Omega, which is able to construct accurate alignments for large numbers of sequences. Clustal Omega can use both k-mer distances and distances based on pairwise alignments (so-called Kimura distances). Clustal Omega also exploits clustering of sequences to construct subclusters that allow the calculation of profiles, which describe the alignments within subclusters by probabilities of substitutions, deletions and insertions. The profiles can be aligned to each other (see also next chapter), and can be converted into so-called hidden Markov models (HMMs), as is done e.g. in Clustal Omega. Alignments of HMMs can be computed efficiently, yielding the full alignment.

Finally, it is important to mention that there is no precise definition of the optimal alignment and there is no guarantee that the best solution is found by any of the available programs. Therefore, inspection of multiple alignments is a must. Multiple alignment packages usually include special editing tools to make changes manually.
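The k-mer (k-tuple) distances mentioned in this section avoid computing full pairwise alignments. A minimal sketch is given below; it assumes a simple set-based measure (one minus the fraction of shared k-mers), which is an illustrative choice and not necessarily the exact formula used by any particular package. The function names are hypothetical.

```python
def kmers(seq, k):
    """Set of all subsequences of length k in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def kmer_distance(s1, s2, k=3):
    """Assumed measure: 1 - (shared k-mers) / (all distinct k-mers in both sequences)."""
    a, b = kmers(s1, k), kmers(s2, k)
    if not a or not b:
        return 1.0  # a sequence shorter than k shares nothing
    return 1.0 - len(a & b) / len(a | b)


d = kmer_distance("ACGTAC", "ACGTTC", k=3)
```

In this example the two sequences share 2 of 6 distinct 3-mers, so the distance is 2/3. Distances of this kind can fill the distance matrix used for guide tree construction.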

3. Sequence motifs, patterns and profile analysis

A sequence motif is a conserved stretch of residues that confers a specific function and/or structure on a biopolymer. Inferring a function from a (new) sequence is often based upon comparative sequence analysis with identification of (partial) sequence similarities. Instead of aligning a sequence against many entries of a database (to find a functional similarity), a library of functionally important motifs can be defined and used for similarity search. For instance, a consensus sequence can be determined as the one that has all the absolutely conserved residues, with arbitrary residues at the positions where some variability is observed:

KSQETPAR   protein 1 fragment
KSQETQAR   protein 2 fragment
KVPEVQGR   protein 3 fragment
K..E...R   consensus

Some consensus sequences may have degenerate positions. Such motifs can be defined using motif descriptors like generalized profiles:

C-x(4,6)-[FYH]-x(5,10)-C-x(0,2)-C-x(2,3)-C   (profile or pattern)
C---RKNQ--Y-RHYWSENLFQ-C---FN---C---SL---C   (sequence)

Sometimes a pattern is rather weak:

[DNSTAGC]-[GSTAPIMVQH]-x(2)-G-[DE]-S-G-[GS]-[SAPHV]-[LIVMFYWH]-[LIVMFYSTANQH]

(an example of a pattern from the PROSITE database that will match any sequence which contains D, N, S, T, A, G or C followed by G, S, T, A, P, I, M, V, Q or H followed by 2 residues followed by G etc.)

Positions with ambiguities can also be indicated with curly brackets: e.g. {AP} means any amino acid except A and P.

The information contained in the alignment of the sequences constituting a pattern (consensus) can be converted into a scoring scheme based on a position-specific scoring matrix (PSSM). A PSSM (also called a profile) specifies the scores for finding each of the 20 amino acid residues at particular positions of a target sequence. For proteins, a PSSM has dimensions 21xL, where L is the motif size, to include possible gaps.
Of course, a similar approach can be applied to nucleic acids, in searching for protein binding sites in DNA, specific RNA motifs etc. PSSMs are also used in gene-finding algorithms (see chapter 4).

The approach was first formulated by Gribskov et al. (1987). In this work, the PSSM profiles were calculated on the basis of alignments called probes. A modification of the dynamic programming algorithm was also designed to compare any sequence to the profile (compared to the standard alignment program, the major difference is simply that the scores are taken from the appropriate PSSM values instead of a residue-residue substitution matrix). It has been shown that this approach is very effective both in determining whether a single sequence satisfies a motif and in retrieving such sequences from a database. Many modern algorithms exploit a similar approach. Frequently the scores in the PSSM are taken as proportional to the logarithms of the frequencies of finding particular residues at particular positions, normalized by background frequencies: position weight matrices (PWMs).
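The log-frequency scoring described above can be sketched for a small set of aligned sites. This is a minimal illustration under stated assumptions: the function names, the pseudocount of 1, the uniform background and the example sites are all hypothetical choices, and gaps are not modelled.

```python
import math
from collections import Counter


def build_pwm(sites, alphabet="ACGT", background=None, pseudocount=1.0):
    """Log2-odds position weight matrix from equal-length aligned sites."""
    if background is None:
        # Assumption: uniform background frequencies
        background = {c: 1.0 / len(alphabet) for c in alphabet}
    total = len(sites) + pseudocount * len(alphabet)
    pwm = []
    for pos in range(len(sites[0])):
        counts = Counter(site[pos] for site in sites)
        # Pseudocounts avoid log(0) for residues never seen at this position
        pwm.append({
            c: math.log2(((counts[c] + pseudocount) / total) / background[c])
            for c in alphabet
        })
    return pwm


def score(pwm, seq):
    """Sum of position-specific scores for a sequence of the motif length."""
    return sum(column[c] for column, c in zip(pwm, seq))


# Hypothetical aligned sites (illustration only)
sites = ["TATAAT", "TATGAT", "TACAAT", "TATAAT"]
pwm = build_pwm(sites)
```

A candidate site is then scored by summing one matrix entry per position, so sequences close to the consensus receive high scores and unrelated sequences receive low or negative ones.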

[Table: example of a profile generated by Gribskov's approach for a 49-residue pattern. The table lists the pattern positions with the aligned sequences of the "probe", and for each position the profile values for the residues A C D E F G H I K L M N P Q R S T V W Y together with the gap penalties (Δ). The numerical values were lost in transcription.]

Based on the Gribskov approach and related algorithms, various programs for the construction of profiles and for searching a profile library with a sequence query have been developed. Also, various collections of protein profiles are stored in specialized databases such as PROSITE, Pfam and PRINTS. Being curated databases, these collections have slightly different focuses.

PROSITE was one of the first databases of biologically significant patterns and is still one of the most widely used. In PROSITE, each entry is a concise description of the protein family or domain as well as a summary of the reasons leading to the development of the pattern or profile. A very tight relationship exists with the SWISS-PROT protein sequence data bank. Updates of PROSITE and of the annotations of the relevant SWISS-PROT entries are often done in parallel. Many programs have been developed by different groups to search PROSITE. The InterPro database is an integrated documentation resource with access to several databases of patterns and a combined search program, InterProScan.
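PROSITE-style patterns of the kind shown earlier in this chapter map naturally onto regular expressions. The sketch below assumes only the syntax elements used in these notes (`x` for any residue, square brackets for allowed residues, curly brackets for excluded residues, parentheses for repeat counts, plus the `<` and `>` terminal anchors); the function name `prosite_to_regex` is hypothetical.

```python
import re


def prosite_to_regex(pattern):
    """Translate a PROSITE-style pattern into a Python regular expression."""
    parts = []
    for element in pattern.rstrip(".").split("-"):
        # {AP} (any residue except A, P) -> [^AP]; must precede the next step
        element = element.replace("{", "[^").replace("}", "]")
        # x(4,6) -> x{4,6}
        element = element.replace("(", "{").replace(")", "}")
        # x (any residue) -> .
        element = element.replace("x", ".")
        # N-terminal and C-terminal anchors
        element = element.replace("<", "^").replace(">", "$")
        parts.append(element)
    return "".join(parts)


regex = prosite_to_regex("C-x(4,6)-[FYH]-x(5,10)-C-x(0,2)-C-x(2,3)-C")
# The example sequence from the text, with the alignment dashes removed
match = re.fullmatch(regex, "CRKNQYRHYWSENLFQCFNCSLC")
```

The pattern from the text becomes `C.{4,6}[FYH].{5,10}C.{0,2}C.{2,3}C`, and `re.search` with this expression could be used to scan longer protein sequences.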

PSI-BLAST: a new generation of protein database search programs

An approach based on the iterated construction of profiles and their use for searching the protein sequence database resulted in a very sensitive algorithm for finding weak but biologically significant similarities. The algorithm, called Position-Specific Iterated BLAST (PSI-BLAST), is based on repeated cycles of combining the hits obtained by a BLAST search into a multiple alignment with the query as an anchor, followed by construction of the corresponding PSSM and the use of this PSSM as a query. As a result, a PSI-BLAST search may retrieve more distantly related sequences than the standard BLAST, because the search becomes directed towards a (presumably) functionally important motif.

A simplified description of the approach (Altschul et al., 1997):
- Collection of all database sequence segments aligned to the query with an E-value below a threshold (default 0.01).
- Multiple alignment M. Alignment columns involving gaps inserted into the query are ignored, so M has the same length as the query.
- The matrix scores for a given alignment column are calculated, using sequence weighting to prevent a large set of closely related sequences from outvoting a small number of divergent ones.
- A position-specific score matrix (PSSM) is constructed.
- The PSSM is submitted to BLAST as a query (only minor modifications to the code as compared with single sequence querying).
- Multiple alignment M etc.

Some related programs have been developed, for instance IMPALA (Integrating Matrix Profiles And Local Alignments): a software package for searching a database of (PSI-BLAST-generated) PSSMs using a protein sequence as the query. Furthermore, a search in a database of profiles is included in the current standard BLAST protein-protein program: the detection of a known motif (if found) is always displayed before the list of similarity hits.

4. Gene finding and genome annotation

Identification of gene positions in a genome is a complex task. The pipelines for analysis (annotation) of genomic sequences include multiple algorithms, based on various features of coding regions.

Prokaryotes and eukaryotes present different problems for gene finding. The mechanisms of translation in these organisms are different. Prokaryotic mRNAs may contain several genes, while eukaryotic mRNAs are nearly always monocistronic. Eukaryotic genes contain introns while prokaryotic ones do not. Transcription of eukaryotic mRNAs depends on specific transcription factors, which recognize different motifs in gene promoters.

The algorithms for identifying protein-coding regions in nucleic acid sequences can be classified into three main categories:
- search by signal (search for a specific pattern or consensus);
- search by content (constraints upon the sequence due to the coding);
- similarity-based search (similarities to known genes and/or RNA transcripts).

Finding genes for RNA molecules that do not code for proteins (non-coding RNAs) is based on the identification of specific motifs and/or RNA structures.

4.1. Search by signal

In prokaryotes, very few promoter sequences (so-called "-10 regions") have the exact consensus sequence. A matrix evaluation can be used to search for potential sites. The position-specific score matrices (PSSMs) or position weight matrices (PWMs) [see above] can be based on collections of known promoters. For instance, for the consensus promoter sequence in E. coli the following PWM can be constructed:

[Table: PWM with rows A, C, G, T and one column per promoter position (Stormo, 2000); the matrix values were lost in transcription.]

For the consensus sequence TATAAT the score is the sum of the numbers shown in bold, 85. However, the choice of a threshold for scoring sequences according to such a matrix is not a trivial issue. Many promoters deviate significantly from the consensus, while the consensus may occur outside promoter regions as well.
Because of multicistronic operons, the search for translation initiation signals in prokaryotes is more important than the search for promoters (in contrast to eukaryotes). In addition to the start codon ATG, a ribosome binding site (RBS, Shine-Dalgarno site) should be searched for. It is located 8-9 nucleotides upstream of the start codon.

[Figure: sequence-logo representation of the consensus derived from a dataset of E. coli initiation regions (Shultzaberger et al., 2001).]

In eukaryotes there is no Shine-Dalgarno sequence. The scanning model of the ribosome is usually valid, so if the 5' end of the mRNA is known, the first good ATG in the 3' direction is normally the start codon. The consensus requirements are simpler than in prokaryotes: the so-called Kozak consensus of a strong start is AccATGg.

For the identification of eukaryotic promoters, important signals are the binding sites of various transcription factors. The PSSMs describing these sites are collected in specific databases. Frequently these matrices are used for searching in combination with other features (see also below).

Consensus nucleotides near exon-intron and intron-exon junctions can be used to search for the positions of introns in eukaryotic genes. However, the matrices based on these consensus sequences are not very specific.

A search for signals can fail in many cases, because these signals are not always present or may occur in another context. Thus both false-positive and false-negative results can be expected. Therefore, many algorithms implement a search by content.

4.2. Search by content

Search by content is based on the fact that coding for a protein puts constraints on the nucleotide sequence. The most obvious constraint is that a coding sequence should not have a stop codon within it. In a random sequence, the probability of any codon being followed by an open reading frame (ORF) of over 50 codons is less than 10%, and of over 100 codons less than 1%. Thus, since most proteins are longer than 100 amino acids, searching for long ORFs is an effective way of finding genes in prokaryotes:

AUG ... (long ORF) ... UAG

In genomes with high G-C content the reliability of searching for long ORFs decreases, because the probability of randomly encountering a stop codon (in any frame) is lower. In eukaryotes, ORF searches are not reliable due to the presence of introns and relatively short exon lengths.

Searching for ORFs can be made more specific by taking into account amino acid occurrence and codon preferences. It turns out that proteins with different functions and from all species share a typical average amino acid composition. This can be used to rate ORFs on their likelihood of being translated. Also, some codons are more frequent than others, and this may be used as well. Some other biases are observed in sequences, and these can also be used to measure the probability that a given sequence is protein-encoding. Among the most important features, collectively called coding measures, are:
- ORF measure (ORF length);
- amino acid frequencies;
- codon frequencies;
- frequencies of bases at certain codon positions (composition measure);
- hexamer measure (e.g. the frequency of CAGCAG is relatively high in exons).

One of the efficient ways to combine coding measures is to define a single value called a discriminant. A discriminant analysis can be done by categorizing samples (in this case, sequences) into two classes and finding the best discriminant that separates the classes.
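The ORF measure from the list above can be illustrated with a short sketch. It assumes a DNA sequence, the standard stop codons, and scanning of the three forward frames only; the function names and the toy sequence are hypothetical. The helper `prob_no_stop` reproduces the random-sequence estimate under the additional assumption of equal, independent nucleotide frequencies (61 of the 64 codons are not stop codons).

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(dna, min_codons=100):
    """Return (start, end) of ORFs on the forward strand; end is exclusive
    and includes the stop codon. ORFs shorter than min_codons are skipped."""
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # first start codon opens the ORF
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))
                start = None
    return orfs


def prob_no_stop(n_codons):
    """Probability that n random codons contain no stop codon (uniform model)."""
    return (61 / 64) ** n_codons


toy_dna = "CC" + "ATG" + "GCT" * 5 + "TAA" + "GG"
toy_orfs = find_orfs(toy_dna, min_codons=3)
```

Under the uniform model `prob_no_stop(50)` is about 0.09 and `prob_no_stop(100)` is about 0.008, consistent with the thresholds quoted in the text. A complete search would also scan the reverse complement strand.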
It is very important to note that many coding measures are organism-specific. Therefore many computer programs are specifically designed (trained) to predict protein-coding regions in particular species and are not applicable to other organisms.

Modern programs frequently combine the search by signal and the search by content into a single algorithm. For instance, a search for promoters may score the nucleotide sequences using both the scores defined by some transcription-factor PSSM and the values derived from certain promoter-specific measures (e.g. frequencies of some oligonucleotides). In particular, vertebrate promoter regions frequently have a higher content of CpG dinucleotides (CpG islands) than non-promoter regions. The search for promoters can be significantly enhanced by comparative analysis (see also below) in various species and by gene expression data.

4.3. Probabilistic models

The basic idea of the probabilistic approach to gene finding is to find the sequence annotation (locations of coding and non-coding sequences) that has the highest probability. The goal is to define a functional role for each nucleotide in the sequence. For a DNA sequence $S = \{n_1, n_2, \ldots, n_l\}$ (where $n_i$ is the nucleotide at position i), the functional role of each nucleotide can be indicated by a functional sequence $A = \{a_1, a_2, \ldots, a_l\}$, where each $a_i$ takes a value depending on the functional role, e.g. 0 if $n_i$ is part of a non-coding region, 1 if $n_i$ is part of a gene etc. In other words, the aim of gene finding is to determine the most likely functional sequence A for a given DNA sequence S.

In a hidden Markov model (HMM) algorithm, statistical patterns specific for coding/non-coding sequences may be converted into parameters of Markov chain models (a Markov chain represents a series of observations wherein the probability of a given observation depends on previous observations).
A biologically relevant nucleotide sequence may be modeled in this way, because the probability of observing a particular nucleotide depends on the nucleotides observed at other positions. For instance, within a coding region the probability of observing nucleotides G or A after the two nucleotides TA in a coding frame is zero, because this would lead to the stop codon TAG or TAA (UAG or UAA in the mRNA); in a non-coding region, there is no such dependence. Many other parameters, e.g. specific nucleotide frequencies, may be introduced into a model. In the framework of such a model, the algorithm is able to compute the probability that any particular functional sequence A underlies a given sequence S. Finding the sequence A corresponding to the highest probability, and estimating the reliability of the result, actually means finding functional regions.
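Finding the most probable functional sequence A for a given sequence S is commonly done with the Viterbi algorithm. The sketch below is a toy two-state illustration, not a realistic gene model: the state labels ("N" for non-coding, "C" for coding), all probabilities and the test sequence are hypothetical values chosen so that the coding state simply prefers G and C.

```python
import math


def viterbi(seq, states, start_p, trans_p, emit_p):
    """Most probable state path for seq under a simple HMM (log-space)."""
    scores = [{s: math.log(start_p[s]) + math.log(emit_p[s][seq[0]]) for s in states}]
    backpointers = []
    for symbol in seq[1:]:
        row, pointers = {}, {}
        for s in states:
            # Best predecessor state for being in state s at this position
            best_prev = max(states, key=lambda p: scores[-1][p] + math.log(trans_p[p][s]))
            row[s] = (scores[-1][best_prev] + math.log(trans_p[best_prev][s])
                      + math.log(emit_p[s][symbol]))
            pointers[s] = best_prev
        scores.append(row)
        backpointers.append(pointers)
    # Trace back from the best final state
    state = max(states, key=lambda s: scores[-1][s])
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    return "".join(reversed(path))


# Hypothetical toy parameters (illustration only)
states = ["N", "C"]
start_p = {"N": 0.5, "C": 0.5}
trans_p = {"N": {"N": 0.9, "C": 0.1}, "C": {"N": 0.1, "C": 0.9}}
emit_p = {
    "N": {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25},
    "C": {"A": 0.10, "C": 0.40, "G": 0.40, "T": 0.10},
}
toy_seq = "ATATAT" + "GCGCGCGCGCGC" + "ATATAT"
annotation = viterbi(toy_seq, states, start_p, trans_p, emit_p)
```

With these parameters the G-C rich stretch of 12 nucleotides is long enough to outweigh the cost of two state switches, so it is labelled "C" and the flanks "N". Real gene finders use many more states (e.g. codon positions, introns, signals) and parameters trained on annotated genomes.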

4.4. Similarity-based search

The (significant) similarity of a region in a genome to a known gene in some other genome is one of the most reliable predictors. For instance, the hits of BLASTX (translated nucleotide query vs. protein database) may already predict some genes contained in the query.

Another approach is to compare genomic sequences with databases of expressed sequence tags (ESTs), such as the dbEST database of NCBI. ESTs are short single reads from the 3' ends of mRNA. Although these sequences are frequently incomplete and contain errors, a match between a genomic sequence and an EST can be used for gene annotation, especially when the EST spans several exons. Even cross-species EST comparisons (e.g. human vs. mouse EST) can be used for gene searching. Databases of RNA-Seq reads, obtained for various species, can also be efficiently used: alignments of these transcripts to genomic sequences lead to gene finding.

4.5. Genome annotation strategies

The programs for gene finding are generally classified as similarity-based and ab initio. Ab initio programs use both rule-based algorithms (searching for specific signals and/or contents, usually with some scoring) and probabilistic models. There is no ideal universal approach. Annotation strategies combine gene-finding algorithms with the processing of transcriptomics data, yielding elaborate genome annotation pipelines.

Of course, the approaches for prokaryotic and eukaryotic genomes differ considerably. Prokaryotic genome annotation pipelines use the predictions of open reading frames by ab initio methods and alignments (in particular, by the tblastn algorithm) of the putative ORFs to previously known proteins. Comparative analysis of prokaryotic ORFs is very important, as the number of available prokaryotic genomes exceeds the typical number of genes in a genome.

Eukaryotic genome annotation pipelines (e.g.
NCBI eukaryotic genome annotation pipeline, ENSEMBL genebuild) attempt to integrate the maximum amount of relevant information on mRNA transcripts, obtained in experiments and available in various databases and genome browsers. The transcripts are aligned to the annotated genome, which yields the positions of expressed exons in the genes. The annotation pipelines may also include gene prediction programs that combine similarity-based information with models produced by ab initio algorithms. A number of additional procedures can be carried out, such as data filtering, protein domain annotation, cross-referencing to external databases etc.

It should be noted that a large number of the transcription products of the human genome and other eukaryotic genomes do not have protein-coding potential and may be classified into different categories. The number of known so-called non-coding RNAs (ncRNAs), such as microRNAs and long non-coding RNAs, is growing.

5. Protein structure prediction

The development of algorithms for protein structure prediction from sequence has a relatively long history. Various secondary (2D) structure prediction algorithms consider the residues in a polypeptide chain to be in three (helix, strand, coil), four (helix, strand, coil, turn) or even eight states and try to predict the location of these states.

The majority of methods for protein structure prediction are based on empirical (statistical) approaches. They try to extrapolate the statistics observed in known structures to other proteins. The simplest observations are:
- Ala, Gln, Leu and Met are commonly found in α-helices;
- Pro, Gly, Tyr and Ser usually are not;
- Pro is a helix-breaker (its bulky ring prevents the formation of n/n+4 H-bonds).

The algorithm of Chou and Fasman (1978) was a widely used approach in the 1970s-1980s. It was based on the calculation of a moving average of values that indicated the probability (or propensity) of a residue type to adopt one of three structural states: α-helix, β-sheet and turn conformation. The probabilities were simply the frequencies with which a given residue type was observed in a particular secondary structure, normalized by the frequency expected by chance. Some additional heuristic rules were added that attempted to determine the exact ends of secondary structure elements.

In a further development of these ideas, the algorithm of Garnier et al. (1978) introduced an estimate of the effect that the residues within the region from eight residues N-terminal to eight residues C-terminal of a given position have on the structure at that position. For each of the 20 amino acids, a profile (17 residues long) was defined that quantifies the contribution this residue makes towards the probabilities of other residues being in one of four states: α-helix, β-sheet, turn and coil.
Thus, for every residue the values from the ±8 positions are added (four values from each), so that four probability profiles are produced, and at any position the highest profile value predicts the structure.

Further developments of prediction methods (until the early 1990s) were based on implementing various rules for pattern recognition, in particular hydrophobicity patterns, the periodicity of helices, and different ways of calculating propensities in windows of variable sizes (3-51 residues). However, the prediction accuracy of such approaches seemed to stall at levels slightly above 60% (the percentage of residues predicted correctly in one of the three states: helix, strand, and other).

The next generation of methods incorporated multiple alignments into the predictions. They were based on the concept that homologous proteins should have similar structures. For example, all naturally evolved proteins with more than 35% pairwise identical residues over more than 100 aligned residues have similar structures (Rost, 1999). Predictions were further improved by the application of neural networks, trained on known structures.

An efficient algorithm has been developed that combines multiple alignments with neural networks for secondary structure prediction (PSIPRED, Jones, 1999). In the first step, the algorithm uses iterations of PSI-BLAST to construct a profile (PSSM) corresponding to the query protein. This profile is used as the input to a neural network, and the final output is a 3-state prediction (helix/strand/coil).

Multiple algorithms for protein 3D structure prediction have been developed. The modern approaches to protein tertiary structure prediction can be divided into three general strategies:

(1) Comparative (homology) modeling. If there is a clear sequence homology between the target and one or more known structures, an algorithm tries to obtain the most accurate structural model for the target, consistent with the known set.

(2) Fold recognition.
Algorithms that try to recognize a known fold in a domain within the target protein.

(3) Ab initio methods. Modeling of structures using energy calculations, considering both secondary and tertiary structures.

There are overlaps between the methodologies.

5.1. Homology modeling

The homology modeling approach assumes that sequence similarity between a target protein and at least one related protein with known structure (the template) implies 3D similarity as well, and therefore allows one to extrapolate the template structures to the target sequence.

Main steps in comparative protein structure modeling (Fiser et al., 2001):
1. Identify related structures.
2. Select templates.
3. Align target with templates.
4. Build a model for the target (using information from template structures).
5. Evaluate the model.
6. If the model is not satisfactory, repeat steps 2-5 or 3-5.

Identification of structures related to the target sequence is usually done by searching the database of known protein structures (PDB) using the target sequence as the query. For instance, a specific algorithm for template search in PDB is PDB-BLAST, which not only builds a multiple alignment using the target as a query, but also constructs similar multiple alignments using all the potential templates found as queries (here each alignment is called a sequence profile, and it can be converted into a matrix). The templates are found by comparing the target sequence profile with each of the template sequence profiles (e.g. by a dynamic programming method). This makes it possible to capture the sequence motifs that are essential for the fold to be predicted.

Selection of optimal templates is guided by several criteria, such as the overall sequence similarity to the target and the quality of the experimentally determined structure. It is not necessary to select only one template: alignments of the target to different templates may be used for model building. These alignments may be refined after template selection.

Various algorithms are used for model building, e.g. based on "rigid bodies" or on the satisfaction of spatial restraints (based on core conserved structures obtained from aligned template structures).
There are systems for (web-based) automated homology modeling that are able to predict the structures for one or many sequences without human intervention. One example is SWISS-MODEL, one of the first servers for protein structure prediction, initiated in 1993 and accessible via the ExPASy web server.

5.2. Fold recognition

When the sequence similarity of a target protein to the known structures is relatively low, one may attempt to recognize a known fold within the target by searching for an optimal sequence-to-structure compatibility. Mostly this is done by threading algorithms: a target sequence is threaded through templates from the structure database, and alternative sequence-structure alignments are scored according to some measure of compatibility between the target sequence and the template structures.

The scoring is done using threading potentials, for instance so-called knowledge-based or mean-force potentials. These potentials do not consider the physical nature of interactions in proteins and are derived from the statistics of the databases of known structures. Knowledge-based (database-derived) mean-force potentials incorporate all forces (electrostatic, van der Waals etc.) acting between atoms as well as the influence of the environment (solvent).

For the interaction between two residues (a, b) with a sequence separation k and distance r between specified types of atoms (e.g. Cβ-Cβ, Cβ-N etc.), a general definition of the potential is the "inverse Boltzmann equation"

$$E^{ab}_k(r) = -RT \ln\left[ f^{ab}_k(r) \right],$$

where the occurrence frequency $f^{ab}_k(r)$ is obtained from a database of known structures. The definition of the reference state is very important. A convenient choice for the reference state is

$$E_k(r) = -RT \ln\left[ f_k(r) \right],$$

where $f_k(r)$ is an average value over all residue types. Thus the potential for the specific interaction of residues is

$$\Delta E^{ab}_k(r) = E^{ab}_k(r) - E_k(r) = -RT \ln\left[ \frac{f^{ab}_k(r)}{f_k(r)} \right].$$

A solvation potential for an amino acid residue a is defined as

$$\Delta E^{a}_{solv}(r) = -RT \ln\left[ \frac{f^{a}(r)}{f(r)} \right],$$

where r is the degree of residue burial, $f^{a}(r)$ is the frequency of occurrence of residue a with burial r, and $f(r)$ is the frequency of occurrence of an arbitrary residue with burial r.
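As an illustration of the inverse Boltzmann relation, the following minimal sketch derives a relative potential from binned counts. The function name, the value of RT (kcal/mol at about 298 K) and the counts are assumptions made up for the example; real potentials are derived from large structure databases and usually include corrections for sparse data.

```python
import math

# Assumption: RT in kcal/mol at about 298 K (R = 1.987e-3 kcal/(mol K)).
RT = 1.987e-3 * 298.0

def relative_potential(counts_specific, counts_reference, rt=RT):
    """Return -RT ln(f_specific / f_reference) for each bin.

    counts_specific: observations per bin for the specific case
        (e.g. residue pair a,b at separation k, binned by distance r).
    counts_reference: observations per bin for the reference state
        (e.g. all residue pairs at separation k).
    Bins without observations yield None (the potential is undefined there).
    """
    total_specific = sum(counts_specific)
    total_reference = sum(counts_reference)
    potential = []
    for n_spec, n_ref in zip(counts_specific, counts_reference):
        if n_spec == 0 or n_ref == 0:
            potential.append(None)
            continue
        f_spec = n_spec / total_specific
        f_ref = n_ref / total_reference
        potential.append(-rt * math.log(f_spec / f_ref))
    return potential

# Hypothetical counts in three distance bins (short, medium, long).
counts_ab = [20, 30, 50]
counts_all = [100, 300, 600]
delta_e = relative_potential(counts_ab, counts_all)
```

In this toy case the pair is seen twice as often as average in the shortest bin, so the potential there is negative (favourable); in the longest bin it is seen less often than average, so the potential is positive. The same function can be applied to the solvation potential by binning on the degree of burial instead of distance.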

Obviously, threading algorithms are more complex than sequence-sequence alignments and require some approximations. For instance, in ungapped threading, the query sequence is mounted over an equally long part of the template fold. The total alignment score is easily computed as the sum of pairwise potentials for all query residues. Ungapped threading is mostly used not for real predictions, but rather for testing and adjusting the energy functions.

Treatment of gaps in sequence-structure alignments presents a problem: the score for a given residue in the query sequence, assumed to be aligned to a residue in the template structure, depends not only on the types of these two residues (as in sequence-sequence alignments), but also on the gaps that may be introduced at other alignment positions. Here a so-called frozen approximation is usually used.

In the frozen approximation, an appropriate comparison matrix of size N × M (N residues in the template and M in the query) is calculated by replacing the amino acids in the template structure with amino acids from the target sequence one at a time. The rest of the structure is kept intact, and it is assumed that the field created by the native protein will also favour the correct replacement. Although it is a very crude approximation, it is rather efficient, despite the fact that it does not solve the full threading problem (Sippl, 1993).

The subsequent steps are equivalent to sequence-sequence alignment: the scores in the comparison matrix may be used for calculating a dynamic programming matrix leading to the final alignment.

Fold recognition methods often produce several scores, related to different aspects of the sequence-structure alignment. Some of the most important are: initial sequence profile alignment score, number of aligned residues, length of target sequence, length of template sequence, pairwise energy sum, and solvation energy sum. Each of these scores separately may not be sufficient, but they can be combined to calculate a determinant score.
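The frozen approximation and the subsequent dynamic programming step can be sketched as follows. This is a toy illustration under stated assumptions, not a published threading method: the contact lists, the hydrophobic contact energy, the linear gap penalty and all function names are hypothetical.

```python
# Assumption: a crude contact energy that rewards hydrophobic-hydrophobic contacts.
HYDROPHOBIC = set("AVLIMFWC")

def pair_energy(res_a, res_b):
    """Toy pairwise potential: -1 for a hydrophobic contact, 0 otherwise."""
    return -1.0 if res_a in HYDROPHOBIC and res_b in HYDROPHOBIC else 0.0

def frozen_matrix(template_seq, contacts, query):
    """Comparison matrix of size N x M under the frozen approximation.

    Entry [i][j] is the energy of placing query residue j at template
    position i while all neighbours keep their native (template) residues.
    contacts[i] lists the template positions in contact with position i.
    """
    return [[sum((pair_energy(q, template_seq[p]) for p in contacts[i]), 0.0)
             for q in query]
            for i in range(len(template_seq))]

def align_energy(matrix, gap=1.0):
    """Global alignment by dynamic programming, minimizing total energy.
    Each gap position costs `gap` (a linear penalty, chosen for simplicity)."""
    n, m = len(matrix), len(matrix[0])
    dp = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dp[i][0] = i * gap
    for j in range(1, m + 1):
        dp[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            dp[i][j] = min(dp[i - 1][j - 1] + matrix[i - 1][j - 1],  # align i with j
                           dp[i - 1][j] + gap,                       # gap in query
                           dp[i][j - 1] + gap)                       # gap in template
    return dp[n][m]

# Hypothetical three-residue template in which positions 0 and 2 are in contact.
template_seq = "LKL"
contacts = {0: [2], 1: [], 2: [0]}
matrix = frozen_matrix(template_seq, contacts, "LKL")
energy = align_energy(matrix)
```

Because each matrix entry depends only on the template's native neighbours, the entries are independent of the rest of the alignment, which is what makes ordinary dynamic programming applicable.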


More information

Gene Expression: Transcription

Gene Expression: Transcription Gene Expression: Transcription The majority of genes are expressed as the proteins they encode. The process occurs in two steps: Transcription = DNA RNA Translation = RNA protein Taken together, they make

More information

NUCLEIC ACIDS AND PROTEIN SYNTHESIS

NUCLEIC ACIDS AND PROTEIN SYNTHESIS NUCLEIC ACIDS AND PROTEIN SYNTHESIS DNA Cell Nucleus Chromosomes is a coiled double helix carrying hereditary information of the cell Contains the instructions for making from 20 different amino acids

More information

BIO 101 : The genetic code and the central dogma

BIO 101 : The genetic code and the central dogma BIO 101 : The genetic code and the central dogma NAME Objectives The purpose of this exploration is to... 1. design experiments to decipher the genetic code; 2. visualize the process of protein synthesis;

More information

Make the protein through the genetic dogma process.

Make the protein through the genetic dogma process. Make the protein through the genetic dogma process. Coding Strand 5 AGCAATCATGGATTGGGTACATTTGTAACTGT 3 Template Strand mrna Protein Complete the table. DNA strand DNA s strand G mrna A C U G T A T Amino

More information

Chapter 10: Gene Expression and Regulation

Chapter 10: Gene Expression and Regulation Chapter 10: Gene Expression and Regulation Fact 1: DNA contains information but is unable to carry out actions Fact 2: Proteins are the workhorses but contain no information THUS Information in DNA must

More information

Higher Human Biology Unit 1: Human Cells Pupils Learning Outcomes

Higher Human Biology Unit 1: Human Cells Pupils Learning Outcomes Higher Human Biology Unit 1: Human Cells Pupils Learning Outcomes 1.1 Division and Differentiation in Human Cells I can state that cellular differentiation is the process by which a cell develops more

More information

STUDY GUIDE SECTION 10-1 Discovery of DNA

STUDY GUIDE SECTION 10-1 Discovery of DNA STUDY GUIDE SECTION 10-1 Discovery of DNA Name Period Date Multiple Choice-Write the correct letter in the blank. 1. The virulent strain of the bacterium S. pneumoniae causes disease because it a. has

More information

Transcription Eukaryotic Cells

Transcription Eukaryotic Cells Transcription Eukaryotic Cells Packet #20 1 Introduction Transcription is the process in which genetic information, stored in a strand of DNA (gene), is copied into a strand of RNA. Protein-encoding genes

More information

Eukaryotic Gene Structure

Eukaryotic Gene Structure Eukaryotic Gene Structure Terminology Genome entire genetic material of an individual Transcriptome set of transcribed sequences Proteome set of proteins encoded by the genome 2 Gene Basic physical and

More information

NCERT MULTIPLE-CHOICE QUESTIONS

NCERT MULTIPLE-CHOICE QUESTIONS 36 BIOLOGY, EXEMPLAR PROBLEMS CHAPTER 6 MOLECULAR BASIS OF INHERITANCE MULTIPLE-CHOICE QUESTIONS 1. In a DNA strand the nucleotides are linked together by: a. glycosidic bonds b. phosphodiester bonds c.

More information

1. DNA, RNA structure. 2. DNA replication. 3. Transcription, translation

1. DNA, RNA structure. 2. DNA replication. 3. Transcription, translation 1. DNA, RNA structure 2. DNA replication 3. Transcription, translation DNA and RNA are polymers of nucleotides DNA is a nucleic acid, made of long chains of nucleotides Nucleotide Phosphate group Nitrogenous

More information

Review of Protein (one or more polypeptide) A polypeptide is a long chain of..

Review of Protein (one or more polypeptide) A polypeptide is a long chain of.. Gene expression Review of Protein (one or more polypeptide) A polypeptide is a long chain of.. In a protein, the sequence of amino acid determines its which determines the protein s A protein with an enzymatic

More information

1. The diagram below shows an error in the transcription of a DNA template to messenger RNA (mrna).

1. The diagram below shows an error in the transcription of a DNA template to messenger RNA (mrna). 1. The diagram below shows an error in the transcription of a DNA template to messenger RNA (mrna). Which statement best describes the error shown in the diagram? (A) The mrna strand contains the uracil

More information

Gene Expression - Transcription

Gene Expression - Transcription DNA Gene Expression - Transcription Genes are expressed as encoded proteins in a 2 step process: transcription + translation Central dogma of biology: DNA RNA protein Transcription: copy DNA strand making

More information

Self-test Quiz for Chapter 12 (From DNA to Protein: Genotype to Phenotype)

Self-test Quiz for Chapter 12 (From DNA to Protein: Genotype to Phenotype) Self-test Quiz for Chapter 12 (From DNA to Protein: Genotype to Phenotype) Question#1: One-Gene, One-Polypeptide The figure below shows the results of feeding trials with one auxotroph strain of Neurospora

More information