Making Sense of DNA and Protein Sequences. Lily Wang, PhD Department of Biostatistics Vanderbilt University


Outline
  Biological background
  Major biological sequence databanks
  Basic concepts in sequence comparison methods
  Substitution matrices and gap penalties
  Methods for aligning two sequences
  Statistical distributions of alignment scores
  Software - BLAST

DNA
  The genetic material that is physically transmitted from parent to offspring
  Double-stranded helix
  Within a single chain, each nucleotide consists of a base, a sugar, and a phosphate group
  A, G, C, T refer to the bases
  A pairs with T; G pairs with C
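The pairing rule on this slide (A with T, G with C) is enough to derive one strand from the other. The following is a minimal sketch; the function name `reverse_complement` and the example strand are illustrative and not part of the original slides.

```python
# Watson-Crick pairing as stated on the slide: A pairs with T, G pairs with C.
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}


def reverse_complement(strand: str) -> str:
    """Return the opposite strand, read in its own 5'-to-3' direction."""
    # The two strands run antiparallel, so the sequence is reversed
    # before each base is replaced by its pairing partner.
    return "".join(COMPLEMENT[base] for base in reversed(strand.upper()))


if __name__ == "__main__":
    print(reverse_complement("ATGC"))  # GCAT
```

Applying the function twice returns the original strand, which is a quick sanity check on the pairing table.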

Proteins
  The ultimate cellular activities are influenced through DNA-encoded proteins

The Central Dogma of Molecular Biology

Splicing, Introns and Exons

The Growth of Biological Data

Goals of Functional Genomics
  Understand the functions of genes and their interplay with proteins and the environment to create complex, dynamic living systems

Sequence Databases - Primary Nucleotide Databases
  Contain sequences derived from sequencing a biological molecule that exists in a test tube, somewhere in a lab.
  They do not represent sequences that are a consensus of a population.
  GenBank, DDBJ, EMBL

Sequence Databases - An Example Record
  http://www.ncbi.nlm.nih.gov/sitemap/samplerecord.html

Secondary Databases - RefSeq
  These are curated databases.
  Many sequences are represented more than once in GenBank, which leads to a high degree of redundancy.
  Goal: provide a reference for each molecule in the central dogma (DNA, mRNA, and protein).
  RefSeq accession number format: 2+6 (two letters, an underscore, six digits)
  Experimentally determined sequence data:
    NT_123456  Genomic contigs (DNA)
    NM_123456  mRNAs
    NP_123456  Proteins
  Computational predictions from raw DNA sequences:
    XM_123456  Model mRNA
    XP_123456  Model proteins
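The prefix table above maps directly onto a small lookup. The sketch below checks only the 2+6 format shown on the slide; the function name `classify_accession` is hypothetical. Accessions in current RefSeq releases can carry longer digit strings and a version suffix, which this pattern deliberately rejects.

```python
import re

# Prefix meanings exactly as listed on the slide.
REFSEQ_PREFIXES = {
    "NT": "genomic contig (DNA), experimentally determined",
    "NM": "mRNA, experimentally determined",
    "NP": "protein, experimentally determined",
    "XM": "model mRNA, computational prediction",
    "XP": "model protein, computational prediction",
}

# "2+6" format: two uppercase letters, an underscore, six digits.
ACCESSION_RE = re.compile(r"^([A-Z]{2})_(\d{6})$")


def classify_accession(accession):
    """Return the molecule type for a 2+6 accession, or None if unrecognized."""
    match = ACCESSION_RE.match(accession)
    if match is None:
        return None
    return REFSEQ_PREFIXES.get(match.group(1))


if __name__ == "__main__":
    for acc in ("NM_123456", "XP_123456", "NM_12345"):
        print(acc, "->", classify_accession(acc))
```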

Protein Sequences - UniProt Knowledgebase (http://www.pir.uniprot.org/)
  Swiss-Prot: manually annotated records and curator-evaluated computational analysis
  TrEMBL: computationally analyzed records awaiting manual annotation
  Source: translation of all coding sequences (CDS) found in DDBJ/EMBL/GenBank, PDB entries, sequences submitted directly to UniProt, and PIR-PSD
  Non-redundancy: all protein products derived from a given gene are described in a single record
  Extensive cross-references: to GenBank, 2D-PAGE data, protein structure databases, protein domain and family characterization databases, species-specific data collections, and disease databases

Amino Acid Codes
  A sample record: http://ca.expasy.org/cgi-bin/get-sprot-entry?p09651

Sequence Comparisons
  It is much easier to determine the sequences of genes and proteins than to determine their structure or function.
  New sequences are adapted from pre-existing sequences rather than invented anew.
  The more conserved amino acids in similar proteins from different species are the ones that play an essential role in structure and function.
  Significant similarities between sequences often give important clues to phylogeny, structure, and function.

Origins of Genes Having a Similar Sequence

Sequence Comparisons
  Human DNA has about 30,000 genes; the functions of more than 50% of them are still unknown.
  Many human proteins share similarities with proteins from other organisms.
  By experimentation and by comparing genes and proteins with those already known in the databases, our goal is to determine the functions of newly discovered genes and proteins.

Pairwise Comparisons
  Decide on a scoring system.
  Align a given set of sequences to find the best matching region(s).
  Assign a score for the comparison between sequences.
  Evaluate the statistical significance of the score.

Substitution Matrices
  A substitution matrix, or score matrix, is a matrix in which $s(a_i, a_j)$ is in position $i, j$.
  [Figure: BLOSUM 62 matrix]

Substitution Matrices
  Not all amino acids are equal
    Some are more easily substituted than others
    Some mutations occur more often
    Some substitutions are kept more often
  Mutations tend to favor some substitutions
    Some amino acids have similar codons (for example TTT & TTC for Phe, TTA & TTG for Leu)
    They are more likely to be changed by DNA mutation
  Selection tends to favor some substitutions
    Some amino acids have similar properties/structure
    They are more likely to be kept

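Log Odds Score
  Given a pair of aligned sequences, we want to assign a score to the alignment that measures the relative likelihood that the sequences are related as opposed to unrelated.
  Consider the likelihood ratio for two sequences $x$ and $y$:
  $$\frac{P(x, y \mid \text{match model})}{P(x, y \mid \text{random model})} = \frac{\prod_i p_{x_i y_i}}{\prod_i q_{x_i} \prod_i q_{y_i}} = \prod_i \frac{p_{x_i y_i}}{q_{x_i} q_{y_i}}$$
  Here $p_{ab}$ is the probability of residues $a$ and $b$ occurring as an aligned pair under the match model, and $q_a$ is the background frequency of residue $a$.
  The log-odds ratio is $\sum_i s(x_i, y_i)$, where
  $$s(a, b) = \log \frac{p_{ab}}{q_a q_b}$$
  is the log likelihood ratio of the residue pair $(a, b)$ occurring as an aligned pair, as opposed to an unaligned pair.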
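As a small numerical illustration of the log-odds formula, the sketch below computes $s(a, b)$ and the summed score of an ungapped alignment. All frequencies in the example are made-up placeholders chosen for easy arithmetic, not values from any published matrix, and the base-2 logarithm is an assumption (the slide does not fix a base).

```python
import math


def log_odds(p_ab, q_a, q_b, base=2.0):
    """s(a, b) = log( p_ab / (q_a * q_b) ), in units set by the log base."""
    return math.log(p_ab / (q_a * q_b), base)


def alignment_log_odds(x, y, pair_prob, background, base=2.0):
    """Sum s(x_i, y_i) over the columns of an ungapped alignment."""
    return sum(
        log_odds(pair_prob[(a, b)], background[a], background[b], base)
        for a, b in zip(x, y)
    )


if __name__ == "__main__":
    # Hypothetical two-letter alphabet with made-up frequencies.
    background = {"A": 0.5, "B": 0.5}
    pair_prob = {("A", "A"): 0.4, ("B", "B"): 0.4,
                 ("A", "B"): 0.1, ("B", "A"): 0.1}
    print(log_odds(0.5, 0.5, 0.5))  # pair seen twice as often as chance: 1 bit
    print(alignment_log_odds("AAB", "ABB", pair_prob, background))
```

A positive score means the pair is seen more often in related sequences than chance predicts; a negative score means less often.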

Scoring an Alignment
  The score of an alignment is the sum of the scores of all pairs of residues in the alignment.

    sequence 1:   T   C   C   P   S   I   V   A   R   S   N
    sequence 2:   S   C   C   P   S   I   S   A   R   N   T
    score:        1  12  12   6   2   5  -1   2   6   1   0    => alignment score = 46

  Maximal Segment Pair (MSP): given two protein sequences, the pair of equal-length segments that, when aligned, have the greatest aggregate score. An MSP may be of any length; its score is the MSP score.
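The arithmetic on this slide can be checked directly. The sketch below sums the per-column scores copied from the slide. It also finds the highest-scoring contiguous segment of a fixed, ungapped alignment with a standard maximum-subarray scan. That is only the MSP restricted to one alignment offset, not a search over all segment pairs. The function names are illustrative.

```python
# Per-column scores copied from the slide for
#   TCCPSIVARSN
#   SCCPSISARNT
COLUMN_SCORES = [1, 12, 12, 6, 2, 5, -1, 2, 6, 1, 0]


def alignment_score(scores):
    """The alignment score is the sum of the column scores."""
    return sum(scores)


def best_segment(scores):
    """Highest-scoring contiguous run of columns: (score, start, end_exclusive)."""
    best, best_start, best_end = 0, 0, 0
    current, start = 0, 0
    for i, s in enumerate(scores):
        if current <= 0:
            # A non-positive prefix can never help; restart the segment here.
            current, start = 0, i
        current += s
        if current > best:
            best, best_start, best_end = current, start, i + 1
    return best, best_start, best_end


if __name__ == "__main__":
    print(alignment_score(COLUMN_SCORES))            # 46
    print(best_segment([-2, 3, 4, -1, 5, -6, 2]))    # (11, 1, 5)
```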

Theories on Substitution Matrices
  Among the MSPs from the comparison of random sequences, the amino acids $a_i$ and $a_j$ are aligned with target frequency (Altschul, 1991)
  $$q_{ij} = p_i p_j e^{\lambda s_{ij}}$$
  where $p_i$ is the background frequency of amino acid $a_i$ and $s_{ij}$ is the substitution score for the pair. (In this notation $q$ denotes the target frequencies and $p$ the background frequencies.)
  Among alignments representing distant homologies, the amino acids are paired with certain characteristic frequencies. Only if these correspond to a matrix's target frequencies, it has been argued, can the matrix be optimal for distinguishing distant local homologies from similarities due to chance (Karlin & Altschul, 1990).
  Any substitution matrix is implicitly a log-odds matrix, with a specific target distribution for aligned pairs of amino acid residues.

Substitution Matrices
  The PAM family
    PAM matrices are based on global alignments of closely related proteins.
    PAM1 is the matrix calculated from comparisons of sequences with no more than 1% divergence.
    Other PAM matrices are extrapolated from PAM1.
  The BLOSUM family
    BLOSUM matrices are based on local alignments.
    BLOSUM 62 is calculated from aligned blocks in which sequences that are at least 62% identical are clustered together, so the counted substitutions come from sequences that are less than 62% identical.
    All BLOSUM matrices are based on observed alignments; they are not extrapolated from comparisons of closely related proteins.
    Though BLOSUM 62 is tailored for comparisons of moderately distant proteins, it performs well in detecting closer relationships.

Selecting Optimal Matrices (Altschul, 1991; Henikoff & Henikoff 1993; Wheeler, 2003)

  Matrix     Best use                                               Similarity (%)
  PAM40      Short alignments that are highly similar               70-90
  PAM160     Detecting members of a protein family                  50-60
  PAM250     Longer alignments of more divergent sequences          ~30
  BLOSUM90   Short alignments that are highly similar               70-90
  BLOSUM80   Detecting members of a protein family                  50-60
  BLOSUM62   Most effective in finding all potential similarities   30-40
  BLOSUM30   Longer alignments of more divergent sequences          <30

Global vs. Local Alignment
  Global alignment
    Aligns the entire sequences, using all characters up to both ends of each sequence
  Local alignment
    Aligns only the best matching parts of the sequences, those that give the highest matching scores
    Rationale: distantly related proteins may share only isolated regions of similarity

Methods for Pairwise Alignment
  Dot matrix
    Reveals the presence of insertions/deletions and direct/inverted repeats; should be considered first choice
  Dynamic programming
    Guarantees the optimal alignment, but can be slow
  k-tuple heuristic method
    Does not guarantee the optimal alignment, but is fast; implemented in BLAST and FASTA
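To make the dynamic programming entry concrete, here is a minimal sketch of a local alignment score in the style of the Smith-Waterman recurrence, which the slides do not name. The match, mismatch, and gap values are hypothetical placeholders rather than a real substitution matrix, and the function returns only the optimal score, not the alignment itself.

```python
def local_alignment_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score of a and b with a linear gap penalty."""
    rows, cols = len(a) + 1, len(b) + 1
    # H[i][j] = best score of a local alignment ending at a[i-1], b[j-1].
    H = [[0] * cols for _ in range(rows)]
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            H[i][j] = max(
                0,                    # start a new local alignment here
                H[i - 1][j - 1] + pair,  # align a[i-1] with b[j-1]
                H[i - 1][j] + gap,    # gap in b
                H[i][j - 1] + gap,    # gap in a
            )
            best = max(best, H[i][j])
    return best


if __name__ == "__main__":
    print(local_alignment_score("TTACGTT", "GGACGGG"))  # 6, from the shared ACG
```

The table has (len(a)+1) x (len(b)+1) cells, which is why the method is exact but slow for database-scale searches.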

Dot Matrix Method
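A dot matrix can be sketched as a grid with a mark wherever a residue of one sequence equals a residue of the other. The version below compares single characters with no window or threshold filtering, which is a simplifying assumption; the function name `dot_matrix` is illustrative.

```python
def dot_matrix(seq_a, seq_b):
    """Rows follow seq_a, columns follow seq_b; '*' marks identical residues."""
    return [
        "".join("*" if a == b else "." for b in seq_b)
        for a in seq_a
    ]


if __name__ == "__main__":
    for row in dot_matrix("GATTACA", "GATCACA"):
        print(row)
```

Runs of marks along a diagonal indicate matching regions; a diagonal that shifts sideways suggests an insertion or deletion.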

Heuristic Method - Basic Local Alignment Search Tool (BLAST)
  Question: What database sequences are most similar to (or contain the most similar regions to) my previously uncharacterised sequence?
  BLAST finds the highest-scoring locally optimal alignments between a query sequence and a database.
  Very fast algorithm, but does not guarantee the optimal alignment
  Can be used to search extremely large databases
  Sufficiently sensitive and selective for most purposes

BLAST
  For a given word length w (usually 3 for proteins) and a given score matrix: create a list of all words (w-mers) that can score > T when compared to w-mers from the query. These are the neighborhood words.
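The word-list step can be sketched by brute force: enumerate every possible w-mer and keep those scoring above T against some w-mer of the query. The scoring function `toy_score` and the threshold value are hypothetical stand-ins for a real substitution matrix and a real T, and this enumeration is for illustration only, not how BLAST itself builds the list.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def toy_score(a, b):
    # Hypothetical scores, NOT a real substitution matrix.
    return 5 if a == b else -2


def neighborhood_words(query, w=3, threshold=7, score=toy_score):
    """Map each neighborhood word to the query positions it scores > threshold against."""
    neighbors = {}
    for pos in range(len(query) - w + 1):
        q_word = query[pos:pos + w]
        for letters in product(AMINO_ACIDS, repeat=w):
            total = sum(score(a, b) for a, b in zip(q_word, letters))
            if total > threshold:
                neighbors.setdefault("".join(letters), []).append(pos)
    return neighbors


if __name__ == "__main__":
    words = neighborhood_words("MKV")
    print(len(words), sorted(words)[:5])
```

With these toy scores a word passes only if it matches the query word in at least two of three positions, so a single query word yields 1 + 3 x 19 = 58 neighbors.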

BLAST
  Each neighborhood word gives all positions in the database where it is found (hit list).
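One simple way to produce such a hit list is to index every w-mer of the database once and then look up each word. The sketch below assumes the database is a plain dictionary of sequence identifiers to strings; the function names and the example sequences are hypothetical.

```python
from collections import defaultdict


def index_database(database, w=3):
    """Map every w-mer to the (sequence id, position) pairs where it occurs."""
    index = defaultdict(list)
    for seq_id, seq in database.items():
        for pos in range(len(seq) - w + 1):
            index[seq[pos:pos + w]].append((seq_id, pos))
    return index


def hit_list(words, index):
    """Collect database positions for each word that occurs at least once."""
    return {word: index[word] for word in words if word in index}


if __name__ == "__main__":
    db = {"s1": "MKVMKV", "s2": "AAMKV"}  # made-up sequences
    print(hit_list(["MKV", "WWW"], index_database(db)))
```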

BLAST
  The program tries to extend matching segments (seeds) out in both directions by adding pairs of residues. Residues will be added until the incremental score drops below a threshold.
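The extension step can be sketched as an ungapped walk outward from the seed in each direction. The stopping rule used here, stop once the running score falls more than `drop_off` below the best score seen so far, is one reading of the "threshold" on the slide and should be treated as an assumption. The scores in `toy_score` and all default parameter values are hypothetical.

```python
def toy_score(a, b):
    # Hypothetical scores, NOT a real substitution matrix.
    return 5 if a == b else -4


def extend_one_way(query, subject, q_pos, s_pos, step, drop_off, score):
    """Extend without gaps in one direction; return (best gain, residues kept)."""
    best = running = 0
    best_len = length = 0
    q, s = q_pos, s_pos
    while 0 <= q < len(query) and 0 <= s < len(subject):
        running += score(query[q], subject[s])
        length += 1
        if running > best:
            best, best_len = running, length
        elif best - running > drop_off:
            break  # score has fallen too far below the best seen so far
        q += step
        s += step
    return best, best_len


def extend_seed(query, subject, q_start, s_start, w=3, drop_off=8, score=toy_score):
    """Return (segment score, query start, query end exclusive) for one seed."""
    seed = sum(score(query[q_start + k], subject[s_start + k]) for k in range(w))
    right, r_len = extend_one_way(query, subject, q_start + w, s_start + w,
                                  +1, drop_off, score)
    left, l_len = extend_one_way(query, subject, q_start - 1, s_start - 1,
                                 -1, drop_off, score)
    return seed + left + right, q_start - l_len, q_start + w + r_len


if __name__ == "__main__":
    print(extend_seed("AAMKVLLA", "CCMKVLLC", 2, 2))  # (25, 2, 7)
```

Only the residues up to the best-scoring point are kept, so trailing mismatches explored before the cut-off do not lower the reported score.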

The Five BLAST Programs
  BLASTN
    Database: nucleotide
    Query: nucleotide
    Typical uses: mapping oligonucleotides, cDNAs, and PCR products to a genome; screening repetitive elements; cross-species sequence exploration; annotating genomic DNA; clustering sequencing reads; vector clipping.
  BLASTP
    Database: protein
    Query: protein
    Typical uses: identifying common regions between proteins; collecting related proteins for phylogenetic analyses.
  BLASTX
    Database: protein
    Query: nucleotide translated into protein
    Typical uses: finding protein-coding genes in genomic DNA; determining if a cDNA corresponds to a known protein.
  TBLASTN
    Database: nucleotide translated into protein
    Query: protein
    Typical uses: identifying transcripts, potentially from multiple organisms, similar to a given protein; mapping a protein to genomic DNA.
  TBLASTX
    Database: nucleotide translated into protein
    Query: nucleotide translated into protein
    Typical uses: cross-species gene prediction at the genome or transcript level; searching for genes missed by traditional methods or not yet in protein databases.

Distribution of Pairwise Local Alignment Scores (Karlin-Altschul Statistics) - Assumptions
  There is at least one positive score.
  The expected score must be negative.
  The letters of the sequences are i.i.d.
  The sequences are infinitely long.
  The alignment does not contain gaps.

Statistical Distribution of Pairwise Local Alignment Scores
  Given two random protein sequences, the number of distinct, or "locally optimal", MSPs with scores at least $S$ expected to occur simply by chance is
  $$K N e^{-\lambda S}$$
  where
    $N$ = product of the sequence lengths
    $K$ = explicitly calculable parameter
    $\lambda$ = unique positive solution to $\sum_{i,j} p_i p_j e^{\lambda s_{ij}} = 1$
  This is the E value reported by BLAST.
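The defining equation for $\lambda$ can be solved numerically. The sketch below uses plain bisection and then evaluates $K N e^{-\lambda S}$. It assumes the scoring system has at least one positive score and a negative expected score. The four-letter uniform alphabet, the +1/-1 scores, and the value of $K$ in the example are all made up for illustration, and $K$ is simply passed in rather than computed.

```python
import math


def solve_lambda(freqs, score):
    """Positive root of sum_ij p_i p_j exp(lambda * s_ij) = 1, by bisection."""
    letters = list(freqs)

    def f(lam):
        return sum(
            freqs[a] * freqs[b] * math.exp(lam * score(a, b))
            for a in letters for b in letters
        ) - 1.0

    # f(0) = 0, f dips negative (negative expected score), then grows without
    # bound (some positive score), so bracket the positive root.
    hi = 1.0
    while f(hi) <= 0.0:
        hi *= 2.0
    lo = hi
    while f(lo) >= 0.0 and lo > 1e-12:
        lo /= 2.0
    for _ in range(200):
        mid = (lo + hi) / 2.0
        if f(mid) < 0.0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0


def expected_hits(K, m, n, lam, S):
    """E = K * N * exp(-lambda * S) with N = m * n."""
    return K * m * n * math.exp(-lam * S)


if __name__ == "__main__":
    freqs = {c: 0.25 for c in "ACGT"}            # made-up uniform background
    score = lambda a, b: 1 if a == b else -1     # made-up scores
    lam = solve_lambda(freqs, score)
    print(lam, math.log(3))                      # the root is ln 3 for this toy case
    print(expected_hits(0.1, 100, 1000, lam, 10))  # K = 0.1 is a placeholder
```

For this toy case the equation reduces to $e^{\lambda}/4 + 3e^{-\lambda}/4 = 1$, whose positive root is $\lambda = \ln 3$, which gives a closed-form check on the solver.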

Statistical Significance of Pairwise Alignment - Analysis of Headruns
  Two sequences $A_1, A_2, \ldots, A_n$ and $B_1, \ldots, B_n$ with the same length
  The letters are i.i.d.
  Consider a fixed alignment, that is, no shifts
  Consider exact matching, that is, no mismatches or indels

Alignment Scores
  The highest local alignment score
  $$H(A, B) = \max\{ s(I, J) : I \subset A,\ J \subset B \}$$
  indicates the best matching region along sequences $A$ and $B$.
  The length of the longest run of exact matches is
  $$R_n = \max\{ m : A_{i+k} = B_{i+k} \text{ for } k = 1 \text{ to } m,\ 0 \le i \le n - m \}$$
  Flip a coin $n$ times with $p = \Pr(A_i = B_i)$ for heads each time; the longest run of heads corresponds to $R_n$.

Analysis of Headruns - Waterman (1995)
  Heuristics: a headrun of length $m$ has probability $p^m$, and there are about $n$ possible headruns, so
  $$E(\#\text{ headruns of length } m) \approx n p^m$$
  If the largest run is unique, its length $R_n$ should satisfy $n p^{R_n} = 1$, which has the solution
  $$R_n = \log_{1/p} n$$
  Theorem 1. Let $A_1, A_2, \ldots, B_1, B_2, \ldots$ be independent and identically distributed with $0 < p \equiv \Pr(A_1 = B_1) < 1$. Then
  $$\Pr\left( \lim_{n \to \infty} \frac{R_n}{\log_{1/p} n} = 1 \right) = 1$$
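The heuristic $R_n \approx \log_{1/p} n$ is easy to check by simulation. The sketch below draws two i.i.d. sequences over a uniform four-letter alphabet (so $p = 1/4$), finds the longest run of exact matches in the fixed alignment, and compares it with the prediction. The alphabet, sequence length, and seed are arbitrary choices for illustration.

```python
import math
import random


def longest_match_run(a, b):
    """Length of the longest run of positions where a and b agree."""
    best = run = 0
    for x, y in zip(a, b):
        run = run + 1 if x == y else 0
        best = max(best, run)
    return best


def simulate(n, alphabet="ACGT", seed=0):
    """Return (observed longest run, predicted log_{1/p} n) for uniform letters."""
    rng = random.Random(seed)
    a = rng.choices(alphabet, k=n)
    b = rng.choices(alphabet, k=n)
    p = 1.0 / len(alphabet)  # Pr(A_i = B_i) for uniform i.i.d. letters
    predicted = math.log(n) / math.log(1.0 / p)
    return longest_match_run(a, b), predicted


if __name__ == "__main__":
    observed, predicted = simulate(100_000)
    print(observed, round(predicted, 2))
```

For $n = 100{,}000$ the prediction is about 8.3, and the observed longest run typically falls within a few residues of it, growing only logarithmically as $n$ increases.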

Chen-Stein Method of Poisson Approximation

Analysis of Headruns

Distribution of Pairwise Local Alignment Scores - Karlin and Altschul (1990)
  Of course, BLAST score theory is much more complicated:
    Unequal sequence lengths
    Shifting
    Score matrix
    Ungapped case
  Karlin-Altschul distribution of local alignment scores, for sequences of lengths $m$ and $n$:
  $$\Pr(S > x) \approx 1 - \exp(-K m n e^{-\lambda x})$$

E value and p-value
  E value: the number of different alignments with scores equivalent to or better than S that are expected to occur in a database search by chance. The lower the E value, the more significant the score.
  p-value: the probability of an alignment occurring with the score in question or better.

E-value or p-value
  It is more appropriate to rank the importance of an alignment score by p-values, since matches with long sequences can yield larger scores simply due to sequence length.
  The BLAST programs report E-values rather than p-values because it is easier to understand the difference between, for example, E-values of 5 and 10 than between p-values of 0.993 and 0.99995.
  The two are related by $P = 1 - e^{-E}$, so when E < 0.01, p-values and E-values are nearly identical.
  Use the p-value to evaluate cases on the boundary of statistical significance.
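The numbers quoted on this slide follow from the Poisson relation $P = 1 - e^{-E}$, which is what the Karlin-Altschul distribution gives when $E = K m n e^{-\lambda x}$. A minimal sketch, with an illustrative function name:

```python
import math


def p_from_e(e_value):
    """P = 1 - exp(-E); expm1 keeps precision when E is very small."""
    return -math.expm1(-e_value)


if __name__ == "__main__":
    for e in (10, 5, 0.01):
        print(e, p_from_e(e))
```

E-values of 5 and 10 map to p-values of about 0.993 and 0.99995, while for E = 0.01 the p-value differs from E by less than 1%.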

References

BLAST
  Altschul, S.F., Gish, W., Miller, W., Myers, E.W., Lipman, D.J. (1990) Basic local alignment search tool. Journal of Molecular Biology, 215, 403-410
  Altschul, S.F., Madden, T.L., Schaffer, A.A., Zhang, J., Zhang, Z., Miller, W., Lipman, D.J. (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Research, 25(17), 3389-3402

Substitution Matrices
  Dayhoff, M.O., Schwartz, R.M., Orcutt, B.C. (1978) Atlas of Protein Sequence and Structure. National Biomedical Research Foundation, Washington, DC, 5, 345-352
  Henikoff, S., Henikoff, J.G. (1992) Amino acid substitution matrices from protein blocks. Proceedings of the National Academy of Sciences, 89, 10915-10919
  Altschul, S.F. (1991) Amino acid substitution matrices from an information theoretic perspective. Journal of Molecular Biology, 219, 555-565

Distributions of Pairwise Sequence Alignment Scores
  Karlin, S., Altschul, S.F. (1990) Methods for assessing the statistical significance of molecular sequence features by using general scoring schemes. Proceedings of the National Academy of Sciences, 87, 2264-2268
  Arratia, R., Goldstein, L., Gordon, L. (1989) Two moments suffice for Poisson approximation: The Chen-Stein method. Annals of Probability, 17, 9-25
  Waterman, M.S., Vingron, M. (1994) Sequence comparison significance and Poisson approximation. Statistical Science, 9, 367-381
  Siegmund, D., Yakir, B. (2000) Approximate p-values for local sequence alignments. Annals of Statistics, 28(3), 657-680

Thank You!