Giri Narasimhan. CAP 5510: Introduction to Bioinformatics. ECS 254; Phone: x3748

Size: px
Start display at page:

Download "Giri Narasimhan. CAP 5510: Introduction to Bioinformatics. ECS 254; Phone: x3748"

Transcription

1 CP 551: Introduction to Bioinformatics iri Narasimhan ECS 254; Phone: x /29/8 CP551 1

2 enomic Databases Entrez Portal at National Center for Biotechnology Information (NCBI) gives access to: Nucleotide (enbank, EMBL, DDBJ) Protein (PIR, SwissPRO, PRF, and Protein Data Bank or PDB) enome Structure 3D Domains Conserved Domains ene; Uniene; Homoloene; SNP EO Profiles & Datasets Cancer Chromosomes PubMed Central; Journals; Books OMIM Database Neighbors and Interlinking 1/29/8 CP551 2

3 Sequence lignment Why? >gi sp O18381 PX6_DROME Paired box protein Pax-6 (Eyeless protein) MRNLPCLSLIKPSPMEVESSHRHSSSYFYYHLDDECHSVNQLVFV RPLPDSRQKIVELHSRPCDISRILQVSNCVSKILRYYESIRPRISKPRVEVVSKIS QYKRECPSIFWEIRDRLLQENVCNDNIPSVSSINRVLRNLQKEQQSSSSSSNSISKVSV SINVSNVSSRLSSSDLMQPLNSSESSNSESEQEIYEKLRLLNQHPPLEP RPLVQSPNHLRSSHPQLVHNHQLQQHQQQSWPPRHYSSWYPSLSEIPISSPNISVY SPSLHSLSPPNDIESLSIHQRNCPVEDIHLKKELDHQSDESEENSNSNINEDD QRLILKRKLQRNRSFNDQIDSLEKEFERHYPDVFRERLKILPERIQVWFSNRRKWRREEK LRNQRRPNSSSSSSSLDSPNSLSCSSLLSSPSVSINLSSPSLSNVNPL IDSSESPPIPHIRPSCSDNDNRQSEDCRRVCSPCPLVHQNHHIQSNHQHLVPISP RLNFNSSFMYSNMHHLSMSDSYVPIPSFNHSVPLPPSPIPQQDLPSSLYPCHMLRP PPMPHHHIVPDRPVLSQSNLSCSSYEVLSYLPPPPMSSSDSSFSSSS NVPHHIQESCPSPCSSSHFVHSSFSSDPISPVSSYHMSYNYSSNMPSSSSHV PKQQFFSCFYSPWV >gi PX6_HUMN Paired box protein (Oculorhombin) (niridia, type II protein) MQNSHSVNQLVFVNRPLPDSRQKIVELHSRPCDISRILQVSNCVSKILRYYESIRPR ISKPRVPEVVSKIQYKRECPSIFWEIRDRLLSEVCNDNIPSVSSINRVLRNLSEKQQMD MYDKLRMLNQSWRPWYPSVPQPQDCQQQEENNSISSNEDSDEQMRLQLKRKL QRNRSFQEQIELEKEFERHYPDVFRERLKIDLPERIQVWFSNRRKWRREEKLRNQRRQSN PSHIPISSSFSSVYQPIPQPPVSSFSSMLRDLNYSLPPMPSFMNNLPMQPPVPSQ SSYSCMLPSPSVNRSYDYPPHMQHMNSQPMSSLISPVSVPVQVPSEPDMSQYWPR LQ 1/29/8 CP551 3

4 Drosophila Eyeless vs. Human niridia Query: 57 HSVNQLVFVRPLPDSRQKIVELHSRPCDISRILQVSNCVSKILRYYE 116 HSVNQLVFV RPLPDSRQKIVELHSRPCDISRILQVSNCVSKILRYYE Sbjct: 5 HSVNQLVFVNRPLPDSRQKIVELHSRPCDISRILQVSNCVSKILRYYE 64 Query: 117 SIRPRISKPRVEVVSKISQYKRECPSIFWEIRDRLLQENVCNDNIPSVSSIN 176 SIRPRISKPRV EVVSKI+QYKRECPSIFWEIRDRLL E VCNDNIPSVSSIN Sbjct: 65 SIRPRISKPRVPEVVSKIQYKRECPSIFWEIRDRLLSEVCNDNIPSVSSIN 124 Query: 177 RVLRNLQKEQ 188 RVLRNL++K+Q Sbjct: 125 RVLRNLSEKQQ 136 Query: 417 EDDQRLILKRKLQRNRSFNDQIDSLEKEFERHYPDVFRERLKILPERIQV Q RL LKRKLQRNRSF +QI++LEKEFERHYPDVFRERL KI LPERIQV Sbjct: 197 SDEQMRLQLKRKLQRNRSFQEQIELEKEFERHYPDVFRERLKIDLPERIQV 256 Query: 477 WFSNRRKWRREEKLRNQRR 496 WFSNRRKWRREEKLRNQRR Sbjct: 257 WFSNRRKWRREEKLRNQRR 276 E-Value = 2e-31 1/29/8 CP551 4

5 Implications of Sequence lignment Mutation in DN is a natural evolutionary process. hus sequence similarity may indicate common ancestry. In biomolecular sequences (DN, RN, protein), high sequence similarity implies significant structural and/or functional similarity. 1/29/8 CP551 5

6 Discovery based on alignments Early 197s: Simian sarcoma virus causes cancer in some species of monkeys. 97s: infection by certain viruses cause some cells in culture (in vitro) to grow without bounds. Hypothesis: Certain genes (oncogenes) in viruses encode cellular growth factors, which are proteins needed to stimulate growth of a cell colony. hus uncontrolled quantities of growth factors produced by the infected cells cause cancer-like behavior. 983: he oncogene from SSV called v-sis was isolated and sequenced. he partial amino-acid sequence for platelet-derived growth factor (PDF) was sequenced and published. It stimulates the proliferation of normal cells. R.F. Doolittle was maintaining one of the earliest home-grown databases of published amino-acid sequences. Sequence lignment of v-sis and PDF showed something surprising. 1/29/8 CP551 6

7 PDF and v-sis One region of 31 amino acids had 26 exact matches nother region of 39 residues had 35 exact matches. Conclusion: he previously harmless virus incorporates the normal growth-related gene (proto-oncogene) of its host into its genome. he gene gets mutated in the virus, or moves closer to a strong enhancer, or moves away from a repressor. his causes an uncontrolled amount of the product (the growth factor, for example) when the virus infects a cell. Several other oncogenes known to be similar to growth-regulating proteins in normal cells. 1/29/8 CP551 7

8 V-sis Oncogene - Homologies Score E Sequences producing significant alignments: (bits) Value gi gb J SE_SSVPCS2 Simian sarcoma virus v-si gi emb V121.1 RESSV1 Simian sarcoma virus proviral gi gb J SE_SSVPCS1 Simian sarcoma virus LR gi gb U LU2589 ibbon leukemia virus envelo gi ref NM_268.1 Homo sapiens platelet-derived g gi gb BC Homo sapiens, platelet-derived g gi gb M HUMSISPD Human c-sis/platelet-derive /29/8 CP551 8

9 Sequence lignment >gi ref NM_268.1 Homo sapiens platelet-derived growth factor beta polypeptide (simian sarcoma viral (v-sis) oncogene homolog) (PDFB), transcript variant 1, mrn Length = 3373 Score = 954 bits (481), Expect =. Identities = 634/681 (93%), aps = 3/681 (%) Strand = Plus / Plus Query: 115 agggggaccccattcctgaggagctctataagatgctgagtggccactcgattcgctcct 174 Sbjct: 184 agggggaccccattcccgaggagctttatgagatgctgagtgaccactcgatccgctcct 1143 > 21 E D P I P E E L Y E M L S D H S I R S Query: 175 tcgatgacctccagcgcctgctgcagggagactccggaaaagaagatggggctgagctgg 1134 Sbjct: 1144 ttgatgatctccaacgcctgctgcacggagaccccggagaggaagatggggccgagttgg 123 > 61 D L N M R S H S E L E S L R R 1/29/8 CP551 9

10 Sequence lignment Sequence 1 gi Simian sarcoma virus v-sis transforming protein p28 gene, complete cds; and 3' LR long terminal repeat, complete sequence. Length 2984 ( ) Sequence 2 gi Homo sapiens platelet-derived growth factor beta polypeptide (simian sarcoma viral (v-sis) oncogene homolog) (PDFB), transcript variant 1, mrn Length 3373 ( ) 1/29/8 CP551 1

11 Similarity vs. Homology Homologous sequences share common ancestry. Similar sequences are near to each other by some appropriately defined measurable criteria. 1/29/8 CP551 11

12 ypes of Sequence lignments - 1 lobal HIV Strain 1 HIV Strain 2 lobal lignment: similarity over entire length Local Local lignment: no overall similarity, but some segment(s) is/are similar 1/29/8 CP551 12

13 ypes of Sequence lignments - 2 Semi-lobal Semi-global lignment: end segments may not be similar Strain 1 Multiple Strain 2 Strain 3 Multiple lignment: similarity between sets of sequences Strain 4 1/29/8 CP551 13

14 lobal: Local: Sequence lignment Needleman-Wunsch-Sellers (197). Smith-Waterman (1981) Useful when commonality is small and global alignment is meaningless. Often unaligned portions mask short stretches of aligned portions. Example: comparing long stretches of anonymous DN; aligning proteins that share only some motifs or domains. Dynamic Programming (DP) based. 1/29/8 CP551 14

15 Why gaps? Example: Finding the gene site for a given (eukaryotic) cdn requires gaps. What is cdn? cdn = Copy DN DN ranscription cdn ranslation mrn Reverse ranscription Protein 1/29/8 CP551 15

16 How to score mismatches? 1/29/8 CP551 16

17 BLS & FS FS [Lipman, Pearson 85, 88] Basic Local lignment Search ool [ltschul, ish, Miller, Myers, Lipman 9] 1/29/8 CP551 17

18 BLS Overview Program(s) to search all sequence databases remendous Speed/Less Sensitive Statistical Significance reported WWWBLS, QBLS (send now, retrieve results later), Standalone BLS, BLScl3 (Client version, CP/IP connection to NCBI server), BLS URLPI (to access QBLS, no local client) 1/29/8 CP551 18

19 BLS Strategy & Improvements Lipman et al.: speeded up finding runs of hot spots. Eugene Myers 94: Sublinear algorithm for approximate keyword matching. Karlin, ltschul, Dembo 9, 91: Statistical Significance of Matches 1/29/8 CP551 19

20 Why aps? Example: ligning HIV sequences. Strain 1 Strain 2 Strain 3 Strain 4 1/29/8 CP551 2

21 BLS Variants Nucleotide BLS Standard blastn MEBLS (Compare large sets, Near-exact searches) Short Sequences (higher E-value threshold, smaller word size, no low-complexity filtering) Protein BLS Standard blastp PSI-BLS (Position Specific Iterated BLS) PHI-BLS (Pattern Hit Initiated BLS; reg expr. Or Motif search) Short Sequences (higher E-value threshold, smaller word size, no low-complexity filtering, PM-3) ranslating BLS Blastx: Search nucleotide sequence in protein database (6 reading frames) blastn: Search protein sequence in nucleotide db blastx: Search nucleotide seq (6 frames) in nucleotide DB (6 frames) 1/29/8 CP551 21

22 BLS Cont d RPS BLS Compare protein sequence against Conserved Domain DB; Helps in predicting rough structure and function Pairwise BLS blastp (2 Proteins), blastn (2 nucleotides), tblastn (protein-nucleotide w/ 6 frames), blastx (nucleotide-protein), tblastx (nucleotide w/6 framesnucleotide w/ 6 frames) Specialized BLS Human & Other finished/unfinished genomes P. falciparum: Search ESs, SSs, SSs, Hs VecScreen: screen for contamination while sequencing IgBLS: Immunoglobin sequence database 1/29/8 CP551 22

23 BLS Credits Stephen ltschul Jonathan Epstein David Lipman om Madden Scott Mcinnis Jim Ostell lex Schaffer Sergei Shavirin Heidi Sofia Jinghui Zhang 1/29/8 CP551 23

24 Protein Databases used by BLS nr (everything), swissprot, pdb, alu, individual genomes Nucleotide Misc nr, dbest, dbsts, htgs (unfinished genomic sequences), gss, pdb, vector, mito, alu, epd 1/29/8 CP551 24

25 Rules of humb Most sequences with significant similarity over their entire lengths are homologous. Matches that are > 5% identical in a 2-4 aa region occur frequently by chance. Distantly related homologs may lack significant similarity. Homologous sequences may have few absolutely conserved residues. homologous to B & B to C homologous to C. Low complexity regions, transmembrane regions and coiled-coil regions frequently display significant similarity without homology. reater evolutionary distance implies that length of a local alignment required to achieve a statistically significant score also increases. 1/29/8 CP551 25

26 Rules of humb Results of searches using different scoring systems may be compared directly using normalized scores. If S is the (raw) score for a local alignment, the normalized score S' (in bits) is given by S' = ln( ) ln(2) he parameters depend on the scoring system. Statistically significant normalized score, N S' > log E where E-value = E, and N = size of search space. 1/29/8 CP551 26

27 ypes of Sequence lignments lobal HIV Strain 1 HIV Strain 2 Local Semi-lobal Strain 1 Multiple Strain 2 Strain 3 Strain 4 1/29/8 CP551 27

28 lobal lignment: n example V: C W: C iven [I, J] = Score of Matching Recurrence Relation the I th character of sequence V & S[I, J] = MXIMUM { the J th character of sequence W S[I-1, J-1] + (V[I], W[J]), Compute S[I-1, J] + (V[I], ), S[I, J] = Score of Matching S[I, J-1] + (, W[J]) } First I characters of sequence V & First J characters of sequence W 1/29/8 CP551 28

29 lobal lignment: n example V: C W: C S[I, J] = MXIMUM { S[I-1, J-1] + (V[I], W[J]), S[I-1, J] + (V[I], ), S[I, J-1] + (, W[J]) } 1/29/8 CP551 29

30 raceback V: C W: C - - 1/29/8 CP551 3

31 lternative raceback V: - C W: - C - - V: C W: C - - Previous 1/29/8 CP551 31

32 Improved raceback C C /29/8 CP551 32

33 Improved raceback C C /29/8 CP551 33

34 Improved raceback C C V: - C W: - C - - 1/29/8 CP551 34

35 Subproblems Optimally align V[1..I] and W[1..J] for every possible values of I and J. Having optimally aligned V[1..I-1] and W[1..J-1] V[1..I] and W[1..J-1] V[1..I-1] and W[1, J] it is possible to optimally align V[1..I] and W[1..J] O(mn), where m = length of V, and n = length of W. 1/29/8 CP551 35

36 eneralizations of Similarity Function Mismatch Penalty = Spaces (Insertions/Deletions, InDels) = ffine ap Penalties: (ap open, ap extension) = (, ) Weighted Mismatch = (a,b) Weighted Matches = (a) 1/29/8 CP551 36

37 lternative Scoring Schemes C C Match +1 Mismatch 2 ap (-2, -1) V: C W: - C - - 1/29/8 CP551 37

38 Local Sequence lignment Example: comparing long stretches of anonymous DN; aligning proteins that share only some motifs or domains. Smith-Waterman lgorithm 1/29/8 CP551 38

39 Recurrence Relations (lobal vs Local lignments) S[I, J] = MXIMUM { S[I-1, J-1] + (V[I], W[J]), S[I-1, J] + (V[I], ), S[I, J-1] + (, W[J]) } S[I, J] = MXIMUM {, S[I-1, J-1] + (V[I], W[J]), S[I-1, J] + (V[I], ), S[I, J-1] + (, W[J]) } lobal lignment Local lignment 1/29/8 CP551 39

40 Local lignment: Example C C Match +1 Mismatch 1 ap (-1, -1) V: - C W: - - C - - 1/29/8 CP551 4

41 Properties of Smith-Waterman lgorithm How to find all regions of high similarity? Find all entries above a threshold score and traceback. What if: Matches = 1 & Mismatches/spaces =? Longest Common Subsequence Problem What if: Matches = 1 & Mismatches/spaces = -? Longest Common Substring Problem What if the average entry is positive? lobal lignment 1/29/8 CP551 41

42 How to score mismatches? 1/29/8 CP551 42

43 BLOSUM n Substitution Matrices For each amino acid pair a, b For each BLOCK lign all proteins in the BLOCK Eliminate proteins that are more than n% identical Count F(a), F(b), F(a,b) Compute Log-odds Ratio log F( a, b) F( a) F( b) 1/29/8 CP551 43