Meeting Your Protein In Silico. Getting To Know Your Protein. Comparative Protein Analysis. Comparative Protein Analysis. Syllabus. Phylogenetic Trees

Size: px
Start display at page:

Download "Meeting Your Protein In Silico. Getting To Know Your Protein. Comparative Protein Analysis. Comparative Protein Analysis. Syllabus. Phylogenetic Trees"

Transcription

1 Meeting Your Protein In Silico etting o Know Your Protein omparative Protein nalysis: Part I. Phylogenetic rees and Multiple Sequence lignments Robert Latek, Ph Sr. ioinformatics Scientist Whitehead Institute for iomedical Research efine and characterize your favorite sequence Identify homologous sequences Predict function Examine potential mutations Study in 3 Make manuscript figures! WIR ioinformatics ourses, Whitehead Institute, omparative Protein nalysis efinition Use information regarding a group of sequences to determine the function of an undefined sequence. Extract novel information about a protein, or a series of proteins, through comparisons with other, related sequences. pplication What are they? What are their functions? Why are they important? omparative Protein nalysis Identify proteins within an organism that are related to each other and across different species enerate an evolutionary history of related genes Locate insertions, deletions, and substitutions that have occurred during evolution REE RESE!RELPSE (ncestor) ime RESER WIR ioinformatics ourses, Whitehead Institute, WIR ioinformatics ourses, Whitehead Institute, Syllabus Phylogenetic rees Phylogenetic rees Multiple Sequence lignments From rees and MSs to Manuscript Figures Exercises graph representing the evolutionary history of a sequence Relationship of one sequence to other sequences issect the order of appearance of insertions, deletions, and mutations Predict function, observe epidemiology, analyzing changes in viral strains Simple ree WIR ioinformatics ourses, Whitehead Institute, WIR ioinformatics ourses, Whitehead Institute,

2 Rooted ree Shapes ranches intersect at Nodes Leaves are the topmost branches Un!rooted ree haracteristics ree Properties lade: all the descendants of a common ancestor represented by a node istance: number of changes that have taken place along a branch ree ypes ladogram: shows the branching order of nodes Phylogram: shows branching order and distances Phylogram,-",-,--),-".,-+,- WIR ioinformatics ourses, Whitehead Institute, WIR ioinformatics ourses, Whitehead Institute, ree uilding lgorithms Maximum Parsimony istance Methods UPM Neighbor Joining Maximum Likelihood WIR ioinformatics ourses, Whitehead Institute, ene One wo hree Four Maximum Parsimony lignment Site " + Site " Informative rees ree I ree II ree III &Select ree I' &Li( ))' Find the tree that changes one sequence into all of the others by the least number of steps [Focus solely on end product sequences, ignore evolutionary history] Only informative sites are analyzed (no gaps or conserved positions) an be misleading when rates of change vary in different tree branches WIR ioinformatics ourses, Whitehead Institute, istance Methods istance is expressed as the fraction of sites that differ between two sequences in an alignment Sequences with the smallest number of changes (shortest distance) are related taxa WIR ioinformatics ourses, Whitehead Institute, istance Methods - UPM UPM (Unweighted Pair-roup Method with rithmetic mean) Sequentially find pair of taxa with smallest distance between them, and define branching as midpoint of two ssumes the tree is additive and that rate of change is constant in all of the branches &' &' WIR ioinformatics ourses, Whitehead Institute,

3 istance Methods - NJ Neighbor-Joining (NJ): useful when there are different rates of evolution within a tree Each possible pair-wise alignment is examined. alculate distance from each sequence to every other sequence hoose the pair with the lowest distance value and join them to produce the minimal length tree Update distance matrix where joined node is substituted for two original taxa and then repeat process H F E H WIR ioinformatics ourses, Whitehead Institute, F H E F Maximum Likelihood est accounts for variation in sequences Establish a probabilistic model with multiple solutions and determine which is most likely ll possible trees are considered, therefore, only suitable for small number of sequences Maximizes probability of finding optimal tree WIR ioinformatics ourses, Whitehead Institute, ree Reliability Which Method to Use? Probability that the members of a clade are always members of that clade Sample by ootstrapping Random sites of an alignment are randomly sampled so as to create a dataset the same size as the original. he same analysis as applied to the original data set is performed on the bootstrap dataset onstruct a consensus bootstrap tree and compare to the original tree Is there strong sequence similarity? no Is there clearly recognizable sequence similarity? no Maximum Likelihood yes yes Maximum Parsimony istance Methods &Mount( --' WIR ioinformatics ourses, Whitehead Institute, WIR ioinformatics ourses, Whitehead Institute, Syllabus Multiple Sequence lignments Phylogenetic rees Multiple Sequence lignments From rees and MSs to Manuscript Figures Exercises Place residues in columns that are derived from a common ancestral residue MS can reveal sequence patterns emonstration of homology between >2 sequences Identification of functionally important sites Protein function prediction Structure prediction Search for weak but significant similarities in databases esign PR primers for related gene identification enome sequencing: contig assembly REE RESER RESE RELPSE Seq RE--E- Seq RES--E- Seq RES--ER Seq -RELPSE WIR ioinformatics ourses, Whitehead Institute, WIR ioinformatics ourses, Whitehead Institute,

4 Multiple Sequence lignment WIR ioinformatics ourses, Whitehead Institute, 2004 Multiple Sequence lignment 19 pproaches WIR ioinformatics ourses, Whitehead Institute, 2004 lobal Progressive lignment Optimal lobal lignments -ynamic programming uild matrices with every possible combination and search for optimal solution lign 10 sequences of 100 aa length = possibilities Optimal in the mathematical sense lobal Progressive lignments - Match most common sequences together lobal Iterative lignments - Multiple re-building attempts to find best alignment Local alignments heuristic approach that utilizes phylogenetic information to assist in routing the alignment (clustalw/clustalx) Feng & oolittle1987, Higgins and Sharp 1988 Most alike sequences are aligned together in order of their similarity (tree-based), a consensus is determined and then aligned to next most similar sequence WIR ioinformatics ourses, Whitehead Institute, Iterative Multiple lignment Initial Progressive lignment uild ree Weight ased On Subgroup lignments WIR ioinformatics ourses, Whitehead Institute, 2004 Seq2 Seq3 Seq4 VMR VMK MK MV VMR VMK = VMR/K VMR/K MK = V/ M R/K VMR V/ M R/K VMK MV MK = V/ M V/R/K WIR ioinformatics ourses, Whitehead Institute, MS and ree Relationship he optimal alignment of several sequences can be thought of as minimizing the number of mutational steps in an evolutionary tree for which the sequences are the leaves (Mount, 2001) -L N F L S seq Y to F -L N Y L S +K Iterate MS Seq1 MV Profiles, locks, Patterns Repeatedly re-align subgroups of sequences into a global alignment to improve alignment score (Mount, 2001) Start with a progressive alignment and tree Recalculate pair-wise scores during progressive alignment, use new scores to rebuild the tree, which is used to improve alignments 20 N F - S seq N K Y L S seq &Mount( --' N Y L S seq 23 WIR ioinformatics ourses, Whitehead Institute,

5 Syllabus Manuscript Figures 101 Phylogenetic rees Multiple Sequence lignments From rees and MSs to Manuscript Figures Exercises MPLQ RSV ILWMI Sequence >gb F >tigr >pdb 1ao3 LS VMR- VMK MKS MV - lignment ree Figure Figure WIR ioinformatics ourses, Whitehead Institute, WIR ioinformatics ourses, Whitehead Institute, Find Related Sequences LS MLEILKLVKSKKLSSSSSYLEELQRPVSFEPQLSERWNSKENLLPSENPNLFVLY FVSNLSIKEKLRVLYNHNEWEQKNQWVPSNYIPVNSLEKHSWYHPVSRNEYL LSSINSFLVRESESSPQRSISLRYERVYHYRINSKLYVSSESRFNLELVHHHSVLI LHYPPKRNKPVYVSPNYKWEMERIMKHKLQYEVYEVWKKYSLVVKLKEMEV EEFLKEVMKEIKHPNLVQLLVREPPFYIIEFMYNLLYLRENRQEVNVVLLYMQISS MEYLEKKNFIHRLRNLVENHLVKVFLSRLMYHKFPIKWPESLYNKFSIKS VWFVLLWEIYMSPYPILSQVYELLEKYRMERPEPEKVYELMRWQWNPSRPSFEIH QFEMFQESSISEVEKELKQVRVSLLQPELPKRSRREHRVPEMPHSKQES PLHEPVSPLLPRKERPPELNEERLLPKKKNLFSLIKKKKKPPPKRSSSFREMQPER REEERISNLFPLPKSPKPSNVPNLRESSFRSPHLWKKSSLSSRL EEESSSKRFLRSSSVPHKEWRSVLPRLQSRQFSSFHKSEKPLPRKREN RSQVRVPPPRLVKKNEEEVFKIMESSPSSPPNLPKPLRRQVVPSLPHKEEKS LPEPVPSKSPSKPEESRVRRHKHSSESPRKKLSRLKPPPPPPSK KPSQSPSQEEVLKKSLVVNSKPSQPELKKPVLPPKPQSKPSPISPP VPSLPSSSLQPSSFIPLISRVSLRKRQPPERISIKVVLSELLISRNSEQM SHSVLEKNLYFVSYVSIQQMRNKFFREINKLENNLRELQIPSPQFSKLLSS VKEISIVQR WIR ioinformatics ourses, Whitehead Institute, ompile & lign Sequences ompile Sequences into FS format >Human MPLYKFSW >Mouse MSYILQINS >Rat MKKP.. >Murine_Leukemia_Virus MSR. lign P: www-igbmc.u-strasbg.fr/ioinfo/lustalx/op.html OS X: Web: pir.georgetown.edu/pirwww/search/multaln.html WIR ioinformatics ourses, Whitehead Institute, reate tree 3. uild ree lustalx Neighbor Joining method raw tree reeview: taxonomy.zoology.gla.ac.uk/rod/treeview.html Web: iubio.bio.indiana.edu/treeapp/treeprint-form.html WIR ioinformatics ourses, Whitehead Institute, reate Figures MSs are often multipage onvert to PF with crobat istiller or open with hostview ( or Extract pages individually and save as separate PF/PS files Open images in favorite illustration program Export annotated alignments/trees to Powerpoint Publish paper, give award winning presentation! WIR ioinformatics ourses, Whitehead Institute,

6 Exercise I LS your sequence NI LS ollate and edit sequences in a text editor Perform multiple sequence alignment lustalx Refine lignment in Jalview uild Phylogenetic ree lustalx and reeview Manage Postscript Files hostview reate Figure Illustrator - > Powerpoint References ioinformatics: Sequence and genome nalysis. avid W. Mount. SHL Press, ioinformatics: Practical uide to the nalysis of enes and Proteins. ndreas. axevanis and.f. Francis Ouellete. Wiley Interscience, ioinformatics: Sequence, structure, and databanks. es Higgins and Willie aylor. Oxford University Press, WIR ioinformatics ourses, Whitehead Institute, WIR ioinformatics ourses, Whitehead Institute,