
Chapter 7: Similarity searches on sequence databases

"All science is either physics or stamp collecting." (Ernest Rutherford)

Outline
- Why is similarity important
- BLAST: protein and DNA
- Interpreting BLAST
- Individualizing BLAST
- PSI-BLAST

Evolution
- Figure: evolution of the forelimbs of vertebrates.
- Evolution has duplicated and shuffled bits and pieces of molecules to produce new linear arrangements that combine function in novel ways.
- Regions of similarity often suggest an evolutionary tie and/or common functional properties between very different molecules.

Adaptive convergence
- Shared morphology does NOT necessarily imply common ancestry.
- When similarity is due to common ancestry, we call it homology.

Common similarity problems
- Start with a query sequence with unknown properties and search a database of millions of sequences to find those that share similarity with the query.
- Start with a small set of sequences and identify similarities and differences among them.
- In many sequences or very long sequences, detect commonly occurring patterns.

Common similarity problems (rephrased)
- One against many
- Common among several
- Common part of many

How homology helps
Given molecular sequences X and Y:
X ~ Y AND INFO(Y) => INFO(X)   (~ means "similar")
Are the sequences similar?

Why is similarity important
- Similar sequences (homologues) often derive from the same ancestor, share the same structure, and have similar biological function.
- This allows extrapolation of findings from one sequence to another.
- Similarity judgements should be based on:
  - the types of changes or mutations that occur within sequences,
  - the characteristics of those different types of mutations,
  - the frequency of those mutations.

Crude similarity thresholds
- Proteins: 25% similarity
- Nucleic acids: 75% similarity
- Below 25%/75% lies the twilight zone, where everything is possible.
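The crude thresholds above can be checked with a few lines of Python. This is only a sketch: it uses percent identity over aligned, non-gap columns as a stand-in for "similarity", which is one of several possible definitions, and the function names are illustrative rather than part of any BLAST tool.

```python
def percent_identity(aligned_a, aligned_b):
    """Percent of identical residues over columns where neither sequence has a gap."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    # Keep only columns without a gap character ("-") in either sequence.
    pairs = [(a, b) for a, b in zip(aligned_a, aligned_b) if a != "-" and b != "-"]
    if not pairs:
        return 0.0
    matches = sum(a == b for a, b in pairs)
    return 100.0 * matches / len(pairs)


def above_crude_threshold(identity, is_protein):
    """Apply the crude 25% (protein) / 75% (nucleic acid) rule of thumb."""
    return identity >= (25.0 if is_protein else 75.0)


identity = percent_identity("ACGT-A", "ACCTTA")
print(identity, above_crude_threshold(identity, is_protein=False))
```

A pair that falls below the threshold is not necessarily unrelated; it is in the twilight zone, where the refined criteria (E-value, alignment length, conservation patterns, indels) are needed.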

Refined similarity thresholds
- E-value (expectation value): how likely the result is by chance.
- Length of the segments that are similar between two sequences.
- Patterns of amino acid (aa) conservation.
- Number of indels.

BLAST
- Basic Local Alignment Search Tool.
- BLAST at NCBI (http://www.ncbi.nlm.nih.gov/blast) and BLAST at EMBnet (http://www.ch.embnet.org/software/ablast.html) use different databases and therefore yield slightly different results.
- Standard BLAST uses a substitution matrix (e.g. PAM or BLOSUM) that rewards identical matches, gives positive points for similar aa, and penalizes different aa.

Different BLASTs
- blastp: compares your protein with a protein database.
- tblastn: compares your protein with a nucleotide database (t is for translated).

Protein vs. nucleotide database
- There are six ways to translate DNA to protein: direct and reverse strand, 3 reading frames each.
- tblastn runs all 6 possibilities.

BLASTing protein at NCBI
- Input your sequence 5' to 3' (N to C).
- You run a query sequence against target databases to get hits (matches).
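The six translations mentioned above (two strands, three reading frames each) can be sketched as follows. This is an illustration of the idea using the standard genetic code, not the code tblastn itself runs; the frame labels "+1" to "-3" and the function names are chosen here for convenience.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order (third base varies fastest).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the reverse strand, read 5' to 3'."""
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna):
    """Translate complete codons only; unknown codons become 'X', stops become '*'."""
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3))


def six_frame_translations(dna):
    """Translate both strands in all three reading frames."""
    dna = dna.upper()
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames


for name, protein in six_frame_translations("ATGGCCTAA").items():
    print(name, protein)
```

Each of the six protein strings is then a candidate to compare against a protein query, which is why a protein can be searched against a nucleotide database at all.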

blastp input by accession no.

blastp input by sequence
- FASTA
- CD (conserved domain) search deselected

Intermediate result

Waiting for results
- European server: http://www.ebi.ac.uk/blast2/
- If the page indicates that the search would take more than 10 minutes, then use another BLAST server.
- In the morning, use the USA server: http://genome.wustl.edu/blast/client.pl
- In the afternoon, use the Japan server: http://www.ddbj.nig.ac.jp/search/blaste.html
- Click just once.

BLAST output
- Graphics: shows where your query is similar to other sequences.
- Hit list: ranked names of similar sequences.
- Alignments: one to one.
- Parameters used for the search.

Graphics part
- Pass the mouse over a bar to see more.

Hit list
- Accession number and description.
- Score (bits): must be >50 to be reliable.
- E-value: expectation of a match by chance (given the database); must be <0.001 to be reliable.

Alignment
- Alignments do NOT lie if you know how to look at them.
- x means masking (low-complexity segment)
- + means similarity
- consensus line
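The two hit-list rules of thumb above (bit score >50, E-value <0.001) are easy to apply programmatically once the hits are in a table. The sketch below assumes a hypothetical `Hit` record and made-up accessions and values; it does not parse real BLAST output.

```python
from typing import NamedTuple


class Hit(NamedTuple):
    accession: str    # hypothetical identifiers, for illustration only
    bit_score: float
    e_value: float


def reliable_hits(hits, min_bits=50.0, max_e_value=1e-3):
    """Keep hits that pass both rules of thumb: score > 50 bits and E-value < 0.001."""
    return [h for h in hits if h.bit_score > min_bits and h.e_value < max_e_value]


# Made-up example data, not results of a real search.
hits = [
    Hit("HYPO_0001", 312.0, 2e-85),
    Hit("HYPO_0002", 48.5, 1e-4),   # fails the bit-score rule
    Hit("HYPO_0003", 55.0, 0.02),   # fails the E-value rule
]
print([h.accession for h in reliable_hits(hits)])
```

Hits that fail either rule are not necessarily wrong, but their alignments deserve a closer look before any conclusion is drawn.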

Saving BLAST results
- Reproducibility over time is low because the database, the BLAST program, and the default program parameters change over time.
- Convert to PDF.
- Save as Complete Webpage.
- Save Picture as.

Common mistake
- "Friends of my friends are my friends." NOT necessarily.
- BLAST runs local alignments; hits are NOT transitive unless the alignments overlap.
  Sequence 1: AAAAATTTTTT
  Sequence 2: AAAAA
  Sequence 3: TTTTTT

BLASTing nucleic acid

BLASTs for DNA
- blastn: DNA against DNA; for noncoding DNA.
- tblastx: translated DNA against translated DNA; for protein discovery.
- blastx: translated DNA against protein; for proteins encoded in your query DNA and for DNA sequences of unknown quality.

Using filters
- Correct database (nt/protein)
- Organism database
- Repeats
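Returning to the "Common mistake" slide: the three example sequences show why local hits are not transitive. The sketch below uses the longest common substring as a very rough stand-in for a local alignment (real BLAST scoring is more involved); `shared_segment` is an illustrative helper, not a BLAST function.

```python
from difflib import SequenceMatcher


def shared_segment(a, b):
    """Longest exact common substring, used here as a crude proxy for a local hit."""
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    match = matcher.find_longest_match(0, len(a), 0, len(b))
    return a[match.a:match.a + match.size]


seq1 = "AAAAATTTTTT"
seq2 = "AAAAA"
seq3 = "TTTTTT"

print(repr(shared_segment(seq1, seq2)))  # sequence 1 matches sequence 2 ...
print(repr(shared_segment(seq1, seq3)))  # ... and sequence 1 matches sequence 3 ...
print(repr(shared_segment(seq2, seq3)))  # ... but sequences 2 and 3 share nothing
```

Sequence 1 hits both others, but through different, non-overlapping regions, so nothing can be concluded about the relationship between sequences 2 and 3.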

Use of BLAST
- Finding genes in a genome
- Predicting a protein function
- Predicting a protein 3-D structure
- Finding protein family members

Finding genes in a genome
- Quick and dirty BLAST way: cut your genome into 5 kb overlapping sequences and run blastx against the nonredundant (NR) protein database for every piece.
- Proper way: run gene prediction software.

Predicting a protein function
- Quick and dirty BLAST way: use blastp against Swiss-Prot. If there is >25% identity over the whole protein length, then you know the function of your protein.
- Proper way: conduct domain analysis and wet-lab (bluefingers) experiments.

Predicting a protein 3-D structure
- Quick and dirty BLAST way: use blastp against PDB. If there is >25% identity over the whole protein length, then you know the probable structure of your protein.
- Proper way: conduct homology modelling, X-ray, and NMR experiments.

Finding protein family members
- Quick and dirty BLAST way: use blastp (or PSI-BLAST) against the nonredundant protein database. Make a multiple sequence alignment of all members of the family and draw a phylogenetic tree.
- Proper way: clone new family members using PCR.

BLAST parameters
"Power is nothing without control."
Reasons for changing default parameters:
- the sequence has a biased composition (use masking),
- NO results (change the substitution matrix and gap penalties),
- too many results (change the NR database to Swiss-Prot, use an Entrez keyword with Boolean operators, and lower the E-value threshold to make it more stringent),
- testing the robustness of findings.
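The first step of the quick and dirty gene-finding recipe above, cutting a genome into 5 kb overlapping pieces, might look like the sketch below. The slides do not say how large the overlap should be; the 1 kb default here is purely an assumption, as is the function name.

```python
def overlapping_chunks(sequence, size=5000, overlap=1000):
    """Yield (start, piece) tuples covering the sequence with overlapping windows.

    size=5000 follows the 5 kb suggestion; overlap=1000 is an assumed value.
    """
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    step = size - overlap
    for start in range(0, len(sequence), step):
        yield start, sequence[start:start + size]
        # Stop once a window has reached the end of the sequence.
        if start + size >= len(sequence):
            break


# Tiny demonstration with small numbers so the windows are easy to inspect.
for start, piece in overlapping_chunks("ABCDEFGHIJ", size=4, overlap=2):
    print(start, piece)
```

Each piece would then be submitted as a separate blastx query; the overlap reduces the chance that a gene is split across a window boundary and missed.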

BLAST protein masking
- Low-complexity regions (many prolines, many glutamic acids) cause false matches.
- Masking is done by replacement with X.
- Use InterPro, CD search, or Pfscan to find and mask common domains (e.g. Zn finger domain and fibronectin domain).

BLAST DNA masking
- 60% of human DNA consists of repeats.
- Large-scale genome sequencing introduces errors, such as remnants of vectors in the human database.

BLAST output

PSI-BLAST
- Position-Specific Iterated BLAST.
- For distantly related sequences.
- 1st iteration finds relatives by blastp with the BLOSUM62 matrix.
- 2nd iteration uses the results of the 1st run to generate a new, position-specific substitution matrix (the same aa is penalized differently at different positions) and looks for more relatives.
- 3rd
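The idea behind the second PSI-BLAST iteration, that the same amino acid is scored differently at different positions, can be illustrated with a toy position-specific scoring matrix. This is a heavily simplified sketch and not PSI-BLAST's actual procedure: it assumes gap-free aligned hits of equal length, a uniform background frequency, and a simple pseudocount, all of which are assumptions made here for brevity.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_pssm(aligned_hits, pseudocount=1.0):
    """Return one dict of log-odds scores (in bits) per alignment column."""
    background = 1.0 / len(AMINO_ACIDS)  # assumed uniform background frequency
    pssm = []
    for column in range(len(aligned_hits[0])):
        counts = Counter(seq[column] for seq in aligned_hits if seq[column] in AMINO_ACIDS)
        total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
        pssm.append({
            aa: math.log2(((counts[aa] + pseudocount) / total) / background)
            for aa in AMINO_ACIDS
        })
    return pssm


# Made-up three-residue "hits" from an imagined first iteration.
pssm = build_pssm(["ACD", "ACE", "ACD"])
print(round(pssm[0]["A"], 2), round(pssm[2]["A"], 2))  # same aa, different positions
```

Alanine scores well in the first column, where every hit has it, and poorly in the third, where none does; a fixed matrix such as BLOSUM62 cannot express that difference.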

PSI-BLASTing protein
http://www.ncbi.nlm.nih.gov/blast/blast.cgi?cmd=web&layout=twowindows&auto_format=semiauto&ALIGNMENTS=250&ALIGNMENT_VIEW=Pairwise&client=web&COMPOSITION_BASED_STATISTICS=on&DATABASE=nr&CDD_SEARCH=on&DESCRIPTIONS=500&ENTREZ_QUERY=(none)&EXPECT=10&FORMA

PSI-BLASTing format

PSI-BLAST output
- check box: the hit will be used for the next iteration; the selection can be edited
- green dot: the hit was used in previous iterations
- new: reported as a hit for the first time

Avoiding mistakes with PSI-BLAST
- When we look for hemoglobin and alcohol dehydrogenase appears among the hits after the 2nd iteration, it is time to stop.
- Read the annotation to distinguish between an interesting finding and a false finding.
- Check domains with InterPro/CD server/Pfscan and cut proteins into 200 aa pieces with one domain each.

BLAST alternatives
- Smith and Waterman (ssearch): the slowest but more accurate. http://ori.nibb.ac.jp/sit/ssearch.html
- FASTA: slower, good for DNA (originally "fast all"). http://www.ebi.ac.uk/fasta33/
- BLAT: for locating cDNA in a genome; keeps an index of the entire genome in memory. The index consists of all non-overlapping 11-mers except for those heavily involved in repeats. http://genome.ucsc.edu/cgi-bin/hgblat
- FLASH: Fast alignment Algorithm for finding Structural Homology. http://140.109.42.177/flash/
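The BLAT index described above (all non-overlapping 11-mers, minus those heavily involved in repeats) can be sketched in a few lines. This is only an illustration of the data structure, not BLAT's implementation; in particular, the `max_occurrences` cutoff is an assumed, simplified stand-in for how repeat-heavy k-mers are excluded.

```python
from collections import defaultdict


def build_kmer_index(genome, k=11, max_occurrences=None):
    """Map each non-overlapping k-mer to the list of positions where it starts."""
    index = defaultdict(list)
    # Step by k so that consecutive k-mers do not overlap.
    for pos in range(0, len(genome) - k + 1, k):
        index[genome[pos:pos + k]].append(pos)
    if max_occurrences is not None:
        # Drop k-mers that occur too often (assumed proxy for repeat filtering).
        index = {kmer: hits for kmer, hits in index.items() if len(hits) <= max_occurrences}
    return dict(index)


# Small k so the result is easy to read.
print(build_kmer_index("ACGTACGTACGA", k=4))
print(build_kmer_index("ACGTACGTACGA", k=4, max_occurrences=1))
```

Because only every k-th position is indexed, the index is roughly k times smaller than one holding all overlapping k-mers, which is what makes keeping it in memory for an entire genome feasible.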

Task 1
We compared 4 homologs of the papain sequence by structural comparison: kiwi actinidin, human procathepsins L and B, and Staphylococcus aureus staphopain. Run papaya papain through BLAST and PSI-BLAST. Which of the 4 homologs mentioned above are hit by BLAST, and which by PSI-BLAST?

Task 2
How many cytokinin dehydrogenase sequences are in the databases?