Searching Molecular Databases with BLAST

- Basic Local Alignment Search Tool
- How BLAST works
- Interpreting search results
- The NCBI Web BLAST interface
- Demonstration and exercises

Why learn sequence database searching?
- What have I cloned?
- Is this really "my gene"?
- Has someone else already found it?
- What is this protein's function?
- What is it related to?
- Can I get more sequence easily?

Search programs are sequence alignment programs
- They try to find the best alignment between your probe sequence and every target sequence in the database
- Finding optimal alignments is a computationally very resource-intensive process
- It is usually not necessary to find optimal alignments, particularly for large databases
- Alignments are ranked and only top scores are reported

Practical database search methods incorporate shortcuts
- The fastest sequence database searching programs use heuristic algorithms
- The basic concept is to break the search and alignment process down into several steps
- At each step, only a best-scoring subset is retained for further analysis

What does "HEURISTIC" mean?
- Heuristic programs find approximate alignments "using a problem solving technique in which the most appropriate solution of several found by alternative methods is selected at successive stages of a program for use in the next step of the program"
- Why consider every possible alignment once a reasonably good alignment is found?
- They are less sensitive than "dynamic programming" algorithms such as Smith-Waterman for detecting weak similarity
- In practice, they run much faster and are usually adequate
- The BLAST program developed by Stephen Altschul and coworkers at the NCBI is the most widely used heuristic program
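The stepwise idea described above (find candidate regions quickly, then refine only the best ones) can be illustrated with a toy seed-and-extend search. This is a minimal sketch, not BLAST's actual implementation: the function names, the +1/-1 scoring, the word size of 4 and the drop-off value are all assumptions chosen for illustration, and the extension is ungapped.

```python
from collections import defaultdict


def build_word_index(target, k):
    """Map every length-k word in the target to its start positions."""
    index = defaultdict(list)
    for pos in range(len(target) - k + 1):
        index[target[pos:pos + k]].append(pos)
    return index


def extend_seed(query, target, q, t, k, match=1, mismatch=-1, drop=3):
    """Extend an exact word match without gaps, right and then left.

    Extension in a direction stops once the running score has fallen
    `drop` below the best score seen so far.
    Returns (score, query_start, query_end, target_start).
    """
    score = best = k * match
    q_start, q_end = q, q + k
    i, j = q + k, t + k
    while i < len(query) and j < len(target) and best - score < drop:
        score += match if query[i] == target[j] else mismatch
        i, j = i + 1, j + 1
        if score > best:
            best, q_end = score, i
    score = best
    i, j = q - 1, t - 1
    while i >= 0 and j >= 0 and best - score < drop:
        score += match if query[i] == target[j] else mismatch
        if score > best:
            best, q_start = score, i
        i, j = i - 1, j - 1
    return best, q_start, q_end, t - (q - q_start)


def heuristic_search(query, target, k=4, top_n=3):
    """Step 1: look up exact words. Step 2: extend. Step 3: keep the top hits."""
    index = build_word_index(target, k)
    hits = set()
    for q in range(len(query) - k + 1):
        for t in index.get(query[q:q + k], []):
            hits.add(extend_seed(query, target, q, t, k))
    return sorted(hits, reverse=True)[:top_n]


hits = heuristic_search("ACGTTGCA", "TTTACGTTGCATTT")
print(hits)
```

Only positions that share an exact word with the query are ever examined, which is where the speed comes from. It is also why a weak similarity with no shared word can be missed.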

BLAST is a collection of five programs for different combinations of query and database sequences

Program   Probe            Database
blastn    DNA              DNA
blastp    protein          protein
blastx    translated DNA   protein
tblastn   protein          translated DNA
tblastx   translated DNA   translated DNA

BLAST features
- Very fast and can be used to search extremely large databases
- Sufficiently sensitive and selective for most purposes
- Robust: the default parameters can usually be used

Scores are reported in various ways
- Raw values based on the specific scoring matrix employed
- As bits, which are matrix-independent normalized values
- Significance as represented by E values

[Figure: Typical BLAST Output]

The EXPECT (E) threshold is used to control score reporting
- A match will only be reported if its E value falls below the threshold set
- The default value for E is 10, which means that 10 matches with scores this high are expected to be found by chance in a search of this size
- Lower EXPECT thresholds are more stringent and report fewer matches

Probabilities reported are summations of the probabilities of multiple HSPs (High-scoring Segment Pairs)
- For HSPs to be included in a sum statistic or gapped alignment they must exhibit consistency:
  - Same orientation
  - Consistent order
  - Don't overlap
- Repeated motifs will result in multiple, independent alignments between query and subject sequences
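The relationship between raw scores, bit scores and E values can be sketched with the standard Karlin-Altschul relations: bit score S' = (lambda * S - ln K) / ln 2, and E = m * n * 2^(-S'), where m and n are the query and database lengths. The function names below are assumptions, and the numbers passed in are placeholders; real values of lambda and K depend on the scoring matrix and gap penalties, and BLAST uses corrected "effective" lengths.

```python
import math


def bit_score(raw_score, lam, k):
    """Convert a raw alignment score to a matrix-independent bit score."""
    return (lam * raw_score - math.log(k)) / math.log(2)


def expect_value(bits, query_len, db_len):
    """Number of alignments with at least this bit score expected by chance."""
    return query_len * db_len * 2.0 ** (-bits)


def is_reported(e_value, threshold=10.0):
    """A match is reported only if its E value falls below the threshold."""
    return e_value < threshold


# Placeholder search: a 100-residue query against a 1,000-residue database.
e = expect_value(20, 100, 1000)
print(e, is_reported(e))
```

Because E scales with m * n, the same bit score is less significant in a larger database, which is one reason scores have to be read in context.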

Interpreting scores
- Score interpretation is based on context
  - What is the question?
  - What else do you know about the sequences?
- Scoring is highly dependent on probe length
  - Exact matches will usually have the highest scores (and lowest E values)
  - Short exact matches may score lower than longer partial matches
  - Short exact matches are expected to occur at random
  - Partial matches over the entire length of a query are stronger evidence for homology than are short exact matches
- Read the sequence descriptions!

Homology vs Identity
- Homologous sequences are derived from a common ancestral sequence
- Homology is either true or false. It can never be partial!
- Saying two sequences are 45% homologous is a misuse of the term
- Sequence identity and similarity can be described as a percentage and are used as evidence of homology

BLAST Example
Is this sequence known? What does it encode?

>clone 14b
cgcatgcgcaggcgacagctcatggcgttcagggcctgacggttgctagggtgacagggacacaacatggcg
gcgggatctctaacgctctccttcgagggaccaccacggagatcctagtgcgggaccccgcctcagggaagt
ggaaagcagggggacaaccttcctgcttccttcttttccgtccagtgtcggcaaggggttgtcaccggcttc
cgcatccaagatgaagaactataaagcaattggcaaaataggagagggaacgttttctgaagttatgaagat
gcaaagcctgagagatggaaactactatgcatgtaaacaaatgaagcagcgctttgaaagtattgagcaagt
caacaacctacgagagatccaagcactgaggcgcctgaatccgcacccaaacattcttatgttgcatgaagt
ggtttttgacagaaaatctggttctcttgcactaatatgtgaacttatggacatgaatatttatgagctaat
acgagggagaagatacccattatcagaaaaaaaaattatgcactatatgtaccagttatgtaagtccctgga
tcatattcacagaaatggaatatttcacagagatgtaaaaccagaaaatatactaataaagcaggatgtcct
gaaattaggggactttggctcctgccggagtgtctattccaagcagccgtacacggaatacatctccacccg
ctggtaccgggccccggagtgtctcctcactgatgggttctacacgtacaagatggacctgtggagcgccgg
ctgtgtgttctacgagatcgccagtctgcagcccctctttcctggagtaaatgaactggaccaaatctcaaa
aatccacgatgtcatcggcacacccgctcagaagatcctcaccaagttcaaacagtcgagagctatgaattt
tgattttccttttaaaaagggatcaggaatacctctactaacaaccaatttgtccccacaatgcctctccct
cctgcacgcaatggtggcctatgatcccgatgagagaatcgccgcccaccaggccctgcagcacccctactt
ccaagaacagaggaaaacagagaagcgggctctgggcagccacagaaaagctggctttccggagcaccctgt
ggcaccggaaccactcagtaacagctgccagatttccaaggagggcagaaagcagaaacagtccctaaagca
agaggaggaccgtcccaagagacgaggaccggcctatgtcatggaactgcccaaactaaagctttcgggagt
ggtcagactgtcgtcttactccagccccacgctgcagtccgtgcttggatctggaacaaatggaagagtgcc
ggtgctgagacccttgaagtgcatccctgcgagcaagaagacagatccgcagaaggaccttaagcctgcccc
gcagcagtgtcgcctgcccaccatagtgcggaaaggcggaagataactgagcagcaccgtcgtctcgacttc
ggaggcaacaccaagcccgaccgggccaggcctgggtgatctgctgctgagacgccacggagggctggggat
gcgcctgcgtccgtttcgcgctggccggggctctgggtgctgccctgcgccctgccgcacccgcggcccgcg
cagctgcctaggatgttctgggctaatatacttgtaaaaccaccgcattctagggttttctttcattttcgt
taagaatttggggcaggaaatactttgtaactttgtatatgaatcaaaacaaacgagcaggcatttctgtga
tgtgttgggcgtggttggaaggtgggttctgcgtgtcccttcccagcgctgctggtcagtcgtggagcgcca
tcatgtcttaccagtgacgctgctgacacccctgacttttattaaagaataagctgtcgttaaaaaaaaaaa
aaaaaaaaaa

Search Strategy
- BLAST program = blastn (nucleotide query vs. nucleotide db)
- Database = nr (non-redundant)
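The point made under "Interpreting scores", that short exact matches are expected to occur at random, can be made concrete with a rough estimate. The sketch below assumes equal base frequencies, independent positions and a single strand, none of which holds exactly for real genomes; the function name and the example database size of one billion nucleotides are assumptions for illustration.

```python
def expected_random_hits(word_len, db_len, alphabet_size=4):
    """Expected number of exact occurrences of one specific word
    in a random sequence of length db_len."""
    positions = max(db_len - word_len + 1, 0)
    return positions * (1.0 / alphabet_size) ** word_len


# Compare probe lengths against a hypothetical database of 10**9 nucleotides.
for word_len in (10, 16, 30):
    print(word_len, expected_random_hits(word_len, 10 ** 9))
```

Under these assumptions a 10-base exact match is expected hundreds of times by chance in a database of that size, while a 30-base exact match is not expected at all. This is why probe length has to be considered when reading scores.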

Search Summary

Graphical View of BLAST Results

[Figure: results page with links to the GenBank file, the alignment, UniGene, and Gene Expression Omnibus]

Homologs = Shared Evolutionary Ancestry = Conserved Function
- Orthologs are homologs that perform the same function in different species
  Example: the corresponding globin in mouse and in human
- Paralogs are homologs that are diverged members of a family
  Example: a human globin and human myoglobin

Statistical significance of scores
- Orthologs will have extremely significant scores (DNA 10^-100, protein 10^-30)
- Closely related paralogs will have significant scores (protein 10^-15)
- Distantly related homologs may be hard to identify (protein 10^-4)

Basic BLAST form
- Choice of program
- Choice of database
- Filters on or off
- Sequence input
  - Paste in as text or FASTA format
  - Read in using gi or accession number
- Output format options

BLASTP Example

>Unknown protein
MWVTKLLPALLLQHVLLHLLLLPIAIPYAEGQRKRRNTIHEFKKSAKTTL
IKIDPALKIKTKKVNTADQCANRCTRNKGLPFTCKAFVFDKARKQCLWFP
FNSMSSGVKKEFGHEFDLYENKDYIRNCIIGKGRSYKGTVSITKSGIKCQ
PWSSMIPHEHSFLPSSYRGKDLQENYCRNPRGEEGGPWCFTSNPEVRYEV
CDIPQCSEVECMTCNGESYRGLMDHTESGKICQRWDHQTPHRHKFLPERY
PDKGFDDNYCRNPDGQPRPWCYTLDPHTRWEYCAIKTCADNTMNDTDVPL
ETTECIQGQGEGYRGTVNTIWNGIPCQRWDSQYPHEHDMTPENFKCKDLR
ENYCRNPDGSESPWCFTTDPNIRVGYCSQIPNCDMSHGQDCYRGNGKNYM
GNLSQTRSGLTCSMWDKNMEDLHRHIFWEPDASKLNENYCRNPDDDAHGP
WCYTGNPLIPWDYCPISRCEGDTTPTIVNLDHPVISCAKTKQLRVVNGIP
TRTNIGWMVSLRYRNKHICGGSLIKESWVLTARQCFPSRDLKDYEAWLGI
HDVHGRGDEKCKQVLNVSQLVYGPEGSDLVLMKLARPAVLDDFVSTIDLP
NYGCTIPEKTSCSVYGWGYTGLINYDGLLRVAHLYIMGNEKCSQHHRGKV
TLNESEICAGAEKIGSGPCEGDYGGPLVCEQHKMRMVLGVIVPGRGCAIP
NRPGIFVRVAYYAKWIHKIILTYKVPQS

BLASTP databases
- nr: all non-redundant GenBank CDS translations+pdb+swissprot+pir
- swissprot: the last major release of the SWISS-PROT protein sequence database
- pat: patented sequences
- pdb: sequences derived from the 3-dimensional structure Protein Data Bank
- month: all new or revised GenBank CDS translation+pdb+swissprot+pir released in the last 30 days

BLAST can be slow during peak hours (9-5 EST)

[Figure callouts: Conserved Domains, Request ID]

Protein Scoring Matrices
- BLOSUM62 is the default BLASTP scoring matrix
- Different matrices produce slightly different alignments

BLOSUM62
Query: 80  EDFKFGKILGEGSFSTVVLARELATSREYAIKILEKRHIIKENKVPYVTRERDVMSRLDH 139
           +DFKFG ++G+G++STV+LA  + T + YA K+L K ++I++ KV YV+ E+  + +L++
Sbjct: 177 KDFKFGSVIGDGAYSTVMLATSIDTKKRYAAKVLNKEYLIRQKKVKYVSIEKTALQKLNN 236

PAM30
Query: 81  DFKFGKILGEGSFSTVVLARELATS-----REYAIKILEKRHIIKENKVPYVTRERDVMS 135
           DFKFG ++G+G++STV+    LATS     R YA K+L K ++I++ KV YV+ E+  + 
Sbjct: 178 DFKFGSVIGDGAYSTVM----LATSIDTKKR-YAAKVLNKEYLIRQKKVKYVSIEKTALQ 232

DNA Databases
- nr: non-redundant GenBank + EMBL + DDBJ + PDB sequences
- month: all new or revised nr
- dbest: GenBank+EMBL+DDBJ EST Divisions
- dbsts: GenBank+EMBL+DDBJ STS Divisions
- htgs: High Throughput Genomic Sequences
- Others: bacterial and yeast genomes

Abbreviations
- EST = expressed sequence tag
- GSS = Genome Survey Sequence
- HTGS = High Throughput Genome Sequence
- PAT = patented sequences
- PDB = sequences with known structures

Sequence filters
Low Complexity Sequences can be Filtered Out
- Since only a limited number of matches are reported, hits to simple repeats and other low complexity sequences can obscure other, more biologically meaningful similarities
- Filters are used to remove low complexity sequences from the probe
- Low complexity, human repeats (blastn)

Query: 1681 gatagttacagtggcgcccaaggcgatgaacagctggaacaaaatatgttccaattaacg 1740
Sbjct: 1852 gatagttacagtggcgcccaaggcgatgaacagctggaacaaaatatgttccaattaacg 1911

Query: 1741 ctggatacgtccacgattctgcaaagaagnnnnnnngttcaagaaaatgacgtagggcct 1800
Sbjct: 1912 ctggatacgtccacgattctgcaaagaagaaaaaaagttcaagaaaatgacgtagggcct 1971

Query: 1801 acaattccaataagcgccactatcagggaatag 1833
Sbjct: 1972 acaattccaataagcgccactatcagggaatag 2004
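The masking shown above, where a run of a's in the query is replaced by n's, can be imitated with a simple sliding-window entropy filter. This is only a sketch of the general idea and is not the filter BLAST applies: the window size of 6, the threshold of 1.0 bit and the function names are assumptions chosen for illustration.

```python
import math
from collections import Counter


def shannon_entropy(window):
    """Shannon entropy (in bits) of the letter composition of a window."""
    counts = Counter(window)
    total = len(window)
    return -sum(c / total * math.log2(c / total) for c in counts.values())


def mask_low_complexity(seq, window=6, threshold=1.0, mask_char="n"):
    """Replace every position covered by a low-entropy window with mask_char."""
    masked = list(seq)
    for start in range(len(seq) - window + 1):
        if shannon_entropy(seq[start:start + window]) < threshold:
            for pos in range(start, start + window):
                masked[pos] = mask_char
    return "".join(masked)


# Hypothetical probe: mixed sequence flanking a homopolymer run.
probe = "acgtacgt" + "a" * 10 + "acgtacgt"
masked = mask_low_complexity(probe)
print(masked)
```

Masked positions cannot seed or extend a match, so hits that depend only on the simple repeat drop out of the report while matches to the flanking sequence are still found.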

Output Options
Pairwise Output is the Default

Query: 1681 gatagttacagtggcgcccaaggcgatgaacagctggaacaaaatatgttccaattaacg 1740
Sbjct: 1852 gatagttacagtggcgcccaaggcgatgaacagctggaacaaaatatgttccaattaacg 1911

Query: 1741 ctggatacgtccacgattctgcaaagaagnnnnnnngttcaagaaaatgacgtagggcct 1800
Sbjct: 1912 ctggatacgtccacgattctgcaaagaagaaaaaaagttcaagaaaatgacgtagggcct 1971

Query: 1801 acaattccaataagcgccactatcagggaatag 1833
Sbjct: 1972 acaattccaataagcgccactatcagggaatag 2004

[Figure: Query Anchored without Identities]

BLASTN vs BLASTP
- Protein sequences have much higher information content than nucleotide sequences
- To find evidence for sequence homology, use BLASTP and search protein sequences
- Is my sequence already in the database? To find identical sequences, search nucleotide databases

Translated BLAST Searches
- Translations use all 6 frames
- Computationally intensive
- tblastx searches are not allowed for some large databases

Alternate Genetic Codes
- Must specify genetic code

Translated BLAST Searches

>clone 14b
cctccccacccatttcaccaccaccatgacaccgggcacccagtctcctttcttcctgctgctgctcctcacagtgctta
cagttgttacaggttctggtcatgcaagctctaccccaggtggagaaaaggagacttcggctacccagagaagttcagtg
cccagctctactgagaagaatgctttgtctactggggtctctttctttttcctgtcttttcacatttcaaacctccagtt

>Frame 1
PPHPFHHHHDTGHPVSFLPAAAPHSAYSCYRFWSCKLYPRWRKGDFGYPEKFSAQLY*EECFVYWGLFLFPVFSHFKPPV
>Frame 2
LPTHFTTTMTPGTQSPFFLLLLLTVLTVVTGSGHASSTPGGEKETSATQRSSVPSSTEKNALSTGVSFFFLSFHISNLQ
>Frame 3
SPPISPPP*HRAPSLLSSCCCSSQCLQLLQVLVMQALPQVEKRRLRLPREVQCPALLRRMLCLLGSLSFSCLFTFQTSS
>Frame -1
NWRFEM*KDRKKKETPVDKAFFSVELGTELLWVAEVSFSPPGVELA*PEPVTTVSTVRSSSRKKGDWVPGVMVVVKWVGR
>Frame -2
TGGLKCEKTGKRKRPQ*TKHSSQ*SWALNFSG*PKSPFLHLG*SLHDQNL*QL*AL*GAAAGRKETGCPVSWWW*NGWG
>Frame -3
LEV*NVKRQEKERDPSRQSILLSRAGH*TSLGSRSLLFSTWGRACMTRTCNNCKHCEEQQQEERRLGARCHGGGEMGGE

Taxonomy Reports

More BLAST Options

BLAST from ORF Finder
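The six-frame translation listed above can be reproduced with a short script. This is a minimal sketch that assumes the standard genetic code (other codes would need a different table) and numbers the reverse frames from the start of the reverse complement; the function names are illustrative, and any codon containing a letter other than A, C, G or T is translated as X.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the reverse complement of a DNA string (upper case)."""
    return dna.upper().translate(COMPLEMENT)[::-1]


def translate(dna):
    """Translate complete codons from the first base; '*' marks a stop."""
    dna = dna.upper()
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def six_frame_translation(dna):
    """Translate the three forward and three reverse-complement frames."""
    rev = reverse_complement(dna)
    frames = {}
    for offset in range(3):
        frames[f"+{offset + 1}"] = translate(dna[offset:])
        frames[f"-{offset + 1}"] = translate(rev[offset:])
    return frames


# First 18 bases of the clone shown above.
print(translate("cctccccacccatttcac"))  # PPHPFH, the start of Frame 1
for name, protein in six_frame_translation("cctccccacccatttcac").items():
    print(name, protein)
```

Translated searches effectively run a protein comparison on each of these six products, which is why they cost considerably more computation than a plain blastn or blastp search.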

BLAST Tutorial
- BLAST tutorial on Biocomp Web page
- Goal: demonstrate the utility of, and the difference between, BLASTN and BLASTP searches
- BLASTN: is my DNA sequence in the database?
- BLASTP: are there related (homologous) proteins in the database?