BLAST
Basic Local Alignment Search Tool.
Optimized for finding local alignments between two sequences.
An example could be aligning an mRNA sequence to genomic DNA.
Proteins are frequently composed of functional domains that recur in many different proteins.
These domains are the parts most likely to be conserved.

BLAST
Since sequence databases can be very large, searching for the optimal alignment against every sequence can take too long.
BLAST is a heuristic algorithm.
Heuristic algorithms find a match reasonably close to the optimal one in much less time than full dynamic programming.
The alignment can then be verified or refined separately using dynamic programming.
How BLAST works
The query sequence is divided into overlapping subsequences (words) of a given length: word size 3 for proteins, 11 for nucleotides.
The words are used to look for exact or nearly exact matches in the sequence database.
This is fast to do, i.e. computationally inexpensive.
When a match is found, it is extended further.
[Figure: the query KRISTIAN shown six times, once per word of size W=3. Stage labels: Query, Seeding, Search space, Database, Word hits, Alignment, Gapped alignment.]
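The seeding step described above can be sketched in a few lines of Python. This is only a minimal illustration, not BLAST's actual implementation: the function names `words` and `exact_word_hits` are made up for this example, and the subject string RISTISANA is borrowed from the extension example in these slides.

```python
def words(sequence, w):
    """Return all overlapping words of length w, in order of position."""
    return [sequence[i:i + w] for i in range(len(sequence) - w + 1)]


def exact_word_hits(query, subject, w):
    """Return (query_pos, subject_pos) pairs where a word matches exactly.

    The subject is indexed once in a dictionary, so each query word is
    looked up in constant time instead of scanning the whole subject.
    """
    index = {}
    for j, word in enumerate(words(subject, w)):
        index.setdefault(word, []).append(j)
    return [(i, j)
            for i, word in enumerate(words(query, w))
            for j in index.get(word, [])]


print(words("KRISTIAN", 3))
# ['KRI', 'RIS', 'IST', 'STI', 'TIA', 'IAN']
print(exact_word_hits("KRISTIAN", "RISTISANA", 3))
# [(1, 0), (2, 1), (3, 2)]
```

The three hits lie on the same diagonal (query position minus subject position is always 1), which is the kind of signal the extension stage then follows up.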
Threshold in seeding
Word hit
A hit is two matching, identical words, one in the database and one in the query sequence (used in blastn).
A hit is a match within a neighborhood (used in protein-related searches).
The neighborhood of a word contains the word itself and all other words whose score against it is at least T (the threshold) under the scoring matrix.
For example, if T=13, word=PQG, matrix=BLOSUM62, only words scoring at least 13 count as hits: PQG-PEG (15) is accepted, but PQG-PQA (12) is not.
Setting T higher removes more word hits, which makes BLAST run faster but increases the chance of missing an interesting alignment.
Setting W (word size) higher decreases sensitivity (the chance of finding the alignment) but increases the speed of the search.
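The PQG example can be checked with a short Python sketch. To keep it self-contained, only the handful of BLOSUM62 entries needed for these three words is typed in by hand; a real search would use the full matrix. The names `BLOSUM62_SUBSET`, `word_score` and `in_neighborhood` are hypothetical helpers for this illustration.

```python
# Hand-typed subset of BLOSUM62: only the residue pairs used below.
# Assumption: a real implementation would load the complete matrix.
BLOSUM62_SUBSET = {
    ("P", "P"): 7,
    ("Q", "Q"): 5,
    ("G", "G"): 6,
    ("Q", "E"): 2,
    ("G", "A"): 0,
}


def pair_score(a, b):
    """Look up a residue pair; the matrix is symmetric, so try both orders."""
    if (a, b) in BLOSUM62_SUBSET:
        return BLOSUM62_SUBSET[(a, b)]
    return BLOSUM62_SUBSET[(b, a)]


def word_score(word1, word2):
    """Sum the substitution scores position by position."""
    return sum(pair_score(a, b) for a, b in zip(word1, word2))


def in_neighborhood(word1, word2, threshold):
    """A word is a neighbor if its score is at least the threshold T."""
    return word_score(word1, word2) >= threshold


for candidate in ("PQG", "PEG", "PQA"):
    score = word_score("PQG", candidate)
    print(candidate, score, in_neighborhood("PQG", candidate, 13))
# PQG 18 True
# PEG 15 True
# PQA 12 False
```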
Extension
Word hits found during seeding are extended from their ends.
Extension is stopped when the alignment score drops or, in newer implementations, when the alignment score has dropped far enough (the drop-off score) below its previous maximum.
[Figure labels: Alignment, Word hit, Extension]

Extension, example drop-off score
KRISTIAN, gap=0, X=2
-RISTISANA, BLOSUM62
0544541200 <- BLOSUM62 values
059 18 23 21 13 22 21 21 <- Score
00000002 <- Drop-off score
Extension terminates when the drop-off score (how far the current score lies below its previous maximum) reaches X.
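The drop-off rule can be sketched as an ungapped extension to the right of a word hit. This is a simplified illustration under stated assumptions: it uses a plain match/mismatch score (+1/-1) instead of BLOSUM62, it extends in one direction only, and the function name `extend_right` and its parameters are invented for the example.

```python
def extend_right(query, subject, q_start, s_start, x_drop,
                 match=1, mismatch=-1):
    """Extend an ungapped alignment to the right of a word hit.

    q_start and s_start are the first positions after the hit.
    Returns (best_score, best_length): the highest score reached and
    how many positions were extended to reach it.
    Extension stops once the score has fallen x_drop below its maximum.
    """
    score = best_score = 0
    best_length = 0
    i = 0
    while q_start + i < len(query) and s_start + i < len(subject):
        if query[q_start + i] == subject[s_start + i]:
            score += match
        else:
            score += mismatch
        i += 1
        if score > best_score:
            best_score = score
            best_length = i
        elif best_score - score >= x_drop:
            break  # drop-off reached X: give up on this direction
    return best_score, best_length


# One mismatch is tolerated with X=2, so the extension runs through it.
print(extend_right("AAGAAA", "AATAAA", 0, 0, x_drop=2))  # (4, 6)
# With X=1 the same mismatch ends the extension immediately.
print(extend_right("AAGAAA", "AATAAA", 0, 0, x_drop=1))  # (2, 2)
```

A larger X lets the extension cross short runs of mismatches at the cost of more work per hit.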
Evaluation
When the extension stage has produced the alignments, they are evaluated to determine whether they are statistically significant.
Statistical significance is determined using Karlin-Altschul statistics (the E-score).
Some simplifying assumptions are made (such as infinitely long sequences and no gaps), but in practice K-A statistics generalize well.

E-score
The lower the E-score, the more significant the alignment.
The E-score depends on both the database size and the scoring system (substitution matrix, gap penalties). If these are changed, the E-score for a specific alignment also changes.
Karlin-Altschul statistics
E value: $E = K m n e^{-\lambda S}$
E = number of alignments expected to reach score S just by chance
K = minor constant
m = the length of the query sequence
n = the length of the database
e = Euler's number (Napier's constant), approximately 2.718
$\lambda S$ = normalized alignment score (S is the score, $\lambda$ is the normalization factor)
NOTE: When E is very small, it can be interpreted equivalently to a p-value!

Karlin-Altschul, example
What is the chance that, when two equally long (250) amino acid sequences are aligned using the PAM250 matrix, the alignment score reaches 75?
$E = K m n e^{-\lambda S} = 0.1 \cdot 250 \cdot 250 \cdot e^{-(0.229 \cdot 75)} = 0.000217$
http://www.ncbi.nlm.nih.gov/blast/tutorial/altschul-1.html
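The arithmetic of the example is easy to reproduce in Python. The function name `expected_hits` is made up for this sketch; the parameter values (K=0.1, lambda=0.229, m=n=250, S=75) are the ones from the slide. The p-value line uses the standard relation P = 1 - e^(-E), which is why E and P nearly coincide when E is very small.

```python
import math


def expected_hits(K, m, n, lam, S):
    """Karlin-Altschul E value: E = K * m * n * exp(-lambda * S)."""
    return K * m * n * math.exp(-lam * S)


E = expected_hits(K=0.1, m=250, n=250, lam=0.229, S=75)
print(f"E = {E:.6f}")  # E = 0.000217

# Probability of at least one chance alignment with score >= S.
# expm1 keeps precision when E is tiny.
P = -math.expm1(-E)
print(f"P = {P:.6f}")  # P = 0.000217
```

Doubling the database length n doubles E, which illustrates why the same alignment is less significant in a larger database.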
Disadvantages of BLAST
When the expected sequence similarity drops below 80%, nucleotide-nucleotide BLAST no longer performs well.
Many significant homologies are missed because of the initial word size requirement.
If initial words are allowed to be discontinuous, matching is improved.

Discontinuous initial words
For instance, require 11 positions out of 21 consecutive nucleotides to match.
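A discontinuous word can be pictured as a template of "care" (1) and "don't care" (0) positions. The Python sketch below uses a hypothetical 21-position template with 11 required positions; the actual templates used by BLAST implementations are not given in these slides, so the pattern here is purely illustrative, as is the function name `template_match`.

```python
# Hypothetical template: 21 positions, 11 of which ('1') must match.
TEMPLATE = "110100110100110100101"
assert len(TEMPLATE) == 21 and TEMPLATE.count("1") == 11


def template_match(a, b, template=TEMPLATE):
    """True if a and b agree at every position the template marks '1'."""
    return all(x == y
               for x, y, t in zip(a, b, template)
               if t == "1")


a = "ACGTACGTACGTACGTACGTA"
b = a[:2] + "T" + a[3:]   # differs at index 2, a "don't care" position
c = "C" + a[1:]           # differs at index 0, a required position

print(template_match(a, b))  # True: the difference is ignored
print(template_match(a, c))  # False: a required position differs
print(a[:11] == b[:11])      # False: a contiguous 11-mer would miss b
```

The last line shows the point of the slide: a contiguous word of 11 fails on a single mismatch, while the discontinuous word still produces a seed.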
Filtering out repeats
The human genome (like most others) contains large amounts of repetitive DNA (LINE, SINE, Alu, etc.).
If the query sequence contains repeats, many of the homologies identified will be to other sequences containing repeats.
Repeats should in most instances be masked out.
Masked bases are usually represented as N, e.g. AATAGNNNNCGC.

Different varieties of BLAST
DNA query against a database of DNA sequences (blastn).
Protein query against protein sequences (blastp).
DNA query translated in six reading frames against a protein database (blastx).
Megablast, for large and closely related sequences.
Blastn and Megablast
Typically used for identifying your sequence.
Megablast is a fast alternative for finding nearly exact matches.
Blastn is better at finding somewhat diverged sequences (e.g. from a related species).

Blastx and tblastx
Blastx translates the query sequence in all six reading frames and compares it to a protein database.
Aggregate statistics are provided for all reading frames.
Tblastx queries a translated DNA sequence against a database of translated DNA sequences.
It also produces aggregate statistics for all reading frames.
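The six-frame translation that blastx and tblastx rely on can be sketched in plain Python. This is an illustration only: it assumes the standard genetic code, writes stop codons as `*`, drops incomplete trailing codons, and uses function names (`reverse_complement`, `translate`, `six_frames`) chosen for this example.

```python
# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
         "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def reverse_complement(dna):
    """Complement each base, then reverse the strand."""
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(dna):
    """Translate complete codons; unknown codons (e.g. with N) become X."""
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X")
                   for i in range(0, len(dna) - 2, 3))


def six_frames(dna):
    """Return translations for frames +1..+3 and -1..-3."""
    dna = dna.upper()
    rc = reverse_complement(dna)
    frames = {}
    for offset in range(3):
        frames[f"+{offset + 1}"] = translate(dna[offset:])
        frames[f"-{offset + 1}"] = translate(rc[offset:])
    return frames


for name, protein in six_frames("ATGGCCTAA").items():
    print(name, protein)
# +1 MA*
# -1 LGH
# +2 WP
# -2 *A
# +3 GL
# -3 RP
```

Each of the six protein strings is then searched like a blastp query, which is why the slides mention aggregate statistics across frames.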
BLAST programs
Query           Database        Program    Typical uses
DNA             DNA             blastn     Annotation, mapping oligonucleotides to genome
protein         protein         blastp     Identifying common regions in proteins
translated DNA  protein         blastx     Finding protein-coding genes in genomic DNA
protein         translated DNA  tblastn    Identifying transcripts, possibly from multiple organisms
translated DNA  translated DNA  tblastx    Cross-species gene prediction, searching for genes not yet in protein databases
DNA             DNA             megablast  Large and closely related sequences

Specialized BLAST
Choose a type of specialized search (or database name in parentheses):
Make specific primers with Primer-BLAST (finding primers specific to your PCR template, http://www.ncbi.nlm.nih.gov/tools/primerblast/index.cgi?link_loc=blasthome)
Find conserved domains in your sequence (cds)
Find sequences with similar conserved domain architecture (cdart)
Search sequences that have gene expression profiles (GEO)
...
Search immunoglobulins (IgBLAST)
Screen sequence for vector contamination (vecscreen)
Align two (or more) sequences using BLAST (bl2seq)
Search protein or nucleotide targets in PubChem BioAssay
Search SRA transcript and genomic libraries
Constraint Based Protein Multiple Alignment Tool
Needleman-Wunsch Global Sequence Alignment Tool
Search RefSeqGene
Search WGS sequences grouped by organism

Yet more varieties
PSI-Blast (Position Specific Iterated Blast) for very sensitive searches of a protein sequence against a protein database (purpose: finding proteins that belong to the same protein family).
PHI-Blast (Pattern-Hit Initiated Blast): the user-supplied pattern is first located in the query sequence and then searched for in the database...
...how do you choose the BLAST version best suited to your own purpose?!
Help with choosing a program: http://www.ncbi.nlm.nih.gov/blast/producttable.shtml#pstab