
BLAST
About 100 times faster than dynamic programming. Good for database searches.
Derive a list of words of length w from the query (e.g., w = 3 for protein, 11 for DNA).
High-scoring words are compared with database sequences.
Sequences with many matches to high-scoring words are used for final alignments.
Protein-based searches are always more powerful than nucleotide-based searches of coding DNA in determining similarity and inferring homology.
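The word-list step can be sketched in a few lines of Python. This is only an illustration of "derive a list of words of length w from the query"; the function name `query_words` and the example peptide are made up for this note and are not part of BLAST itself.

```python
def query_words(query: str, w: int) -> list[str]:
    """Return every overlapping word of length w in the query, in order."""
    if w <= 0:
        raise ValueError("w must be positive")
    # A query of length L yields L - w + 1 words (none if L < w).
    return [query[i:i + w] for i in range(len(query) - w + 1)]


# Hypothetical 8-residue peptide, word length 3 as suggested for protein.
print(query_words("MKLPQGTA", 3))
```

With w = 3 a protein query of length L yields L - 3 + 1 overlapping words; a DNA query with w = 11 is handled the same way.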

BLAST (Basic Local Alignment Search Tool)
Example word score: P = 7 + Q = 5 + G = 6 (e.g., the word PQG scores 18 against itself under BLOSUM62).
In addition to the exact word, BLAST considers related words based on BLOSUM62: the neighborhood.
Once a word is aligned, gapped and ungapped extensions are initiated, tallying the cumulative score.
When the score drops more than X below the best score so far, the extension is terminated.
The extension is trimmed back to the maximum.
HSP = high-scoring segment pair.
Produces local alignments.
X = significance decay
S = minimum score to return a BLAST hit
T = neighborhood score threshold
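A minimal sketch of the ungapped extension with the X drop-off rule described above. To stay self-contained it uses a hypothetical DNA scoring of +1 per match and -3 per mismatch instead of BLOSUM62, and it ignores gapped extension; the function names and the example sequences are assumptions made for illustration.

```python
def xdrop_extend(q: str, s: str, x_drop: int, match: int = 1, mismatch: int = -3):
    """Extend pairwise from position 0 of q and s without gaps.

    Stops once the running score falls more than x_drop below the best
    score seen, and returns (best_score, length_at_best_score), i.e. the
    extension trimmed back to its maximum.
    """
    score = best = best_len = 0
    for i, (a, b) in enumerate(zip(q, s), start=1):
        score += match if a == b else mismatch
        if score > best:
            best, best_len = score, i
        elif best - score > x_drop:
            break
    return best, best_len


def extend_seed(query: str, subject: str, qpos: int, spos: int, w: int, x_drop: int):
    """Extend a word hit of length w in both directions.

    Returns (hsp_score, query_start, query_end) with query_end exclusive.
    """
    seed = sum(1 if a == b else -3
               for a, b in zip(query[qpos:qpos + w], subject[spos:spos + w]))
    # Rightward extension starts just after the seed.
    r_score, r_len = xdrop_extend(query[qpos + w:], subject[spos + w:], x_drop)
    # Leftward extension walks the reversed prefixes.
    l_score, l_len = xdrop_extend(query[:qpos][::-1], subject[:spos][::-1], x_drop)
    return seed + l_score + r_score, qpos - l_len, qpos + w + r_len


# Hypothetical sequences sharing the core ACGTACGT; the seed is GTAC at position 4.
score, start, end = extend_seed("GGACGTACGTCC", "TTACGTACGTAA", 4, 4, 4, x_drop=2)
print(score, start, end)
```

The returned interval is the HSP on the query; a hit would be reported only if its score reaches the minimum score S.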

BLAST home page http://blast.ncbi.nlm.nih.gov/blast.cgi

BLASTP

BLAST databases
Peptide sequence databases
  nr: non-redundant GenBank CDS translations + pdb + swissprot + pir + prf
  RefSeq_protein: reference proteins
  Swissprot: SWISS-PROT protein sequence database
  pdb: sequences derived from 3-dimensional structures
Nucleotide sequence databases
  nr: GenBank + EMBL + DDBJ + PDB (no EST, STS, GSS, WGS, or PAT)
  est: expressed sequence tags. 34 billion seq.!
  htgs: unfinished high-throughput genomic sequences: phases 0, 1 and 2
  gss: genome survey sequences
  wgs: whole genome shotgun sequences. 148 billion sequences

BLAST advanced options
  -G  Cost to open a gap [Integer]; default = 11 (10, 10, 8, 9)
  -E  Cost to extend a gap [Integer]; default = 1 (1, 2, 2, 2)
  -e  Expectation value (E) [Real]; default = 10.00
  -W  Word size; default is 11 for blastn, 3 for other programs
  -b  Number of alignments to show (B) [Integer]; default = 100

Default settings and special cases
                  Default     Short query       Large sequence family   Ungapped BLAST
  Filter          on          off               on                      on
  Scoring matrix  BLOSUM62    PAM30-35          BLOSUM62                BLOSUM62
  Word size       3           3-2, 7 for DNA    3, 11 for DNA           3, 11 for DNA
  E value         10          1000 or more      10                      10
  Gap costs       11, 1       9, 1              11, 1                   4
  Alignments      50          50                2000                    50
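To make the gap-cost settings concrete, here is a small sketch of an affine gap cost. It assumes the common convention that a gap of length k costs (open cost) + k × (extend cost), so a single-residue gap with the defaults 11 and 1 costs 12; the function name `gap_cost` is hypothetical.

```python
def gap_cost(length: int, gap_open: int = 11, gap_extend: int = 1) -> int:
    """Affine gap cost, assuming cost = gap_open + gap_extend * length."""
    if length < 0:
        raise ValueError("gap length cannot be negative")
    if length == 0:
        return 0  # no gap, no cost
    return gap_open + gap_extend * length


# Default costs (11, 1) versus the short-query setting (9, 1) for a 3-residue gap.
print(gap_cost(3), gap_cost(3, gap_open=9, gap_extend=1))
```

Lowering the opening cost, as in the short-query column, makes the first gapped position cheaper while each additional position costs the same.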

Report by species
Database: All nr GenBank CDS translations + pdb + swissprot + pir + prf; 2,794,673 sequences; 957,836,323 total letters
Taxonomy reports
Query = Apetala1 P35631 (255 letters)
+ indicates a conservative amino acid substitution
- indicates a gap/insertion
XXXX shows areas of low complexity
CONSIDER TAXONOMIC RELATIONSHIP WHEN INTERPRETING SIMILARITY VALUES!

Format BLAST output
All sequences passing the E-value threshold are aligned beneath the query.
In the "with identities" view, identical residues are shown as dots.
Views shown: Flat Query-Anchored; Query-Anchored with identities

Statistical significance
Chance alignments have no biological significance.
Statistical significance implies a low probability of generating the alignment by chance.
The probability of long chance alignments increases with longer sequences.
The extreme-value distribution
  Used to calculate the probability of a chance alignment.
  Generated by calculating the scores that result from repeatedly scrambling one of the sequences being compared.
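The scrambling procedure can be sketched as follows. This is an illustration only: it uses a simple Smith-Waterman local score with hypothetical match, mismatch, and linear gap values rather than BLAST's own scoring, and it reports an empirical fraction of shuffles instead of fitting an extreme-value distribution. All function names are made up for this note.

```python
import random


def local_score(a: str, b: str, match: int = 2, mismatch: int = -1, gap: int = -2) -> int:
    """Best Smith-Waterman local alignment score with a linear gap penalty."""
    prev = [0] * (len(b) + 1)
    best = 0
    for x in a:
        cur = [0]
        for j, y in enumerate(b, start=1):
            diag = prev[j - 1] + (match if x == y else mismatch)
            cur.append(max(0, diag, prev[j] + gap, cur[j - 1] + gap))
            best = max(best, cur[j])
        prev = cur
    return best


def shuffled_scores(a: str, b: str, n: int = 200, seed: int = 0) -> list[int]:
    """Scores of a against n random shuffles of b (the chance distribution)."""
    rng = random.Random(seed)
    letters = list(b)
    scores = []
    for _ in range(n):
        rng.shuffle(letters)
        scores.append(local_score(a, "".join(letters)))
    return scores


def empirical_p(a: str, b: str, n: int = 200, seed: int = 0):
    """Return the real score and the fraction of shuffles scoring at least as high."""
    real = local_score(a, b)
    null = shuffled_scores(a, b, n, seed)
    return real, sum(s >= real for s in null) / n


# Hypothetical sequences; a small fraction suggests the real score is unlikely by chance.
print(empirical_p("ACGTTGCAAGCTTACG", "TTACGTTGCAAGCTAA"))
```

Shuffling preserves the composition and length of the sequence, so the shuffled scores estimate what unrelated sequences of the same makeup would achieve.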

BLAST statistics
S' (bit score): calculated from the raw score S (sum of BLOSUM62 scores) by normalizing with the statistical variables that define a scoring system (K and λ). Bit scores from different alignments, even ones employing different scoring matrices, can be compared.
  $S' = (\lambda S - \ln K) / \ln 2$
  K = minor constant
  λ = constant to adjust for the scoring matrix
  S = raw score of the high-scoring segment pair (HSP)
E (expect) value: number of alignments with scores equivalent to or better than S' that are expected to occur in a database search by chance.
  $E = m n \, 2^{-S'}$
  m = query size
  n = database size
  S' = bit score
  m × n = search space
The E-value decreases exponentially as the score assigned to a match between two sequences increases.
The E-value depends on the size of the database and the scoring system in use.
When the E-value threshold is increased from the default value of 10, more hits can be reported. When it is reduced, only the more significant hits are reported.
The lower the E-value (or the higher the bit score), the more significant the hit.
The product mn defines the search space: the same HSP may come out statistically significant in a small database and not significant in a large database.
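The two formulas on this slide, bit score S' = (λS - ln K)/ln 2 and E = mn·2^(-S'), can be tried out with a short sketch. The values λ = 0.267 and K = 0.041 below are illustrative assumptions, not taken from the slides; the query length (255) and database size (957,836,323 letters) are borrowed from the Apetala1 report shown earlier.

```python
import math


def bit_score(raw_score: float, lam: float, k: float) -> float:
    """Normalize a raw score S to a bit score S' = (lam*S - ln K) / ln 2."""
    return (lam * raw_score - math.log(k)) / math.log(2)


def e_value(bits: float, m: int, n: int) -> float:
    """Expected number of chance hits: E = m * n * 2^(-S')."""
    return m * n * 2.0 ** (-bits)


lam, k = 0.267, 0.041          # assumed, illustrative statistical parameters
m, n = 255, 957_836_323        # query length and database letters from the report slide

for raw in (50, 100, 200):     # hypothetical raw scores
    bits = bit_score(raw, lam, k)
    print(raw, round(bits, 1), e_value(bits, m, n))
```

Each additional bit halves the E-value, and the same bit score gives a proportionally larger E-value when the search space mn grows.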

P values
P: probability of finding at least one HSP with bit score S' or higher by chance.
Since it can be shown that the number of random HSPs with score S' is described by a Poisson distribution, the probability of finding at least one HSP with bit score S' is
  $P = 1 - e^{-E}$, where E = expect value
  E         P
  10        0.99995
  1         0.63
  0.1       0.095
  0.01      0.01
  0.001     0.001
  0.0001    0.0001
P-values vary from 0 to 1, whereas E-values can be much greater than 1.
The BLAST programs report E-values rather than P-values because E-values of, for example, 5 and 10 are much easier to comprehend than P-values of 0.993 and 0.99995.
However, for E < 0.01, the P-value and E-value are nearly identical.
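The conversion P = 1 - e^(-E) is easy to check numerically. The sketch below uses `math.expm1` for accuracy at very small E; the function name `p_from_e` is made up for this note.

```python
import math


def p_from_e(e: float) -> float:
    """P = 1 - exp(-E), computed as -expm1(-E) to stay accurate for tiny E."""
    return -math.expm1(-e)


for e in (10, 5, 1, 0.1, 0.01, 0.001, 0.0001):
    print(e, p_from_e(e))
```

For small E the exponential is close to 1 - E, which is why P and E become nearly identical below about 0.01.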

BLAST tips
Suggested BLAST cutoffs:
  DNA: book suggests E values < 1e-6 (I use E < 1e-10)
  Protein: book suggests E values < 1e-3
Consider evolutionary divergence in your results!
  DNA mutation rate without selection = 5.5 × 10^-9 per site per year.
  So 10 million years (10^7) of divergence = 5.5 × 10^-9 × 10^7 = 5.5 × 10^-2 ≈ 0.05 changes per site, i.e. ~95% identity.
BLAST search artifacts:
  Repeated amino acid stretches (e.g. polyglutamine) or nucleotide repeats (e.g. ATATATATATATAT) result in meaningless positives with significant E values.
  Use BLAST filters to mask low-complexity regions: programs SEG for proteins and DUST for DNA.
  Or customize masking using the lower-case letter option.
  RepeatMasker can be used to mask repeats in lower-case letters: http://www.repeatmasker.org/cgi-bin/webrepeatmasker
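The divergence arithmetic can be written out as a tiny sketch. It assumes the simple linear approximation used on the slide (identity ≈ 1 - rate × years, ignoring repeated changes at the same site), so it is only reasonable for small divergences; the function name is hypothetical.

```python
def expected_identity(rate_per_site_per_year: float, years: float) -> float:
    """Approximate fraction of unchanged sites: 1 - rate * years (linear approximation)."""
    changed = rate_per_site_per_year * years
    if not 0.0 <= changed <= 1.0:
        raise ValueError("linear approximation only makes sense for 0 <= rate*years <= 1")
    return 1.0 - changed


# Slide values: 5.5e-9 changes per site per year over 10 million years.
print(expected_identity(5.5e-9, 1e7))
```

Comparing an observed percent identity against this kind of estimate helps decide whether a hit is plausible for the taxa involved.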

MEGABLAST
Variation of BLASTN, 10 times faster.
Optimized for long or highly similar (>95%) sequences.
Ideal for finding whether a large sequence is part of a large contig or chromosome, finding sequencing errors, and comparing large similar sequences.
Uses a longer default word length (28 instead of 11).
Faster non-affine gap penalty: gap opening penalty = 0, gap extension penalty E = r/2 - q (r = match reward, q = mismatch penalty).
Non-affine gapping tends to yield more gaps of shorter length.
Accepts multiple consecutive FASTA files as input.

Discontiguous MEGABLAST
Ideal for comparing divergent sequences from different organisms (<80% identity).
Uses a discontiguous word approach, different from other BLAST programs.
Nonconsecutive positions are examined over longer segments.

PSI-BLAST (Position-Specific Iterative BLAST)
Designed to detect weak relationships.
The added sensitivity comes from the use of a profile that is constructed (automatically) from a multiple alignment.
The profile is generated by calculating a Position-Specific Scoring Matrix (PSSM) for every position in the alignment.
PSSMs are also called profiles (profile Hidden Markov Models are a related representation).
PSSMs are numerical representations of a multiple alignment.
A highly conserved position receives a high score.
The profile is used to perform additional searches (iterations), and the results of each iteration are used to refine the profile.
Each iteration uses a PSSM built from the previous iteration.
Continue searching iteratively until no new matches are identified: "convergence".

PSI-BLAST steps
  BLASTP
  Multiple alignment
  Construct PSSM
  Use PSSM to search

Construction of a PSSM
Each column in the alignment is a row in the PSSM.
Count the frequency of occurrence of each residue at each position.
Calculate the probability of each amino acid at each position.
Example (from the figure): T at position 8 is conserved = highest score, 150. P at position 9 is less conserved = score 89. Note the low scores of the aromatic residues F, Y, W relative to A in the P row.
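A simplified sketch of PSSM construction from a multiple alignment follows. It is not PSI-BLAST's actual procedure (which weights sequences and uses substitution-matrix-based pseudocounts): here each column simply gets a log-odds score from residue frequencies with a flat pseudocount and a uniform background of 1/20, both of which are assumptions. The alignment and function names are hypothetical.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_pssm(alignment: list[str], pseudocount: float = 1.0) -> list[dict[str, float]]:
    """One PSSM row per alignment column: log2(frequency / background) per residue."""
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("all aligned sequences must have the same length")
    background = 1.0 / len(AMINO_ACIDS)   # assumed uniform background
    pssm = []
    for col in range(length):
        # Gap characters and unknown residues are ignored in the counts.
        counts = Counter(seq[col] for seq in alignment if seq[col] in AMINO_ACIDS)
        total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
        pssm.append({aa: math.log2(((counts[aa] + pseudocount) / total) / background)
                     for aa in AMINO_ACIDS})
    return pssm


def score_segment(pssm: list[dict[str, float]], segment: str) -> float:
    """Sum the position-specific scores of an ungapped segment."""
    return sum(row[aa] for row, aa in zip(pssm, segment))


# Hypothetical alignment: column 2 (T) is fully conserved, column 3 (P) less so.
pssm = build_pssm(["MTPA", "MTPG", "MTAA", "LTPA"])
print(round(pssm[1]["T"], 2), round(pssm[2]["P"], 2), round(pssm[2]["W"], 2))
```

As on the slide, the fully conserved position gets the highest score, the less conserved one a lower score, and residues never seen at a position score below zero.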

PHI-BLAST (Pattern-Hit Initiated BLAST)
PHI-BLAST searches for particular patterns in protein queries.
Combines matching of regular expressions with local alignments surrounding the match.
PHI-BLAST is preferable to just searching for pattern occurrences because it filters out cases where the pattern occurrence is probably random and not indicative of homology.
PHI-BLAST expects as input a protein query sequence and a pattern contained in that sequence.
PHI-BLAST limits alignments to those that match the provided pattern.
Statistical significance is reported using E-values as for other forms of BLAST, but the statistical method for computing the E-values is different.
PHI-BLAST is integrated with Position-Specific Iterated BLAST (PSI-BLAST), so that the results of a PHI-BLAST query can be used for PSI-BLAST.
Pattern: [C]-x(2)-[C]-x(10,16)-[H]-x(2,3)-[H]
Syntax for pattern at http://www.ncbi.nlm.nih.gov/blast/html/phisyntax.html
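To see what such a pattern matches, it can be translated into an ordinary regular expression. The sketch below handles only the constructs that appear in the example pattern (bracketed residue sets, single residues, x, x(n), and x(n,m)); it does not implement the full PHI-BLAST syntax, and the function name and test sequence are hypothetical.

```python
import re


def phi_pattern_to_regex(pattern: str) -> str:
    """Translate a subset of the pattern syntax into a Python regex string."""
    parts = []
    for element in pattern.split("-"):
        wildcard = re.fullmatch(r"x(?:\((\d+)(?:,(\d+))?\))?", element)
        if wildcard:
            low, high = wildcard.group(1), wildcard.group(2)
            if low is None:
                parts.append(".")                      # x: any one residue
            elif high is None:
                parts.append(".{%s}" % low)            # x(n): exactly n residues
            else:
                parts.append(".{%s,%s}" % (low, high)) # x(n,m): n to m residues
        elif re.fullmatch(r"\[[A-Z]+\]|[A-Z]", element):
            parts.append(element)                      # residue set or single residue
        else:
            raise ValueError(f"unsupported pattern element: {element!r}")
    return "".join(parts)


regex = phi_pattern_to_regex("[C]-x(2)-[C]-x(10,16)-[H]-x(2,3)-[H]")
print(regex)

# Hypothetical sequence containing one occurrence of the pattern.
sequence = "AACAAC" + "A" * 12 + "HAAAH" + "KK"
hit = re.search(regex, sequence)
print(hit.start(), hit.end())
```

A plain regex search like this only locates pattern occurrences; PHI-BLAST additionally requires a significant local alignment around each occurrence.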

Specialized BLAST
http://www.ncbi.nlm.nih.gov/blast/
Great tool! Multiple sequence alignment: COBALT