BLAST. compared with database sequences Sequences with many matches to high- scoring words are used for final alignments

Barbra Johnston
6 years ago
1 BLAST
About 100 times faster than dynamic programming. Good for database searches.
Derive a list of words of length w from the query (e.g., w = 3 for protein, 11 for DNA).
High-scoring words are compared with database sequences.
Sequences with many matches to high-scoring words are used for final alignments.
Protein-based searches are always more powerful than nucleotide-based searches of coding DNA in determining similarity and inferring homology.
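The word-list step can be sketched in a few lines of Python. This is only an illustration of "derive a list of words of length w from the query"; the function name `query_words` and the example sequence are made up for this sketch and are not part of BLAST itself.

```python
def query_words(query: str, w: int) -> list[str]:
    """Return every overlapping word of length w in the query, left to right."""
    # A query shorter than w yields an empty list.
    return [query[i:i + w] for i in range(len(query) - w + 1)]


# Hypothetical 6-residue protein query, word size 3 as on the slide.
print(query_words("MPQGLK", 3))  # ['MPQ', 'PQG', 'QGL', 'GLK']
```

A query of length L gives L - w + 1 words, so the list grows linearly with the query.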

2 BLAST (Basic Local Alignment Search Tool)
Example word PQG: P = 7 + Q = 5 + G = 6 (total 18).
In addition to the exact word, BLAST considers related words based on BLOSUM62: the neighborhood.
Once a word is aligned, gapped and ungapped extensions are initiated, tallying the cumulative score.
When the score drops more than X below the best score seen so far, the extension is terminated.
The extension is trimmed back to the maximum.
HSP = high-scoring segment pair.
Produces local alignments.
X = significance decay
S = min. score to return a BLAST hit
T = neighborhood score threshold
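The extension rule (stop when the score falls more than X below the best score, then trim back to the maximum) can be illustrated with a small ungapped, rightward-only extension. This is a sketch under stated assumptions: the match/mismatch scores (+5/-4) are a toy nucleotide scheme rather than BLOSUM62, and the names `score_pair` and `extend_right` are hypothetical.

```python
def score_pair(a: str, b: str, match: int = 5, mismatch: int = -4) -> int:
    """Toy scoring (assumption): +5 for identical letters, -4 otherwise."""
    return match if a == b else mismatch


def extend_right(query, subject, q_start, s_start, word_len, x_drop):
    """Extend a seed word to the right without gaps.

    Returns (best_score, best_length): the extension trimmed back to the
    point where the cumulative score was highest.
    """
    score = sum(score_pair(query[q_start + k], subject[s_start + k])
                for k in range(word_len))
    best_score, best_len = score, word_len
    length = word_len
    while q_start + length < len(query) and s_start + length < len(subject):
        score += score_pair(query[q_start + length], subject[s_start + length])
        length += 1
        if score > best_score:
            best_score, best_len = score, length
        elif best_score - score > x_drop:
            break  # dropped more than X below the best score: stop extending
    return best_score, best_len


# Seed of length 4 at position 0 in both sequences, X = 10.
print(extend_right("ACGTACGTTTTT", "ACGTACGAAAAA", 0, 0, 4, 10))  # (35, 7)
```

In the example the seven leading matches score 35; three mismatches then pull the running score down to 23, a drop of 12 > X, so the extension stops and the reported segment is the first seven positions.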

3 BLAST home page

4 BLASTP

5 BLAST databases
Peptide Sequence Databases
nr: non-redundant GenBank CDS translations + pdb + swissprot + pir + prf
RefSeq_protein: reference proteins
Swissprot: SWISS-PROT protein sequence database
pdb: sequences derived from 3-dimensional structures
Nucleotide Sequence Databases
nr: GenBank + EMBL + DDBJ + PDB (no EST, STS, GSS, WGS, or PAT)
est: Expressed Sequence Tags. 34 billion seq.!
htgs: Unfinished High Throughput Genomic Sequences: phases 0, 1 and 2
gss: Genome Survey Sequence
wgs: Whole Genome Shotgun Sequences. 148 billion sequences

6 BLAST Advanced options
-G Cost to open a gap [Integer]; default = 11
-E Cost to extend a gap [Integer]; default = 1
-e Expectation value (E) [Real]; default = 10
-W Word size; default is 11 for blastn, 3 for other programs
-b Number of alignments to show (B) [Integer]; default = 100
Special Cases
                 Default     Short Query       Large Sequence Family   Ungapped BLAST
Filter           on          off               on                      on
Scoring Matrix   BLOSUM62    PAM30-35          BLOSUM62                BLOSUM62
Word Size        3           3-2, 7 for DNA    3, 11 for DNA           3, 11 for DNA
E value or more
Gap costs 11, 1 9, 1 11, 1 4
Alignments

7 Report by species
Database: All nr GenBank CDS translations+pdb+swissprot+pir+prf
2,794,673 sequences; 957,836,323 total letters
Taxonomy reports
Query = Apetala1 P35631 (255 letters)
+ indicates conservative amino acid substitution
- indicates gap/insertion
XXXX shows areas of low complexity
CONSIDER TAXONOMIC RELATIONSHIP WHEN INTERPRETING SIMILARITY VALUES!

8 Format BLAST output
All sequences above the E value threshold are aligned beneath the query.
In the "with identities" views, identical residues are shown as dots.
Flat Query-Anchored
Query-Anchored with identities

9 Statistical significance
Chance alignments have no biological significance.
Statistical significance implies a low probability of generating the alignment by chance.
The probability of long chance alignments increases with longer sequences.
The extreme-value distribution
Used to calculate the probability of a chance alignment.
Generated by calculating the scores that result from repeatedly scrambling one of the sequences being compared.
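The scrambling procedure can be sketched as a small simulation: score the real pair, then score the same pair many times after shuffling one sequence, and see how often chance does as well. Everything here is an assumption made for illustration: the scoring values (+2/-1, linear gap -2), the function names, and the example sequences. It produces the empirical score distribution only; fitting an extreme-value distribution to it is not shown.

```python
import random


def local_score(a: str, b: str, match: int = 2, mismatch: int = -1,
                gap: int = -2) -> int:
    """Best Smith-Waterman local alignment score with a linear gap penalty."""
    prev = [0] * (len(b) + 1)
    best = 0
    for ca in a:
        curr = [0]
        for j, cb in enumerate(b, start=1):
            diag = prev[j - 1] + (match if ca == cb else mismatch)
            cell = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            curr.append(cell)
            best = max(best, cell)
        prev = curr
    return best


def shuffled_scores(a: str, b: str, trials: int = 200, seed: int = 0) -> list[int]:
    """Scores of a against `trials` random shufflings of b (same composition)."""
    rng = random.Random(seed)
    letters = list(b)
    scores = []
    for _ in range(trials):
        rng.shuffle(letters)
        scores.append(local_score(a, "".join(letters)))
    return scores


# Hypothetical sequences; the second differs from the first at two positions.
seq_a = "MKTAYIAKQRQISFVKSHFSRQ"
seq_b = "MKTAYIAKQRNISFVKSHYSRQ"
real = local_score(seq_a, seq_b)
null = shuffled_scores(seq_a, seq_b)
fraction = sum(s >= real for s in null) / len(null)
print(real, max(null), fraction)
```

Shuffling keeps the length and composition of the sequence, so the null scores reflect what those properties alone can produce.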

10 BLAST statistics
S' (bit score): calculated from the raw score S (sum of BLOSUM62 scores) by normalizing with the statistical variables that define a scoring system (K and λ). Bit scores from different alignments, even ones employing different scoring matrices, can be compared.
$$S' = \frac{\lambda S - \ln K}{\ln 2}$$
K = minor constant
λ = constant to adjust for the scoring matrix
S = raw score of the high-scoring segment pair (HSP)
E (expect) value: number of alignments with scores equivalent to or better than S' that are expected to occur in a database search by chance.
$$E = m\,n\,2^{-S'}$$
m = query size
n = database size
S' = bit score
m·n = search space
The E-value decreases exponentially as the score assigned to a match between two sequences increases.
The E-value depends on the size of the database and the scoring system in use.
When the E-value threshold is increased from the default value of 10, more hits can be reported. When it is reduced, only the more significant hits are reported.
The lower the E-value (or the higher the bit score), the more significant the hit.
Because the product m·n defines the search space, the same HSP may come out statistically significant in a small database and not significant in a large database.
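The two formulas on this slide are easy to evaluate directly. The sketch below does that in Python; the function names are hypothetical, and the λ and K values are assumptions chosen only for illustration (the real values depend on the scoring matrix and gap costs and are printed in each BLAST report). The query length and database size are the ones quoted on slide 7.

```python
import math


def bit_score(raw_score: float, lam: float, k: float) -> float:
    """Bit score S' = (lambda * S - ln K) / ln 2."""
    return (lam * raw_score - math.log(k)) / math.log(2)


def e_value(bits: float, m: int, n: int) -> float:
    """Expect value E = m * n * 2^(-S')."""
    return m * n * 2.0 ** (-bits)


# Assumed statistical parameters (illustrative only).
LAMBDA, K = 0.267, 0.041
m, n = 255, 957_836_323      # query letters, database letters (slide 7)

bits = bit_score(100, LAMBDA, K)
print(f"bit score = {bits:.1f}")
print(f"E (full database)      = {e_value(bits, m, n):.2e}")
print(f"E (database 1000x smaller) = {e_value(bits, m, n // 1000):.2e}")
```

Because E is proportional to m·n, shrinking the database by some factor shrinks E by the same factor, which is why the same HSP can look significant in a small database and not in a large one.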

11 P values
P: probability of finding at least one HSP with bit score S' or higher by chance.
Since it can be shown that the number of random HSPs with score S' or higher is described by a Poisson distribution, the probability of finding at least one such HSP is
$$P = 1 - e^{-E}$$
E = expect value
E = 10    -> P = 0.99995
E = 1     -> P = 0.63
E = 0.1   -> P = 0.095
E = 0.01  -> P = 0.01
E = 0.001 -> P = 0.001
P-values vary from 0 to 1, whereas E-values can be much greater than 1.
The BLAST programs report E-values, rather than P-values, because E-values of, for example, 5 and 10 are much easier to comprehend than P-values of 0.993 and 0.99995.
However, for E < 0.01, the P-value and E-value are nearly identical.
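The relation P = 1 - e^(-E) can be checked numerically with a few lines of Python. The function name `p_value` is just a label for this sketch; `math.expm1` is used so that very small E values do not lose precision.

```python
import math


def p_value(e: float) -> float:
    """Probability of at least one chance HSP, given expect value E."""
    # -expm1(-E) equals 1 - exp(-E) but stays accurate for tiny E.
    return -math.expm1(-e)


for e in (10, 5, 1, 0.1, 0.01, 0.001):
    print(f"E = {e:<6} -> P = {p_value(e):.5f}")
```

For small E the exponential is close to 1 - E, which is why P and E become nearly identical below about 0.01.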

12 BLAST Tips
Suggested BLAST cutoffs:
DNA: book suggests E values < 1e-6 (I use E < 1e-10)
Protein: book suggests E values < 1e-3
Consider evolutionary divergence in your results! The DNA mutation rate without selection is given per site per year; the figures here imply about 5 × 10^-9, since 10 million years (10^7) of divergence gives 5 × 10^-9 × 10^7 = 0.05 substitutions per site, i.e. ~95% identity.
BLAST search artifacts: repeated amino acid stretches (e.g. polyglutamine) or nucleotide repeats (e.g. ATATATATATATAT) result in meaningless positives with significant E values.
Use BLAST filters to mask low-complexity regions: the programs SEG for proteins and DUST for DNA.
Or customize masking using the lower-case letter option.
RepeatMasker can be used to mask repeats in lower-case letters.

13 MEGABLAST
Variation of BLASTN, 10 times faster.
Optimized for long or highly similar (>95%) sequences.
Ideal for finding whether a large sequence is part of a large contig or chromosome, finding sequencing errors, and comparing large similar sequences.
Uses a longer default word length (28 instead of 11).
Faster non-affine gap penalty: gap opening penalty = 0, gap extension penalty $E = r/2 - q$ (r = match reward, q = mismatch penalty).
Non-affine gapping tends to yield more gaps of shorter length.
Accepts multiple consecutive FASTA files as input.
Discontiguous MEGABLAST
Ideal for comparing divergent sequences from different organisms (<80% identity).
Uses a discontiguous word approach, different from other BLAST programs.
Nonconsecutive positions are examined over longer segments.

14 PSI-BLAST (Position Specific Iterative BLAST)
Designed to detect weak relationships.
The added sensitivity comes from the use of a profile that is constructed (automatically) from a multiple alignment.
The profile is generated by calculating a Position-Specific Scoring Matrix (PSSM) for every position in the alignment.
Also called profiles or Hidden Markov Models.
PSSMs are numerical representations of a multiple alignment.
A highly conserved position receives a high score.
The profile is used to perform additional searches (iterations), and the results of each iteration are used to refine the profile.
Each iteration uses a PSSM built from the previous iteration.
Continue searching iteratively until no new matches are identified: "convergence".
PSI-BLAST steps
BLASTP -> Multiple Alignment -> Construct PSSM -> Use PSSM to search
Construction of a PSSM
Each column in the alignment is a row in the PSSM.
Frequency of occurrence of a residue at each position.
Calculate the probability of each amino acid at each position.
T at position 8 is conserved = highest score, 150.
P at position 9 is less conserved = score 89.
Note the low scores of the aromatic residues F, Y and W relative to A in the P row.
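The PSSM construction described on this slide (one row per alignment column, scores derived from residue frequencies) can be sketched as follows. This is a deliberately simplified illustration, not PSI-BLAST's actual procedure: it assumes a uniform background frequency, a fixed pseudocount of 1, and no sequence weighting, and the names `build_pssm` and `score_sequence` as well as the toy alignment are hypothetical.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_pssm(alignment: list[str], pseudocount: float = 1.0) -> list[dict]:
    """Return one dict of log-odds scores per alignment column."""
    background = 1.0 / len(AMINO_ACIDS)   # assumption: uniform background
    pssm = []
    for column in zip(*alignment):        # each column becomes a PSSM row
        counts = Counter(c for c in column if c in AMINO_ACIDS)  # skips gaps
        total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
        row = {aa: math.log2(((counts[aa] + pseudocount) / total) / background)
               for aa in AMINO_ACIDS}
        pssm.append(row)
    return pssm


def score_sequence(pssm: list[dict], seq: str) -> float:
    """Sum the position-specific scores of seq against the PSSM."""
    return sum(row[aa] for row, aa in zip(pssm, seq))


# Toy alignment: T is fully conserved in column 2, P only partly in column 3.
alignment = ["ATP", "ATG", "ATP", "STA"]
pssm = build_pssm(alignment)
print(round(pssm[1]["T"], 2), round(pssm[2]["P"], 2), round(pssm[2]["W"], 2))
print(score_sequence(pssm, "ATP") > score_sequence(pssm, "WWW"))
```

The fully conserved T gets the highest score in its row, the partly conserved P a lower one, and residues never seen at a position get negative scores, mirroring the pattern described for positions 8 and 9 on the slide.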

15 PHI-BLAST (Pattern Hit Initiated BLAST)
PHI-BLAST searches for particular patterns in protein queries.
Combines matching of regular expressions with local alignments surrounding the match.
PHI-BLAST is preferable to just searching for pattern occurrences because it filters out cases where the pattern occurrence is probably random and not indicative of homology.
PHI-BLAST expects as input a protein query sequence and a pattern contained in that sequence.
PHI-BLAST limits alignments to those that match the provided pattern.
Statistical significance is reported using E-values as for other forms of BLAST, but the statistical method for computing the E-values is different.
PHI-BLAST is integrated with Position-Specific Iterated BLAST (PSI-BLAST), so that the results of a PHI-BLAST query can be used for PSI-BLAST.
Pattern: [C]-x(2)-[C]-x(10,16)-[H]-x(2,3)-[H]
Syntax for pattern at
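To see what the example pattern means, it can be translated into an ordinary regular expression. The converter below is a minimal sketch that handles only the elements appearing in this pattern (bracketed residue sets, single residues, `x`, and repeat counts such as `(2)` or `(10,16)`); it is not PHI-BLAST's own parser, and the function name and test sequence are hypothetical.

```python
import re

# One pattern element: [ABC], a single residue, or x, with an optional (n) or (n,m).
ELEMENT = re.compile(r"(\[[A-Z]+\]|[A-Z]|x)(?:\((\d+)(?:,(\d+))?\))?")


def phi_pattern_to_regex(pattern: str) -> str:
    """Convert a dash-separated pattern into Python regex syntax."""
    parts = []
    for element in pattern.split("-"):
        m = ELEMENT.fullmatch(element)
        if m is None:
            raise ValueError(f"unsupported pattern element: {element!r}")
        residue, low, high = m.groups()
        atom = "." if residue == "x" else residue   # x means any residue
        if low is None:
            repeat = ""
        elif high is None:
            repeat = "{%s}" % low
        else:
            repeat = "{%s,%s}" % (low, high)
        parts.append(atom + repeat)
    return "".join(parts)


regex = phi_pattern_to_regex("[C]-x(2)-[C]-x(10,16)-[H]-x(2,3)-[H]")
print(regex)  # [C].{2}[C].{10,16}[H].{2,3}[H]

# Made-up sequence containing one occurrence of the pattern, starting at index 2.
sequence = "AACKKC" + "G" * 12 + "HLLHAA"
print(re.search(regex, sequence).start())  # 2
```

Finding the pattern this way only locates occurrences; the point made on the slide is that PHI-BLAST additionally requires a significant local alignment around each occurrence.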

16 Specialized BLAST
Great tool!
Multiple Sequence Alignment: COBALT
