Alignment to a database November 3, 2016
How do you create a database?
- 1982: GenBank (at LANL, 2000 sequences)
- 1988: A way to search GenBank (FASTA)
FASTA
- Find regions of identity (SW)
- Score & save best
- Choose regions for banded alignment
- Optimal realignment with gaps
Genome Project
- 1982: GenBank (at LANL, 2000 sequences)
- 1988: A way to search GenBank (FASTA)
- 1988: Try to give GenBank to the librarians (NLM)
- 1990: NCBI established
Genome Project
- 1990: Basic Local Alignment Search Tool published
- 1992: NCBI gets GenBank and LANL wants it back
- 1992-2007: GenBank size doubles every 18 months
- 2007-present: GenBank growing frighteningly quickly
- October 2016, release 216: 220,731,315,250 bases in 197,390,691 sequences, plus 1,676,238,489,250 bases in 363,213,315 WGS records
Why align to a database?
- Align an unknown sequence to an annotated genome to discover function
- Search RNA and EST databases to see if a sequence is expressed
- mRNA-to-genomic alignment for gene and isoform structure
- Search for unexpected conservation between sequences
BLAST: Basic Local Alignment Search Tool
- Rapid comparison of a query sequence against a database of nucleotide or protein sequences
- Why not use dynamic programming? It's guaranteed to find the optimal answer!
- Dynamic programming takes far too long and requires too much memory on even a moderately sized database
- BLAST is an efficient and effective alternative to dynamic programming
BLAST: how does it work?
- Looks for small, high-scoring sequence matches to an indexed database
- Extends the matches when it finds them, to create longer high-scoring matches
- Alignment scores are based on PAM/BLOSUM or gap/match/mismatch
BLAST: how does it really work?
- Begin with a matrix of similarity scores for all possible residues; compile a list of high-scoring words in the query
- Scan the indexed database for exact word hits (word length is a parameter)

query:           ACTTGTGAACAT
words:           ACTTGTG CTTGTGA TTGTGAA TGTGAAC GTGAACA TGAACAT
database match:  TGTGAAC in TAGGCTTGTGAACAGT
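The word list and database scan above can be sketched in a few lines of Python. This is only an illustration of exact-word seeding with a word length of 7, as in the slide's example; the function names are made up here, and the neighborhood words that protein BLAST derives from a scoring matrix are not modeled.

```python
def query_words(query, w):
    """Return (offset, word) for every overlapping word of length w in the query."""
    return [(i, query[i:i + w]) for i in range(len(query) - w + 1)]


def index_database(db, w):
    """Map each length-w word in the database to the positions where it starts."""
    index = {}
    for j in range(len(db) - w + 1):
        index.setdefault(db[j:j + w], []).append(j)
    return index


def find_hits(query, db, w):
    """Return (query_offset, db_offset, word) for every exact word hit."""
    index = index_database(db, w)
    return [(i, j, word)
            for i, word in query_words(query, w)
            for j in index.get(word, [])]


query = "ACTTGTGAACAT"
database = "TAGGCTTGTGAACAGT"
words = [word for _, word in query_words(query, 7)]
hits = find_hits(query, database, 7)
for q_off, d_off, word in hits:
    print(word, "query offset", q_off, "database offset", d_off)
```

Besides the TGTGAAC hit shown on the slide, the neighboring words CTTGTGA, TTGTGAA and GTGAACA also hit, all on the same diagonal (database offset minus query offset is 3), so they all lead to the same extended match.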
BLAST: how does it really work?
- Extend the match to create a maximal scoring pair (MSP)
- Stop extending when the score drops below a threshold; trim backward to get the maximal score
- Scoring: match +1, mismatch -1

query:        ACTTGTGAACAT
database:  TAGGCTTGTGAACAGT
Score at successive extension steps: 7, 8, 10, 9
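A minimal sketch of ungapped extension, using the slide's scoring (match +1, mismatch -1) and the TGTGAAC word hit. The stopping rule used here is an X-drop rule: extension in one direction ends once the running score falls `x_drop` below the best score seen so far. The value `x_drop=2` and the function name are assumptions chosen for the example, not parameters from the slides.

```python
def extend_hit(query, db, q_start, d_start, w, match=1, mismatch=-1, x_drop=2):
    """Extend an exact word hit in both directions without gaps.

    Returns (score, q_begin, q_end, d_begin, d_end); the end indices are
    exclusive. Each direction is trimmed back to its best-scoring length.
    """
    def extend(qi, di, step):
        best = running = 0
        best_len = length = 0
        while 0 <= qi < len(query) and 0 <= di < len(db):
            running += match if query[qi] == db[di] else mismatch
            length += 1
            if running > best:
                best, best_len = running, length
            elif best - running >= x_drop:
                break  # score dropped too far below the best: stop extending
            qi += step
            di += step
        return best, best_len

    seed_score = w * match  # the seed is an exact match of w residues
    right_score, right_len = extend(q_start + w, d_start + w, +1)
    left_score, left_len = extend(q_start - 1, d_start - 1, -1)
    score = seed_score + left_score + right_score
    return (score,
            q_start - left_len, q_start + w + right_len,
            d_start - left_len, d_start + w + right_len)


query = "ACTTGTGAACAT"
database = "TAGGCTTGTGAACAGT"
msp = extend_hit(query, database, q_start=3, d_start=6, w=7)
score, q_begin, q_end, d_begin, d_end = msp
print(score, query[q_begin:q_end], database[d_begin:d_end])
```

For this pair the seed scores 7, the extension peaks at 10 (CTTGTGAACA in both sequences), and the mismatches on either side are trimmed away.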
BLAST: how does it really work?
- BLAST avoids low-complexity regions: it tabulates all k-tuples in the database DNA (k is usually around 8) and filters those that occur more frequently than some parameter
- BLAST has a mask-at-hash option that allows you to extend through the filtered regions
- Later versions of BLAST require two neighboring word hits to extend, which reduces the number of extensions sevenfold

CAGCCTCTTACCAGCTTAGCTACAGTTGATTTCTCGGTCAGGCTCTTACCAGCT
CAGGCTATTATTAGCTTAGCTACAGTAGATTTCTCGGTCAGGCTGGTACCATCT
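The k-tuple frequency filter can be illustrated as follows. This is a sketch of the idea only (count every k-tuple, then refuse to seed from those seen more than `max_count` times); the toy sequence, `k=4` and `max_count=2` are assumptions for the example, and BLAST's actual low-complexity filters are more involved.

```python
from collections import Counter


def overrepresented_kmers(db, k, max_count):
    """Return the set of k-tuples that occur more than max_count times in db."""
    counts = Counter(db[i:i + k] for i in range(len(db) - k + 1))
    return {kmer for kmer, n in counts.items() if n > max_count}


def usable_seed_positions(db, k, max_count):
    """Return start positions whose k-tuple is not over-represented.

    Filtered positions are only excluded from seeding; the sequence itself is
    left unchanged, so an extension could still run through them.
    """
    frequent = overrepresented_kmers(db, k, max_count)
    return [i for i in range(len(db) - k + 1) if db[i:i + k] not in frequent]


toy_db = "ATATATATATATGCGTACGTTAGC"  # a simple repeat followed by unique sequence
frequent = overrepresented_kmers(toy_db, k=4, max_count=2)
seeds = usable_seed_positions(toy_db, k=4, max_count=2)
print(sorted(frequent), seeds)
```

Only the words inside the AT repeat are filtered, so seeding starts where the unique sequence begins.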
Choice of parameters
- Time required = time to compile list of words + time to scan database + time to extend all hits
- You can modify both the word size and the threshold
- Increased word size = fewer hits, but a greater number of words
- Initial word score threshold T will pare down the number of hits to be extended
BLAST statistics: Karlin-Altschul statistics
- We don't know what the a priori score distribution looks like.
- In fact, we're looking for the maximum of a set of independent and identically distributed variables, which follows something more like an extreme value distribution.
BLAST statistics: Karlin-Altschul statistics
- The expected number of HSPs with score at least S is: $E = K m n \, e^{-\lambda S}$
- This is the E-value for the score S.
- K and λ are the Karlin-Altschul parameters.
- m and n are the lengths of the two sequences being compared (query and database).
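A short numerical sketch of the E-value formula. The values of K and λ below are illustrative placeholders (they depend on the scoring system and are not given in the slides), as are the lengths m and n. The bit score, S' = (λS - ln K) / ln 2, is a standard rescaling under which E = m n 2^(-S'); it is included only to show that the two forms agree.

```python
import math


def e_value(score, m, n, K, lam):
    """Expected number of HSPs with score >= `score`: E = K * m * n * exp(-lam * score)."""
    return K * m * n * math.exp(-lam * score)


def bit_score(score, K, lam):
    """Normalized score S' = (lam * S - ln K) / ln 2."""
    return (lam * score - math.log(K)) / math.log(2)


# Assumed, illustrative parameter values (not from the slides).
K, LAM = 0.134, 0.318
m, n = 250, 1_000_000  # query length and database length

for s in (30, 40, 50):
    print(s, e_value(s, m, n, K, LAM), bit_score(s, K, LAM))
```

Because E is proportional to m and n, doubling the database size doubles the E-value of the same raw score, while each increase of the score by ln 2 / λ halves it.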
BLAST statistics
[Figure: probability density of the normal distribution compared with the extreme value distribution; x axis from -5 to 5, probability axis from 0 to 0.40]
Gapped BLAST
- We have talked about ungapped BLAST so far.
- The statistics for gapped BLAST are trickier, and they are not mathematically complete.
- Affine gapped BLAST score = #matches * match score + #mismatches * mismatch penalty + #gaps * gap opening penalty + total gap length * extension penalty

ACTTGTGCATT
ACAT-TG--TT

Things to consider when choosing a gap penalty:
- Both the opening (g) and extension (r) penalties should be nonzero
- g + r should be greater than the max score for a match if you want gaps to be rarer than substitutions
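The affine score formula can be checked against the example alignment with a small sketch. The penalty values (match +1, mismatch -1, gap opening -2, gap extension -1) are assumptions chosen for illustration; following the formula above, the extension penalty is charged for every gap position, including the first.

```python
def affine_score(top, bottom, match=1, mismatch=-1, gap_open=-2, gap_extend=-1):
    """Score a pairwise alignment with affine gap penalties.

    Returns (score, (matches, mismatches, gaps_opened, total_gap_length)).
    """
    if len(top) != len(bottom):
        raise ValueError("aligned sequences must have the same length")
    matches = mismatches = gaps_opened = gap_length = 0
    previous_gap_row = None  # row holding the gap in the previous column, if any
    for a, b in zip(top, bottom):
        if a == "-" or b == "-":
            row = "top" if a == "-" else "bottom"
            if row != previous_gap_row:
                gaps_opened += 1  # a new gap starts in this column
            gap_length += 1
            previous_gap_row = row
        else:
            previous_gap_row = None
            if a == b:
                matches += 1
            else:
                mismatches += 1
    score = (matches * match + mismatches * mismatch
             + gaps_opened * gap_open + gap_length * gap_extend)
    return score, (matches, mismatches, gaps_opened, gap_length)


score, counts = affine_score("ACTTGTGCATT", "ACAT-TG--TT")
print(score, counts)
```

The example alignment has 7 matches, 1 mismatch, 2 gaps and a total gap length of 3, so with these assumed penalties the score is 7 - 1 - 4 - 3 = -1.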
PSI-BLAST: Position-specific iterated BLAST
1. Database search with query
2. Look to see if newest hits are significantly related to query
3. If yes, repeat #1 and 2
4. If no, finish
Creates a PSSM (position-specific scoring matrix)
PSI-BLAST and PSSMs
PSSM:
- Gapless alignment matrix
- Add pseudocounts to avoid tuning to most closely related sequences
- Align to database with very high gap penalties
- Generally use dynamic programming to align
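A toy sketch of building a PSSM from a gapless alignment with pseudocounts. It uses log-odds scores against a uniform background and a plain sliding-window scan instead of dynamic programming; both are simplifying assumptions, and PSI-BLAST's own sequence weighting and pseudocount scheme are more elaborate. The four-sequence DNA alignment is invented for the example.

```python
import math


def build_pssm(alignment, alphabet, pseudocount=1.0):
    """Build a log-odds PSSM (one dict per column) from a gapless alignment.

    Assumes a uniform background frequency of 1 / len(alphabet).
    """
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("all aligned sequences must have the same length")
    background = 1.0 / len(alphabet)
    pssm = []
    for col in range(length):
        column = [seq[col] for seq in alignment]
        total = len(column) + pseudocount * len(alphabet)
        pssm.append({
            residue: math.log2(((column.count(residue) + pseudocount) / total) / background)
            for residue in alphabet
        })
    return pssm


def score_window(pssm, window):
    """Sum the column scores of a window as long as the PSSM."""
    return sum(column[residue] for column, residue in zip(pssm, window))


def best_window(pssm, sequence):
    """Return (score, start) of the best-scoring gapless window in the sequence."""
    width = len(pssm)
    return max((score_window(pssm, sequence[i:i + width]), i)
               for i in range(len(sequence) - width + 1))


alignment = ["ACGT", "ACGA", "ACGT", "TCGT"]  # invented toy alignment
pssm = build_pssm(alignment, "ACGT")
print(best_window(pssm, "GGACGTGG"))
```

Without the pseudocount, any residue never observed in a column would score minus infinity, and the matrix could only recognize sequences nearly identical to those already in the alignment.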
PSI-BLAST and PSSMs
- PSI-BLAST performs well compared to other motif-finding programs
- More sensitive to weak but biologically relevant similarities
- Can use the resulting PSSMs to score other alignments, or in PHI-BLAST, rpsblast (finding conserved domains), etc.
PSI-BLAST
PHI-BLAST: Pattern hit initiated BLAST
- Investigator supplies a complex pattern to be searched against the database of interest
- Can use PSSMs created by PSI-BLAST
- Very sensitive
- Very fast
BLAT
- Designed to find DNA sequences 30+ bp long with > 95% identity, or protein sequences with greater than 80% similarity over 20 amino acids or more
- DNA searches work best between primates, protein searches among land vertebrates
- Keeps an index of all non-overlapping 11-mers of the entire genome in memory (not repeats though)
- Takes up < 1 GB RAM
- DNA word size 11, protein 4
- Written by Jim Kent, free.
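The indexing idea can be sketched as follows: the genome is cut into non-overlapping k-mers (step k), and every overlapping k-mer of the query (step 1) is looked up in that index. This is an illustration only, with a toy genome, k = 4 instead of 11, no repeat exclusion, and hypothetical function names; it is not BLAT's implementation.

```python
def build_index(genome, k):
    """Index the non-overlapping k-mers of the genome (tiles at 0, k, 2k, ...)."""
    index = {}
    for pos in range(0, len(genome) - k + 1, k):
        index.setdefault(genome[pos:pos + k], []).append(pos)
    return index


def find_seeds(index, query, k):
    """Look up every overlapping k-mer of the query; return (query_pos, genome_pos)."""
    seeds = []
    for q in range(len(query) - k + 1):
        for g in index.get(query[q:q + k], []):
            seeds.append((q, g))
    return seeds


genome = "ACGTTGCAAGGCTTAACCGGATCG"  # toy sequence, not real data
index = build_index(genome, k=4)
query = genome[6:18]                 # a query copied from the genome
seeds = find_seeds(index, query, k=4)
print(seeds)
```

Tiling the genome rather than indexing every position shrinks the index by about a factor of k, which is what lets a whole genome fit in memory; a query that is long enough still covers at least one complete tile, so a near-identical match is still seeded.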
Repeats
The repeat problem
- Genomes, especially those of vertebrates (not pufferfish though) and plants, are highly repetitive:
  - Transposons (DNA transposons and retrotransposons)
  - Simple sequence, centromeres, telomeres
  - Other semicomplex repeats of uncertain purpose
- If a large sequence is searched against a repeat-laden database, you'll just get the repeats
- Solution: pre-mask known repeats. Is this a good idea?
>sequence1 gcgttgctggcgtttttccataggctccgcccccctgacgagcatcacaaaaatcgacgc ggtggcgaaacccgacaggactataaagataccaggcgtttccccctggaagctccctcg tgttccgaccctgccgcttaccggatacctgtccgcctttctcccttcgggaagcgtggc tgctcacgctgtaggtatctcagttcggtgtaggtcgttcgctccaagctgggctgtgtg ccgttcagcccgaccgctgcgccttatccggtaactatcgtcttgagtccaacccggtaa agtaggacaggtgccggcagcgctctgggtcattttcggcgaggaccgctttcgctggag atcggcctgtcgcttgcggtattcggaatcttgcacgccctcgctcaagccttcgtcact ccaaacgtttcggcgagaagcaggccattatcgccggcatggcggccgacgcgctgggct ggcgttcgcgacgcgaggctggatggccttccccattatgattcttctcgcttccggcgg cccgcgttgcaggccatgctgtccaggcaggtagatgacgaccatcagggacagcttcaa cggctcttaccagcctaacttcgatcactggaccgctgatcgtcacggcgatttatgccg caagtcagaggtggcgaaacccgacaaggactataaagataccaggcgtttcccctggaa gcgctctcctgttccgaccctgccgcttaccggatacctgtccgcctttctcccttcggg ctttctcattgctcacgctgtaggtatctcagttcggtgtaggtcgttcgctccaagctg acgaaccccccgttcagcccgaccgctgcgccttatccggtaactatcgtcttgagtcca acacgacttaacgggttggcatggattgtaggcgccgccctataccttgtctgcctcccc gcggtgcatggagccgggccacctcgacctgaatggaagccggcggcacctcgctaacgg ccaagaattggagccaatcaattcttgcggagaactgtgaatgcgcaaaccaacccttgg ccatcgcgtccgccatctccagcagccgcacgcggcgcatctcgggcagcgttgggtcct gcgcatgatcgtgctagcctgtcgttgaggacccggctaggctggcggggttgccttact atgaatcaccgatacgcgagcgaacgtgaagcgactgctgctgcaaaacgtctgcgacct atgaatggtcttcggtttccgtgtttcgtaaagtctggaaacgcggaagtcagcgccctg
>sequence2 gaattccggaagcgagcaagagataagtcctggcatcagatacagttggagataaggacg gacgtgtggcagctcccgcagaggattcactggaagtgcattacctatcccatgggagcc atggagttcgtggcgctgggggggccggatgcgggctcccccactccgttccctgatgaa gccggagccttcctggggctgggggggggcgagaggacggaggcgggggggctgctggcc tcctaccccccctcaggccgcgtgtccctggtgccgtgggcagacacgggtactttgggg accccccagtgggtgccgcccgccacccaaatggagcccccccactacctggagctgctg caacccccccggggcagccccccccatccctcctccgggcccctactgccactcagcagc gggcccccaccctgcgaggcccgtgagtgcgtcatggccaggaagaactgcggagcgacg gcaacgccgctgtggcgccgggacggcaccgggcattacctgtgcaactgggcctcagcc tgcgggctctaccaccgcctcaacggccagaaccgcccgctcatccgccccaaaaagcgc ctgcgggtgagtaagcgcgcaggcacagtgtgcagccacgagcgtgaaaactgccagaca tccaccaccactctgtggcgtcgcagccccatgggggaccccgtctgcaacaacattcac gcctgcggcctctactacaaactgcaccaagtgaaccgccccctcacgatgcgcaaagac ggaatccaaacccgaaaccgcaaagtttcctccaagggtaaaaagcggcgccccccgggg gggggaaacccctccgccaccgcgggagggggcgctcctatggggggagggggggacccc tctatgccccccccgccgccccccccggccgccgccccccctcaaagcgacgctctgtac gctctcggccccgtggtcctttcgggccattttctgccctttggaaactccggagggttt tttggggggggggcggggggttacacggcccccccggggctgagcccgcagatttaaata ataactctgacgtgggcaagtgggccttgctgagaagacagtgtaacataataatttgca cctcggcaattgcagagggtcgatctccactttggacacaacagggctactcggtaggac cagataagcactttgctccctggactgaaaaagaaaggatttatctgtttgcttcttgct gacaaatccctgtgaaaggtaaaagtcggacacagcaatcgattatttctcgcctgtgtg aaattactgtgaatattgtaaatatatatatatatatatatatatctgtatagaacagcc tcggaggcggcatggacccagcgtagatcatgctggatttgtactgccggaattc