Annotation. Repeated sequences

1 Annotation Repeated sequences The premier tool for finding repeated sequences is RepeatMasker, which uses a large library of known repeat elements from many species. Tarailo-Graovac M, Chen N. Using RepeatMasker to identify repetitive elements in genomic sequences. Current Protoc. Bioinf. 4, 10-14.

2 Annotation Finding genes Comparison to known sequences (extrinsic comparison) RNA, same or other species (Blast) proteins in other species (Blast) DNA, coding regions are more conserved than non-coding (LAGAN) Ab initio prediction (intrinsic comparison) intrinsic differences between coding and non-coding known sites/signals such as splice junctions and terminators

3 Sequence Comparison Sequence Database Searching Related sequences (homologs) have runs of similar bases or residues with few gaps insertion deletion mutations are much less common than missense constraints on structure and function keep sequences corresponding to proteins, and especially the interior of proteins, from changing Unrelated sequences have some random level of similarity truly random matches biased composition generic patterns of sequences codon choice amino acid choice and usage patterns

4 Sequence Comparison Sequence database searching

5 Sequence Comparison BLAST procedure Step 1: Compile list of high scoring words from query sequence Step 2: Scan database for "hits" Step 3: Extend regions with 2 hits into MSPs Step 4: Dynamic programming alignment around MSPs

6 Sequence Comparison BLAST Step 1 - List of High Scoring Words Choose a significance level S Choose a word size, w, and cutoff, T, so that you are unlikely to miss MSPs with score S Make a table of all words in the "neighborhood" of the query (DNA sequences use all words) Typically 50 words for each residue

7 Sequence Comparison BLAST Step 2 - Scan Database Scan only for words in neighborhood Use lookup tables (like FASTA) or finite automaton Keep data in memory to make it faster
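The word list and database scan of Steps 1 and 2 can be sketched for the DNA case, where all exact words of the query are used. This is a minimal illustration, not BLAST's actual implementation; the function names, the word size of 4, and the two short sequences are assumptions chosen for the example.

```python
from collections import defaultdict

def build_word_table(query, w):
    """Map every length-w word of the query to the positions where it starts."""
    table = defaultdict(list)
    for i in range(len(query) - w + 1):
        table[query[i:i + w]].append(i)
    return table

def scan_for_hits(table, subject, w):
    """Return (query_pos, subject_pos) pairs where a query word occurs in the subject."""
    hits = []
    for j in range(len(subject) - w + 1):
        # .get() avoids adding new keys to the defaultdict during the scan
        for i in table.get(subject[j:j + w], ()):
            hits.append((i, j))
    return hits

query = "TGCAATCGATCG"      # illustrative sequences, not real data
subject = "AGCTCGTGATCC"
table = build_word_table(query, 4)
hits = scan_for_hits(table, subject, 4)
# Each hit lies on the diagonal subject_pos - query_pos
diagonals = [j - i for i, j in hits]
```

Hits that share a diagonal are the candidates for extension in Step 3. For proteins, the table would also hold neighborhood words scoring at least T, which this sketch leaves out.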

8 Database Searching BLAST Maximal Segment Pairs (MSP)
Highest scoring pair of identical length segments from two sequences
Local alignment without gaps
Expected distribution is known!
In BLAST2, a diagonal must have two word hits before extension to an MSP is attempted.
In principle, one must examine the diagonal until the score drops to zero
Shortcut: only check until the score drops by X
[Figure: running sum of scores along a diagonal (match = +1, mismatch = -1), starting from an initial identically matching word, with the MSP and two potential MSPs marked, for the sequences
TGCAATCGATCGTCGTCCGTATACA
AGCTCGTGATCCTGGTGGGATCGGT]
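The extension shortcut can be sketched as follows: walk along the diagonal, keep a running sum, remember the best score seen, and stop once the running sum has fallen more than X below that best. This is a simplified one-directional sketch using the match = +1, mismatch = -1 scores from the slide; the function name and arguments are assumptions for illustration.

```python
def extend_right(a, b, i, j, x_drop, match=1, mismatch=-1):
    """Ungapped extension to the right, starting at a[i] and b[j].

    Returns (best_score, best_length). The walk stops when the running
    score drops more than x_drop below the best score seen so far.
    """
    score = best = 0
    best_len = 0
    k = 0
    while i + k < len(a) and j + k < len(b):
        score += match if a[i + k] == b[j + k] else mismatch
        k += 1
        if score > best:
            best, best_len = score, k
        elif best - score > x_drop:
            break  # give up: the segment is unlikely to recover
    return best, best_len
```

A full implementation would also extend to the left from the word hit and add the two scores.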

9 Sequence Comparison [Figure: normal distribution, showing probability and cumulative probability]

10 Sequence Comparison Statistics Sequence matching is not normal, it is extreme! Scores follow an extreme value (Gumbel) distribution, so a Z score can't be directly converted to a probability. An extreme value distribution arises whenever you are looking at a distribution of maxima: the longest run of heads in a series of coin tosses, or the maximum score for each sequence in a database. Sequence matches are a lot like coin tosses! PTVQGLRLFE :: : : PTAAGQELLS

11 Sequence Comparison Sequence Database Searching [Figure: extreme value distribution, showing probability and cumulative probability versus run length]

12 Sequence Comparison BLAST is based on Significant MSPs
Scoring system: must have at least one positive score, and the expected score must be less than zero, $E = \sum_i f_i s_i < 0$
Probability of an MSP scoring higher than S: $P(\mathrm{MSP} > S) \approx K N e^{-\lambda S}$
N = size of data; K and $\lambda$ are constants
Karlin, S., and Altschul, S.F., Proc. Natl. Acad. Sci. 87, 1990.
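Both conditions on the scoring system, and the Karlin-Altschul estimate itself, are easy to check numerically. The sketch below assumes the decay constant (called `lam` here) and K are already known for the scoring system; the function names and the DNA example values are illustrative assumptions.

```python
import math

def expected_score(freqs, scores):
    """Expected score per aligned pair: sum of f_i * s_i over all outcomes."""
    return sum(f * s for f, s in zip(freqs, scores))

def is_valid_scoring_system(freqs, scores):
    """At least one positive score and a negative expected score."""
    return max(scores) > 0 and expected_score(freqs, scores) < 0

def msp_probability(S, N, K, lam):
    """Karlin-Altschul estimate K * N * exp(-lam * S) for an MSP scoring above S."""
    return K * N * math.exp(-lam * S)

# Random DNA with equal base frequencies: a match occurs 1/4 of the time.
dna_expected = expected_score([0.25, 0.75], [+1, -1])
```

With match = +1 and mismatch = -1 the expected score is -0.5, so this simple DNA scheme satisfies both conditions.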

13 Sequence Comparison Sequence database searching Final step dynamic programming alignment (with gaps) around the MSPs

14 Sequence Comparison Dynamic programming alignments Alignment: provides a one-to-one picture of the residues or bases in the sequences that correspond. [Figure: gapped alignment of two protein sequences, residues 1-100] The computational problem is putting in gaps.

15 Sequence Comparison Dynamic Programming Alignments Calculation of alignment score using affine gap model
[Figure: gapped alignment of sequences BOVGH and A02321]
Scoring matrix: identical bases (A/A, C/C, G/G, T/T) = 4, mismatches = -3
Gap opening = -12, gap extension = -4
Positive matches: 52; negative matches: -9; gaps: -40; overall score: 52 - 9 - 40 = 3

16 Sequence Comparison Dynamic Programming Alignments Much faster than brute force but too slow for searching
Dynamic programming is a rigorous method
 Rigorous: no approximations; all numbers and positions of gaps are allowed, so EVERY possible alignment is considered
 Requires time proportional to product of sequence lengths, O(nm)
Dynamic programming yields an optimal alignment
 Scores for matches/mismatches, $s_{ij}$; penalties for gaps, $g$
 $\mathrm{Score} = \sum s_{ij} + \sum g$
Dynamic programming uses an affine gap model
 Lower penalty to EXTEND a gap than to CREATE or open a new gap
 Gap_Score = Gap_Open + Gap_extend x Gap_Length
 Generally Gap_Score is a negative score, i.e., a penalty
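To make the affine gap formula concrete, the sketch below scores an alignment that has already been written out with "-" for gaps. It does not search for the best alignment. The default parameters (match 4, mismatch -3, gap opening -12, gap extension -4) follow the example on the previous slide; the function name and test sequences are assumptions.

```python
def score_alignment(a, b, match=4, mismatch=-3, gap_open=-12, gap_extend=-4):
    """Score two aligned, equal-length strings under an affine gap model.

    Each run of gap columns costs gap_open + gap_extend * run_length.
    Simplification: adjacent gaps in different sequences count as one run.
    """
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    score = 0
    in_gap = False
    for x, y in zip(a, b):
        if x == "-" or y == "-":
            if not in_gap:
                score += gap_open      # paid once per gap
            score += gap_extend        # paid for every gap position
            in_gap = True
        else:
            score += match if x == y else mismatch
            in_gap = False
    return score
```

For example, one gap of length 2 costs -12 + 2 x (-4) = -20, whereas two separate gaps of length 1 would cost 2 x (-12 - 4) = -32, which is why the model favors fewer, longer gaps.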

17 Sequence Comparison Dynamic Programming Alignment Global alignments Use all positions in both sequences Use any scoring scheme (e.g., all positive) Best for sequences that you know are homologous (end-to-end) Local alignments Includes only the best matching parts of each sequence Multiple matching regions can be found by finding non-intersecting alignments (see, for example, LALIGN in the FastA package) Technical differences from global alignment Scoring system must have average score < 0 Values in score matrix are truncated at zero Highest score may be anywhere in score matrix Best for sequences where you do not expect the whole sequence to match (i.e., most of the time)
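The three technical differences of local alignment show up directly in the recurrence. The following is a minimal Smith-Waterman style sketch that returns only the best local score; for brevity it uses a single linear gap penalty rather than the affine model, and the scoring values are assumptions chosen so that the average score is negative.

```python
def local_alignment_score(a, b, match=1, mismatch=-1, gap=-2):
    """Best local alignment score of a and b (linear gap penalty)."""
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        cur = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Truncate at zero: a local alignment may start anywhere
            cur[j] = max(0, diag, prev[j] + gap, cur[j - 1] + gap)
            # The highest score may be anywhere in the matrix
            best = max(best, cur[j])
        prev = cur
    return best
```

Removing the `0` from the `max` and reading the score from the last cell instead of the running maximum turns this into a global alignment.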

18 Sequence Comparison Dynamic Programming Alignments Problems There may be equally optimal, or nearly optimal, alignments that are significantly different. Cannot align sequences with rearrangements (e.g., ABC vs. CAB). Can be misleading when sequences have repeats (especially if there are different numbers of repeats in each sequence). Aligned region may extend into regions that have low or no similarity, or in the case of global alignments, may align regions that are completely random (if the true match is small).

19 Database Searching Scoring Systems - Log-odds matrices A log-odds scoring system evaluates the relative probabilities of a match representing true homology versus the chance that a match occurs at random, i.e., the relative probability of two models: $s_{ij} = \ln\left( q_{ij} / (p_i p_j) \right)$. Normally, one multiplies probabilities; since these are log probabilities, you get the total probability by adding them up. When added up over a matching segment, you get the probability that the segment represents homology relative to the probability that it represents a random match, i.e., how much more likely than chance it is that the matching segment represents homology.

20 Database Searching Target frequencies Karlin and Altschul showed that for MSPs (Maximal Segment Pairs), amino acids $a_i$ and $a_j$ will be aligned with frequency approaching $q_{ij} = p_i p_j e^{\lambda s_{ij}}$, where $p_i$ and $p_j$ are the expected probabilities of observing the amino acid residues and $s_{ij}$ is the match score. A given scoring matrix will try to align the residues according to the above equation, so the $q_{ij}$ are a characteristic set of target frequencies for the scoring matrix S. The correct scoring system is the one in which the target frequencies are the same as the frequencies of the actual aligned residues.

21 Database Searching Scoring Systems - Log-odds matrices By rearranging the target frequency equation from the previous slide, we get: $s_{ij} = \ln\left( q_{ij} / (p_i p_j) \right) / \lambda$. All scoring systems can therefore be looked on as log-odds matrices with an implied set of target frequencies! Since multiplying a log-odds scoring system by a constant won't change the relative score for local alignments, $\lambda$ can be looked on as a scaling factor that we can choose to suit our convenience. One convenient choice for $\lambda$ is $\ln 2$, so that the scores can be thought of as representing bits.
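The relationship between scores and target frequencies can be checked in both directions. This sketch uses the bit scaling (base-2 logarithms, i.e., a scaling factor of ln 2); the function names and the example frequencies are assumptions for illustration only.

```python
import math

def log_odds_bits(q_ij, p_i, p_j):
    """Score in bits for a pair aligned with target frequency q_ij."""
    return math.log2(q_ij / (p_i * p_j))

def target_frequency(s_bits, p_i, p_j):
    """Target frequency implied by a score in bits: p_i * p_j * 2**s."""
    return p_i * p_j * 2 ** s_bits

# A pair seen twice as often in true alignments as expected by chance
# scores +1 bit; a pair seen exactly as often as chance scores 0.
one_bit = log_odds_bits(0.125, 0.25, 0.25)
zero_bits = log_odds_bits(0.0625, 0.25, 0.25)
```

Pairs that occur less often in real alignments than by chance get negative scores, which is what keeps the expected score below zero.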

22 Database Searching Scoring Systems - Information A bit of information is the amount of information needed to distinguish between 2 possibilities, i.e., one yes-no question. Taking $\lambda$ as $\ln 2$, the Karlin-Altschul equation $p = K N e^{-\lambda S}$ becomes $p = K N 2^{-S}$. Rearranging gives the score required to find a given number of MSPs with score S: $S = \log_2(K/p) + \log_2 N$. K is generally about 0.1, so the first term basically disappears. The amount of information required to distinguish an MSP from chance therefore depends almost entirely on $\log_2 N$, the size of the comparison. N is the product of the lengths for two sequences, or the size of the database times the sequence length for a search.

23 Database Searching Scoring Systems - Information How much information do you need to find something interesting?
An MSP of about 16 bits is required for significance in a pairwise comparison of two 250 long sequences: $\log_2(250 \times 250) \approx 15.9$ bits
For a 300 residue protein sequence and the NCBI nr database, $\log_2(4{,}100{,}000{,}000 \times 300) = 40.1$ bits
For a 1000 base long DNA sequence and the NCBI nr database, $\log_2(33{,}300{,}000{,}000 \times 1000) = 44.9$ bits
DNA gives about 2 bits per base; protein gives about 4 bits per residue
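The bit requirements quoted on this slide follow from the size of the search space alone. A small sketch, assuming the first term of the rearranged Karlin-Altschul equation is ignored as the slides suggest (the function names are hypothetical):

```python
import math

def search_space_bits(query_len, target_len):
    """log2 of the search space N = query length x target (or database) length."""
    return math.log2(query_len * target_len)

def min_matching_length(bits_needed, bits_per_position):
    """Rough number of matching positions needed to reach a bit score."""
    return math.ceil(bits_needed / bits_per_position)

pairwise = search_space_bits(250, 250)            # about 15.9 bits
protein_vs_nr = search_space_bits(300, 4_100_000_000)
dna_vs_nr = search_space_bits(1000, 33_300_000_000)
```

At roughly 2 bits per base, a DNA match needs on the order of 20 or more well-matched bases to stand out in a database of that size, while a protein match at roughly 4 bits per residue needs about half as many positions.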

24 Sequence Comparison Database Searching Big problem is database size. Bigger database means longer search; alignments are $O(n^2)$. Bigger database means worse signal to noise ratio. Sequence data doubles every 18 months.

25 Sequence Comparison Database Searching BLAST BLAST (or FASTA) is used to identify the same gene in new genomic sequence How good does the score need to be, rules of thumb E < very good match E < strong match, probably part of family E < distantly related E > $10^{-5}$ hmmm E > 1?

26 Database Searching BLAST Output has three main parts diagram of matches list of top scores alignments check database length of query

27 Sequence Comparison Database Searching - BLAST

28 Sequence Comparison Database Search BLAST Top scores unigene links assembled EST gene links gene model chromosome region

29 Sequence Comparison Database Search BLAST Alignments

30 Sequence Comparison Database Search BLAST Other alignment output

31 Sequence Comparison BLAST How good an E value do you need? E = number of sequences with score >= observed in random database E = 1/number of searches to find a score >= observed E = size of db * P(Score >= observed) E < very good match E < strong match, probably part of family E < distantly related E > $10^{-5}$ hmmm E > 1?

32 Sequence Comparison Database Search BLAST E=0 nearly identical (92%) duplicated gene, alleles, highly conserved family chance that differences are errors

33 Sequence Comparison Database Search BLAST E=e-77 same sub-family probably has nearly identical function continuous high similarity

34 Sequence Comparison Database Search BLAST E=e-25 functionally similar same family/group

35 Sequence Comparison Database search BLAST E=e-09 same superfamily active site and other features clearly visible

36 Sequence Comparison Database search BLAST E=9 In this case clearly a distant homolog, why is the score so low? Possibly bad gene model interrupted gene transposon assembly error pseudogene

37 Sequence Comparison Database Searching Orthologs Orthologs are generally assumed to be functionally identical (thus all the fuss) Usually defined for genomic analysis as mutual best hits in BLAST search a is best hit for b, b is best hit for a If the automatic gene prediction misses genes this is not reliable Stochastic gain (by duplication) and loss (by deletion) of genes makes identifying orthology problematic Does it really make sense for large homologous families, especially in the presence of gene duplication?
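The mutual best hit definition can be written down in a few lines. The sketch assumes the best BLAST hit for each gene has already been extracted into two dictionaries, one per search direction; the gene identifiers are made up for the example.

```python
def reciprocal_best_hits(best_a_to_b, best_b_to_a):
    """Pairs (a, b) where b is the best hit for a and a is the best hit for b."""
    return sorted(
        (a, b)
        for a, b in best_a_to_b.items()
        if best_b_to_a.get(b) == a
    )

# Hypothetical best hits: a2 and a3 both hit b2, but b2's best hit is a3.
best_a_to_b = {"a1": "b1", "a2": "b2", "a3": "b2"}
best_b_to_a = {"b1": "a1", "b2": "a3"}
orthologs = reciprocal_best_hits(best_a_to_b, best_b_to_a)
```

Here `a2` is left without an ortholog even though it has a strong hit, which illustrates how duplicated genes and missed gene predictions make this criterion unreliable.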

38 Sequence Comparison Known genes and proteins Because we now know many genomes and many proteins, use these as queries to identify genes in new sequences. cDNA (EST) sequences can be used in the same way. Problems: Organism may have novel genes. Evolutionary divergence may make it difficult to find homologs. Gene databases may contain bad gene models or other "wrong" sequences. Database annotation may be incorrect (database pollution), or disagree between hits. EST collection may be incomplete. Sequencing and assembly errors make matches hard to detect. Genome may contain pseudogenes.

39 Sequence Comparison Known genes and proteins BLAST (nucleotide) may not show much

40 Sequence Comparison Known Genes/Proteins BLASTX DNA query (translated) protein database Finds matches to known proteins and gene models May miss alternative exons hypothetical protein KDEL receptor A ERD2 (ER lumen protein retaining receptor 2)

41 Sequence Comparison Known Genes/Proteins TBLASTX DNA query translated DNA database translated Finds matches to genes that may have been missed in annotation of database genomes May find missing or alternative exons TBLASTX BLASTX

42 Sequence Comparison Known Genes/Proteins TBLASTX generally works best at an intermediate distance Too close entire sequence matches Too far nothing matches Just right exons only match TBLASTX vs arabidopsis P. patens

43 Sequence Comparison Maize genome many short matches all from various BACs from a single species, surprisingly no complete match

44 Sequence Comparison Maize genome Same as before, but excluding Zea mays what you see for a novel genome

45 Sequence Comparison Maize genome BlastX against Swissprot database. Note that search against NR fails due to CPU time. Three clear hits for nonretrotransposon proteins: Zn-finger, 51934,53415;17593,18117, match is to human, both matches are fragments; Pentatricopeptide (PPR), matches many A. thaliana; Receptor kinase, (rev) similar to AT2G26730, AT5G58300, AT3G08680. [Figure labels: retrotransposons (R), mito. protein, Zn-finger MYM, PPR protein, LRR RK like]

46 Sequence Comparison Maize genome

47 Sequence Comparison Maize genome

48 Sequence Comparison Maize genome arabinokinase of 1039 gamma-aminobutyrate transaminase 3, mitochondrial Zn-finger of 1142 R SUMO of 109

49 Sequence Comparison Maize genome (66-90 kb) Protein Phosphatase 2C residues of 353 Transcription Factor/WD repeat/notchless transcriptional corepressor LEUNIG

50 Sequence Comparison Maize genome (90-96Kb) Calcium dependent protein kinase (reverse strand) of hit sequence contains active site CDPK Copia polymerase

51 Sequence Comparison Maize genome (124-137 Kb)

52 Sequence Comparison Maize genome (25-35 Kb) Transposon region (all hits are retrotransposons)

53 Gene Modeling Intrinsic methods Rely on composition of sequence itself The requirement of preserving a coding sequence places strong constraints on the DNA Coding sequence has preferred amino acid residues Amino acid residues have preferred codons (in general the first two positions are fixed) Coding constraints affect all six possible reading frames Observation: genes have higher GC content

54 Gene Modeling GC Content

55 Gene Modeling GC Content Is high GC content of genes due to gene conversion? genes undergoing concerted evolution in mammals have high GC content, flanking regions do not ribosomal operons transfer RNAs histones GC content is high in recently translocated genes into mouse PAR gene, GC increased from 50 to 73% in <1 million years, strongly suggests that recombination is the cause, not the consequence, of a high GC content. GC content is high in regions with high recombination Bird microchromosomes have a high recombination rate, at least one recombination event per generation. very high GC content, probably even higher than mammalian GC-rich regions Genomic analysis of drosophila, mice, yeast and others shows higher GC content in high recombination regions significantly greater linkage disequilibrium (i.e., presumably less recombination) in GC-poor regions Unequal base excision repair

56 Gene Modeling Intrinsic methods Protein coding enforces several constraints on the underlying DNA sequence Amino acid residues are used unequally in proteins Amino acid residues have unequal numbers of codons Codons are used unequally in synonymous families Methods Usually measured using a sliding window approach. Window depends on the method but is often (can be as big as 1000) Differences are small and you have to average over a large window to get a relatively clear signal. Small exons are therefore hard to find
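A sliding window measure can be sketched with GC content, the simplest of the compositional signals. The window and step sizes below are arbitrary choices for the example, and the function name is an assumption.

```python
def gc_windows(seq, window, step):
    """Return (start, gc_fraction) for each window along the sequence."""
    seq = seq.upper()
    results = []
    for start in range(0, len(seq) - window + 1, step):
        chunk = seq[start:start + window]
        gc = (chunk.count("G") + chunk.count("C")) / window
        results.append((start, gc))
    return results

profile = gc_windows("GGCCAATT", window=4, step=4)
```

With a realistic window of several hundred bases, an exon shorter than the window contributes only a fraction of the averaged signal, which is why small exons are hard to find this way.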

57 Gene Modeling Intrinsic methods ORF length

58 Gene Modeling Intrinsic Methods Some methods N-word counts, most commonly hexamer Global patterns (GC content, CpG islands) Open reading frame (genes have longer ORF) Other base asymmetry measures
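Of the methods listed, open reading frame length is the easiest to compute. The sketch below scans all six reading frames for the longest stretch from an ATG to the next in-frame stop codon. It is a simplified illustration: it ignores introns, assumes the standard stop codons, and the function name is hypothetical.

```python
STOPS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def longest_orf(seq):
    """Length in nucleotides (stop codon included) of the longest ORF in six frames."""
    seq = seq.upper()
    reverse_complement = seq.translate(COMPLEMENT)[::-1]
    best = 0
    for strand in (seq, reverse_complement):
        for frame in range(3):
            start = None
            for i in range(frame, len(strand) - 2, 3):
                codon = strand[i:i + 3]
                if start is None and codon == "ATG":
                    start = i                      # first ATG opens the ORF
                elif start is not None and codon in STOPS:
                    best = max(best, i + 3 - start)
                    start = None                   # look for the next ORF
    return best
```

In random sequence a stop codon is expected about once every 21 codons, so ORFs much longer than that are candidates for coding sequence.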

59 Gene Modeling
Gene Modeling
 Intrinsic
  Search by Content: GC content, CpG, Hexamer, HMM
  Search by Signal: Promoters, Splice Sites, Poly A sites
 Extrinsic
  Sequence Matching: EST, cDNA, Protein, Genome

60 Gene Modeling Markov Models The most accurate intrinsic methods are based on some version of a Markov model (Markov chain). Markov models capture the dependence of bases on other nearby bases. Idea: make probabilistic models of exons (three forward and three backward frames; also initial, internal, and terminal exons), introns, UTRs, and intergenic regions. Test genomic sequence to see which model it fits best (has highest probability).

61 Gene Modeling Markov Models Probabilistic models that describe transition from one state to another. Usually framed as conditional probabilities. A model of a random DNA: 40% AT, 60% GC. Think of a model as generating a sequence: when I am in a certain state, I write out 1 letter according to the state. Then, according to the probabilities, I move to a new state. In this case, I move to an A or T 40% of the time and to a G or C 60% of the time. The probabilities are the same for all bases, so this is a zeroth order model. [Figure: state diagram with transitions P(A) = 0.2, P(G) = 0.3, P(C) = 0.3, P(T) = 0.2 from every base]

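62 Gene Modeling Markov Models A simpler model: flipping a coin
A fair coin: P(H) = 0.5, P(T) = 0.5; example output THHTHTHTTH
An unfair coin: P(H) = 0.8, P(T) = 0.2; example output HHTHHHHTHT
[Figure: each coin drawn either as a two-state H/T diagram or as a single state with emission probabilities and a self-transition of P = 1.0]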

63 Gene Modeling Markov Models For a given sequence of Hs and Ts I can predict which coin I'm using, say HHTH
$P(HHTH \mid \text{fair}) = P(H) P(H) P(T) P(H) = 0.5 \times 0.5 \times 0.5 \times 0.5 = 0.0625$
$P(HHTH \mid \text{unfair}) = P(H) P(H) P(T) P(H) = 0.8 \times 0.8 \times 0.2 \times 0.8 = 0.1024$
This works if I know I'm always using a single coin. What if I don't know the coin, or the coin could be switched?
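The single-coin comparison is just a product of per-flip probabilities. A minimal sketch (the function name is an assumption; the probabilities are the fair and unfair coins from the slides):

```python
from math import prod

def sequence_likelihood(flips, p_heads):
    """Probability of a string of 'H'/'T' flips from one coin with P(H) = p_heads."""
    return prod(p_heads if f == "H" else 1 - p_heads for f in flips)

fair = sequence_likelihood("HHTH", 0.5)      # 0.5 ** 4
unfair = sequence_likelihood("HHTH", 0.8)    # 0.8 * 0.8 * 0.2 * 0.8
likelihood_ratio = unfair / fair
```

The ratio is above 1, so HHTH is more likely to have come from the unfair coin, although four flips are weak evidence.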

64 Gene Modeling Markov Models What if I am flipping both coins, and randomly switching between them? I generate some sequence of Hs and Ts and I want to know which coin I was using. If I don't know the state, it is a Hidden Markov Model (HMM).
[Figure: two-state model. Fair state (H: 0.5, T: 0.5): stay fair with P = 0.6, switch to unfair with P = 0.4. Unfair state (H: 0.8, T: 0.2): stay unfair with P = 0.8, switch to fair with P = 0.2]
HTHHHTHHTTHTHTHTHTHTHHHTHTH
UUUUUUUUFFFFUUFFFFFFUUUUUFF (maybe)

65 Gene Modeling Markov Models Discovering the state in an HMM
For HTHHTHTTHHH I first need to know the probability that I am in state fair (F) or unfair (U) given the first H
By Bayes' rule, $P(F \mid H) = P(H \mid F) P(F) / P(H)$
 $P(H \mid F)$ = prob. of a head in the fair state = 0.5
 $P(F)$ = prob. of the fair state = 0.33 (prior probability of fair state)
 $P(H)$ = marginal probability of H = average fraction of heads
$P(U \mid H) = P(H \mid U) P(U) / P(H) = 0.761$
$P(F \mid H) = 1 - P(U \mid H) = 0.239$

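66 Gene Modeling Markov Models HTHHTHTTHHH The next flip is T. The prior probabilities of F and U are now the conditional probabilities based on the first H, times the transition probabilities:
$P(F) = P(F \mid H) P(F \to F) + P(U \mid H) P(U \to F) = 0.239 \times 0.6 + 0.761 \times 0.2 = 0.296$
$P(U) = P(U \mid H) P(U \to U) + P(F \mid H) P(F \to U) = 0.761 \times 0.8 + 0.239 \times 0.4 = 0.704$
$P(T) = P(T \mid U) P(U) + P(T \mid F) P(F) = 0.2 \times 0.704 + 0.5 \times 0.296 = 0.289$
$P(F \mid HT) = P(T \mid F) P(F) / P(T) = 0.5 \times 0.296 / 0.289 = 0.512$
$P(U \mid HT) = P(T \mid U) P(U) / P(T) = 0.2 \times 0.704 / 0.289 = 0.487$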

67 Gene Modeling Markov Models Continuing for each flip, I calculate the probabilities of the two states (fair and unfair) after each flip of H T H H T H T T H H H. [Table: probability of the fair and unfair states at each flip]
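The flip-by-flip update can be sketched as a loop that alternates a Bayes step and a transition step. The emission and transition probabilities below are the ones used in the coin example (fair stays fair with 0.6, unfair stays unfair with 0.8), and the prior of 1/3 for the fair state is assumed to be the long-run fraction of time spent in that state. The names are illustrative, and this is the filtering (forward) calculation only, not a full decoding algorithm such as Viterbi.

```python
EMIT = {"F": {"H": 0.5, "T": 0.5}, "U": {"H": 0.8, "T": 0.2}}
TRANS = {"F": {"F": 0.6, "U": 0.4}, "U": {"F": 0.2, "U": 0.8}}
PRIOR = {"F": 1 / 3, "U": 2 / 3}

def filter_states(flips):
    """Posterior P(state | flips so far) after each flip."""
    posteriors = []
    prior = dict(PRIOR)
    for flip in flips:
        # Bayes' rule: weight each state's prior by how well it explains the flip
        joint = {s: EMIT[s][flip] * prior[s] for s in prior}
        total = sum(joint.values())          # marginal probability of the flip
        post = {s: joint[s] / total for s in joint}
        posteriors.append(post)
        # Transition step: the posterior becomes the prior for the next flip
        prior = {t: sum(post[s] * TRANS[s][t] for s in post) for t in TRANS}
    return posteriors

posteriors = filter_states("HTHHTHTTHHH")
```

After the first H the unfair state has probability of about 0.76, and after the following T the two states are close to even, in line with the hand calculation.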

68 Gene Modeling Markov Models What does this have to do with predicting genes? Simple model State 1: intron State 2: exon Describe each state by its probability of each of the four bases. Transition probabilities between states can be obtained from average intron and exon lengths

69 Gene Modeling Markov Models - More Realistic Usually use a 5th-order model, i.e., a hexamer model: for base $b_p$ at position p, $P(b_p \mid b_{p-5} \ldots b_{p-1}, \mathrm{state})$. States: intergenic region, first exon, intron, internal exon, terminal exon, sites
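A higher-order model of this kind is estimated by counting how often each base follows each context in training sequence for one state. The sketch below is generic in the order (5 gives the hexamer model); the pseudocount, the function names, and the toy training data are assumptions added for illustration.

```python
from collections import defaultdict
import math

def train_markov(seqs, order, pseudocount=1.0):
    """Return prob(context, base) estimated from the training sequences."""
    counts = defaultdict(lambda: defaultdict(float))
    for s in seqs:
        for i in range(order, len(s)):
            counts[s[i - order:i]][s[i]] += 1

    def prob(context, base):
        seen = counts.get(context, {})
        total = sum(seen.values()) + 4 * pseudocount   # 4 possible bases
        return (seen.get(base, 0.0) + pseudocount) / total

    return prob

def log_likelihood(seq, prob, order):
    """Log probability of seq under the model, skipping the first `order` bases."""
    return sum(
        math.log(prob(seq[i - order:i], seq[i])) for i in range(order, len(seq))
    )

# Toy first-order example; a gene finder would train one model per state.
prob = train_markov(["ACACACAC"], order=1)
```

To classify a window, one would compute `log_likelihood` under the exon, intron, and intergenic models and compare the results.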

70 Gene Modeling Markov Models Typical States Modern HMM gene predictors typically also include intron length, exon length, gene length (number of exons), and intron/exon boundaries

71 Gene Modeling Markov Models Model must be trained: it needs known data to learn base frequencies, lengths, etc. For a new genome, this information is usually not available (or you can't get the group with the software to run the training for you). Options: Use as close a genome as you can. Use several if several are close. GC content is a big factor. Site recognition is better with closer species. Use a self-training system (e.g., GeneMark-ES)

72 Gene Modeling GeneMark-ES Self-training gene predictor, HMM based
Initialize with
 canonical splice sites (2 bases only)
 uniform length distribution
 non-exon models are 0th order
 exon model, one of
  2nd order Markov chain with heuristically defined parameters
  fifth-order Markov chain from ORFs > 1000 nt
  0th order Markov model with GC content elevated by 8% over genome GC content
First train only emission parameters for coding/non-coding states, then add in sites, then add lengths

73 Gene Modeling GeneMark self training

74 Gene Modeling Genemark-ES

75 Gene Modeling Selected Gene Prediction Programs
Program | Type | Focus
FGENESH | HMM | animals, plants
Genemark | HMM | prokaryotes, plants, human, mouse, worm
Glimmer / GlimmerM | IMM | prokaryotes / eukaryotes
GENSCAN | GHMM | vertebrates, Arabidopsis, maize
Augustus | GHMM + extrinsic | eukaryotes
PASA GeneMark HMM, FGENESH, Augustus and SNAP GlimmerHMM
Maker | extrinsic (RepeatMasker, BLAST, SNAP, Exonerate) | eukaryotes

76 Gene Modeling Gene Prediction Programs Best varies with genome, difficult to predict Best varies with individual genes, difficult to predict

77 Gene Modeling Gene Modeling in Maize 5 programs: FGENESH, GeneMark, GENSCAN, GlimmerR, Grail. Test data: hand annotated using FL cDNA; 114,173 ISU MAGIs (maize assembled genomic islands); 592 reliable exons; 1946 reliable introns. H Yao, L Guo, Y Fu, LA Borsuk, T-J Wen, DS Skibbe, X Cui, BE Scheffler, J Cao, SJ Emrich, DA Ashlock and PS Schnable. Evaluation of five ab initio gene prediction programs for the discovery of maize genes, Plant Molec. Biol. 57, 2005

78 Gene Modeling Maize gene modeling Size and GC content of introns and exons

79 Gene Modeling Maize gene modeling Sensitivity = SN = TP / (TP + FN) Specificity = SP = TP / (TP + FP)
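The two accuracy measures are simple ratios of counts. Note that specificity as defined on this slide, TP / (TP + FP), is the quantity called precision in many other fields. The counts in the example are made up and are not results from the maize study.

```python
def sensitivity(tp, fn):
    """Fraction of real features (e.g., exons) that were predicted: TP / (TP + FN)."""
    return tp / (tp + fn)

def specificity(tp, fp):
    """Fraction of predictions that are real: TP / (TP + FP)."""
    return tp / (tp + fp)

# Hypothetical counts: 80 exons found, 20 missed, 10 wrong predictions
sn = sensitivity(tp=80, fn=20)
sp = specificity(tp=80, fp=10)
```

A program can trade one measure for the other, for example by predicting more exons to raise sensitivity at the cost of specificity, so the two are best reported together.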

80 Gene Modeling Maize gene modeling

81 Gene Modeling Maize gene modeling PE = partial exon (one boundary correct) OE = Overlapped exon (no boundaries correct) ME = missing exon WE = wrong exon (extra exon)

82 Gene Modeling Maker

83 Gene Modeling Maize gene modeling

84 Gene Modeling Maize gene modeling

85 Gene Modeling Promoter finding (plant genomes) GC and CpG. Rombauts et al. Plant Phys. 132 (2003)