Designing Filters for Fast Protein and RNA Annotation. Yanni Sun Dept. of Computer Science and Engineering Advisor: Jeremy Buhler

Designing Filters for Fast Protein and RNA Annotation Yanni Sun Dept. of Computer Science and Engineering Advisor: Jeremy Buhler 1

Outline Background on sequence annotation Protein annotation acceleration Filter design for sequence-model comparison Filter design for model-model comparison Noncoding RNA annotation acceleration Research summary 2

Post-Genomic Era Sequencing a bacterial genome within 1 day and a human genome in several months [Wadman 08] 3

Urgent Task: Sequence Annotation Find biologically meaningful regions part of the human chromosome 14 GGGG TCCA...ATG... trna a type of noncoding RNA transcription mrna translation protein protein domains 4

Sequence Annotation Method Comparative sequence analysis Using sequence similarity species 1..Y T C S Y C G K S F T Q S.. species 2..Y T C P Y C D K R F T Q R.. species 3..Y L C Y Y C G K T L S D R.. Part of protein family zf-c2h2 Compare a newly obtained sequence to a database of annotated sequence families 5

Sequence Annotation is Expensive! residues (millions) Protein family DB Pfam-A+Pfam-B: ~30,000 Requires CPU months to find noncoding RNA signals in a bacterial genome Requires CPU days to classify all the proteins generated by a genome Metagenomic data sets microbial genomes from human gut, Sea water, etc. Typical size: 10 9 bases [Chen & Pachter 05] 6

Preview of My Work Design filtration algorithms for sequence-profile HMM (phmm) comparison for protein annotation ~20-35 times speedup on Pfam vs. Swiss-Prot protein database search [Sun,Buhler] ECBB 06 (Bioinformatics 07), [Sun,Buhler] IEEE/ACM TCBB 08 PHMM-pHMM comparison for protein annotation ~21-31 times speedup on Pfam vs. Pfam search Sequence-Stochastic Context Free Grammar comparison for noncoding RNA annotation ~200 times speedup on Rfam vs. 65-Mbase genomic sequence database search [Sun,Buhler] CSB 08 7

Outline Background on sequence annotation Protein annotation acceleration Background Filter design for sequence-phmm comparison Filter design for phmm-phmm comparison Noncoding RNA annotation acceleration Research summary 8

Protein (Domain) Family Sequences evolving from the same ancestor & having similar function Multiple sequence alignment Identify conserved residues seq1 A AA P P Q seq2 - A - P P Q Ins Ins Ins Ins seq3 G Y - P P A seq4 A Y - P P P seq5 G Y - P P P seq6 G Y - P P Q GYPPQ G Y P P Q seq7 A Y - A A Q XYPPX repeat S B Del Del Del l A:0.01 G:0.02 V:0.01 D:0.02 A:0.02 G:0.01 P:0.89 Profile HMM (phmm): compute the generation probability for a sequence: [Rabiner 89],[Durbin et al. 98] High generation probability member sequence E 9

Annotation Case Study Human gut microbiome initiative (HGMI) 100 bacterial genomes: http://www.genome.wustl.edu/hgm Protein annotation: compare 5*10 5 hypothetical proteins with Pfam (30,000) Needs CPU-days New protein families from HGMI Compare to NCBI nr protein database CPU-hours per family 10

Acceleration Strategy: Filtration Design filter for each protein family Does a sequence match the designed filter? Yes full comparison with the phmm No ignore it F(amily) protein database Member of F not member of F filter match 11

Filter Design Problem Sensitivity: # of filter matched members # of members False positive rate (FP rate): # of filter matched background sequences # of background sequences Significantly affects the speedup Goal: design a filter with maximum sensitivity while keeping its fp rate a given threshold 12

Outline Background on sequence annotation Protein annotation acceleration Background Filter design for sequence-phmm alignment Profile-based filters Pattern-based filters Filter design for phmm-phmm alignment Noncoding RNA annotation acceleration Research interests and future work 13

Profile and Score Function [Gribskov, 87] seq 1: seq 2:.. seq i: 1 2 3 4 5 6 7 8 1 2 3 4 5 6 7 8.................. e.g. score ( AEEMRIGC,Profile) =? -2+(-3)+(-3)+(-4)+(-5)+(-3)+10+(-6) A -2-2 0-2 -4-2 -4-4 C 10-1 0-2 -4-3 -3-6 D -1 1-3 10-5 -1-2 -6 E -1-3 -3-4 -6 0-1 -4 F -1-5 0 8-5 -3-2 -1 G 0-4 10-4 -4 10 10-2 H -1-7 -6-7 -3-7 -7-5 I -1 0-2 -1-6 -3 0-2 K -2-3 -3-4 -4-4 -5-4 L -5-5 -2-4 -7-4 -1-5 M -2-4 -2-4 -4-3 -3-3 N -3 5-4 -5-6 -5-6 -4 P -3-6 -4-5 -6-5 -6 9 Q -2 4-6 -5-6 -5-5 -6 R -4-6 -7-7 -5-5 -6-7 S -1-6 0-6 10-4 -6-6 T 0-4 -5-5 -2-2 -5-5 V 0-1 -4-2 -5-2 0 0 W -1-1 -1-3 -4-1 -1-4 Y -1-2 -1-3 -2-1 -1-2 14

Profile Design Based on PHMM Profile match: score(s,profile) T Sensitivity: Pr(score(s,Profile) T s ~ phmm M) False positive rate (FP rate): Pr(score(s,Profile) T s ~ background model M 0 ) residue distribution in a random sequence profile HMM Central design problem: design P and T from phmm, s.t. P has maximum sensitivity while keeping its fp rate a given threshold r 15

Profile Design Framework profile extraction from phmm choose top profiles in order of sensitivity compute score threshold T to achieve given FP rate r compute profile s sensitivity given T 16

Evaluate a Profile P Given a FP rate threshold r, find score threshold T s.t. Pr(score(s,Profile) T s ~ background model M 0 ) r Evaluate P s sensitivity under T: Pr(score(s,Profile) T s ~ M) We have efficient algorithm to compute T and sensitivity of P given r Score probability score 17

Design PROSITE-like Patterns for Highly Conserved Protein Families Combinatorial problem: H-x(3)-[HC] Pattern length not fixed a priori Large design space: ~350 columns on average, ~2 20 sets of residues at each column Map pattern design into a knapsack problem Design an approximation algorithm Prove a bound: sensitivity( P) ((1 ε ) ε sensitivity( P m: # of elements in P*, :approximation ratio m * ) )) (1 + ε 19

Data and Experiments Design profiles and patterns for ~1000 phmms randomly chosen from Pfam-A Sensitivity: fraction of true member sequences containing a match to the filter Speedup T T original filtered search search time of searching for phmm matches in Swiss- Prot DB time of searching for phmm matches in Swiss- Prot DB after using a filter 20

Comparison of Patterns vs. Profiles Profiles have better sensitivity on average Patterns have better speedup for well-conserved family # of protein families 1000 900 800 700 600 500 400 300 200 profile pattern # of protein families 400 350 300 250 200 150 100 profile pattern 100 0 0.15 0.25 0.35 0.45 0.55 0.65 Sensitivity 0.75 0.85 0.95 50 0 1 2 4 8 16 32 64 128 256 512 Speedup Sensitivity comparison of patterns and profiles for 955 protein families Speedup comparison of patterns and profiles for 630 protein families (sensitivity>=0.9) 21

Outline Background on sequence annotation Protein annotation acceleration Background Filter design for sequence-phmm alignment Filter design for phmm-phmm alignment Noncoding RNA annotation acceleration Research interests and future work 22

PHMM-PHMM Alignment Identify remotely related sequences >query sequence (245 residues) methyltransferase enzyme.lryypgspelarrltraqd... PSI-Blast VKKYPGSP LRYYPGSP... -SHYPGSP GDLYPGSP build phmm Application: ~2000 phmms in Pfam are grouped into 283 clans [Finn 06] using phmm-phmm alignment 81: putative structure and function 23

PHMM-PHMM Alignment Identify remotely related sequences >query sequence (245 residues). VKKYPGSPLLARHLLR PSI-Blast VKKYPGSP LRYYPGSP... build phmm -SHYPGSP GDLYPGSP Application: ~2000 phmms in Pfam are grouped into 283 clans [Finn 06] using phmm-phmm alignment 81: putative structure and function 24

PHMM-PHMM Alignment: PRC [Madera 05] Similarity( M1, M 2) phmm 1 phmm 2 = s Σ * (Pr M ( s) Pr ( )) 1 M s 2 Need scoring function between states: dot product between their emission probability vectors PRC is ~7 times slower than sequence-phmm alignment Filtration: represent each state by a symbol in a new alphabet PHMM alignment sequence alignment 25

Degenerate Alphabet-based Filter A filter P contains 3 elements: 1. a degenerate alphabet Δ 2. a function to map the residue emission probability vector in a state into a symbol in Δ C PK. C C. -1082 3898 C C C. 3. a scoring function between two symbols in Δ 26

Degenerate Alphabets for DNA IUPAC (2 4-1=15 elements); PHYLONET [Wang&Stormo 05]; Regulatory potential score [Kolbe et al. 03] (1.0, 0.0, 0.0, 0.0):A (0.7, 0.1, 0.1, 0,1):a (0.0, 0.0, 1.0, 0.0):G (0.4, 0.1, 0.4, 0.1):R. DNA protein: # of residue combinations = 2 20-1 27

Preserve Well-Conserved States 28

Keep at most 3 Residues For a probability vector e 1) Sort e in decreasing order of emission probs. 2) Choose least number of residues with sum probability a threshold (e.g. 0.5) e.g. (P:0.89, A:0.02, ) P; (G:0.32, A:0.30, ) AG 3) If more than 3 residues are needed, output G:0.32 A:0.30 P:0.89 Alphabet size 20 Δ = C + C 20 20 20 0 1 + C2 + C3 = 1531 AG * P P * 29

Experimental Results Test set: 1388 phmm pairs output by PRC with E- value 0.001 Sensitivity: percentage of test set matched by our filters FP rate: # of phmm pairs output by our filters but not PRC 1 0.94 phmm-consensus alignment Sensitivity 0.92 0.9 0.9 0.88 0.8 0.86 0.84 0.7 degenerate alphabet degenerate alphabet 00 0.005 0.005 0.01 0.01 0.015 0.015 0.02 0.025 0.03 0.025 FP rate profile-consensus alignment Sensitivity = 0.938; Speedup = 21 30

PHMM-PHMM Alignment: Summary and Future Work Design the first degenerate alphabet for protein sequence comparison Can be extended to other alignment tools besides PRC Future work Ungapped gapped alignment BLASTP-like alignment rather than full DP 31

Outline Motivation Protein function annotation acceleration Noncoding RNA annotation acceleration Research interests and future work 32

Noncoding RNA (ncrna) Fold into secondary structure Base pair interaction: A-U,U-A,C-G,G-C AUCCGAAAGGAU start end AUCCGAAAGGAU start end t-rna 33

Modeling an ncrna Family Describe both the sequence and secondary structure features shared by a group of ncrna sequences HMM? Cannot describe secondary structure Context-free grammar 0.75 0.25 S AS 1 U US 1 A; S 1 CS 2 G GS 2 C AS 2 U; S 2 GS 3 C; S 3 AS 4 ; S 4 GS 5. Stochastic context-free grammar (SCFG) [Eddy&Durbin 94] Probability that a sequence s is generated by a SCFG M: CYK algorithm O( s 3 M ) 34

NcRNA Annotation-Case Study Building full alignments in Rfam (500+) Use BLAST to find initial set of family members [Eddy 02] Human gut microbiome initiative (HGMI) 100 bacterial genomes: http://www.genome.wustl.edu/hgm Compare 100 genomes to Rfam (500+): CPU years 35

Secondary Structure Profile (SSP) Two components: seeds & profile Use multiple seeds to accommodate insertion and deletion Compute length distribution inside a base pair 36

Data and Experiments NcRNA family database Rfam Over 500 ncrna families Genomic database: 65 Mbase, containing human, mouse, bacterial genome, etc. Design SSP-based filter for each ncrna family Evaluate SSP s sensitivity and speedup 37

Filter Comparison on trna 1 SSP 0.98 Sensitivity 0.96 0.94 0.92 0.9 0.88 0.86 0.84 Singleseed SSP (does not consider 0 gaps) 0.02 0.04 0.06 0.08 regular profile (does not consider base pairs) FP rate Real application: find trnas in E.Coli Infernal: 8.23 hours Using SSPs: sensitivity = 1.0; time = 3.3 minutes 38

Choose between SSP and Regular Profile Many ncrna families have strong sequence conservation Regular profiles have better speedups Design regular profiles for all ncrna families; If (the theoretical sensitivity < a pre-determined threshold) Design SSP for this ncrna family; 13 out of 233 ncrna families choose SSP 39

Histogram of Sensitivity and Speedup of SSP # of ncrna families 250 200 150 100 50 # of ncrna families 40 35 30 25 20 15 10 0 0.55 0.6 0.65 0.7 0.75 0.8 0.85 0.9 0.95 1 Sensitivity 5 0 50 100 200 300 400 500 1000 Speedup Sensitivity for 233 ncrna families 84%: sensitivity 0.99 98.7%: sensitivity 0.9 Speedup for 88 random ncrna families on 1M synthetic data set average T cmsearch = 8701 seconds average speedup = 222x 40

Research Summary Similarity search in large scale database, pattern design, pattern match, sequence alignment [Buhler,Keich,and Sun] RECOMB 03 [Sun,Buhler] RECOMB 04 [Buhler, Keich, and Sun] Journal of Computing and Systems Science, 05 [Sun,Buhler] Journal of Computational Biology 05 [Sun,Buhler] BMC Bioinformatics 06 Protein sequence annotation, pattern and profile-based protein motif design [Sun,Buhler] ECBB 06 (Bioinformatics 07) [Sun,Buhler] IEEE/ACM Transactions on Computational Biology and Bioinformatics 08 RNA sequence annotation, RNA motif finding [Sun, Buhler] csb 08 41

Acknowledgments Dr. Jeremy Buhler (advisor) Dr. Michael Brent, Dr. Gary Stormo, Dr. Sally Goldman, Dr. Tao Ju, Dr. Nan Lin Joe, Octav, Rachel, Lu, and Hongtao Funding: NSF DBI-0237902 42