A New Algorithm for Protein-Protein Interaction Prediction
1 A New Algorithm for Protein-Protein Interaction Prediction. Yiwei Li, The University of Western Ontario. Supervisor: Dr. Lucian Ilie
2 Overview: 1 Background; 2 Previous approaches (experimental approaches, computational approaches); 3 A new algorithm; 4 Evaluation; 5 Conclusion
3 Section: Background
4 DNA
5 Transcription and translation
6 Protein
7 Protein-protein interaction (PPI)
9 Protein-protein interaction (PPI). Applications: predicting protein functions; development of novel drugs; drug targeting and side effects. Research: knowing PPIs is important; many PPIs are still unknown.
10 What we want to do
Input: protein sequences p_seq, known interactions k_int
Output: unknown interactions uk_int
PPI_predict(p_seq, k_int) {
    read and process p_seq and k_int;
    output uk_int;
}
11 FASTA format
>YPR161C
TTGACMTTGACMTTGGGRRGNGGGGTGACM
ACMTTGACTTGACMTGGGGGGQGGGACTTG
QSDAVMTTGACMTTGGGGTGGGDGAQSDAV
ACMTTGACMTTGACMGGMGGDGGRGACMTT
TTGACMTTGA
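A minimal sketch of reading FASTA input like the record on this slide. The function name `parse_fasta` and the choice of a dictionary keyed by identifier are assumptions made for illustration; the talk's own tool is written in C++ and its input code is not shown.

```python
from typing import Dict, Iterable, List


def parse_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Map each record identifier (first word after '>') to its sequence."""
    records: Dict[str, List[str]] = {}
    current = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            header = line[1:].split()
            current = header[0] if header else ""
            records[current] = []
        elif current is not None:
            records[current].append(line)  # sequence lines are concatenated
    return {name: "".join(parts) for name, parts in records.items()}


# The record from the slide, split into lines.
example = """>YPR161C
TTGACMTTGACMTTGGGRRGNGGGGTGACM
ACMTTGACTTGACMTGGGGGGQGGGACTTG
QSDAVMTTGACMTTGGGGTGGGDGAQSDAV
ACMTTGACMTTGACMGGMGGDGGRGACMTT
TTGACMTTGA
""".splitlines()

proteins = parse_fasta(example)
print(list(proteins), len(proteins["YPR161C"]))
```

Keeping the sequence lines in a list and joining once at the end avoids repeated string concatenation on long records.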
12 Section: Previous approaches
13 Section: Previous approaches, experimental approaches
16 Yeast two-hybrid (Y2H). Process: attach the DNA-binding domain (DBD) and the activation domain (AD) to the two proteins; detect the reporter.
20 Tandem affinity purification (TAP). Process: attach a tag; tandem wash; detect remaining proteins.
22 Drawbacks: time- and labour-consuming; the false positive and false negative rates are high.
23 Section: Previous approaches, computational approaches
24 Genomic methods. Predict: proteins 1 and 2 interact. Because: their genes are close together in genomes A, B and D.
25 Evolutionary relationship methods. Predict: proteins 1 and 4 interact. Because: they both appear in genomes A and D.
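The reasoning on this slide can be sketched as a comparison of presence profiles: two proteins are predicted to interact when they occur in exactly the same set of genomes. The genome contents below are hypothetical, chosen only so that proteins 1 and 4 share the profile {A, D} as on the slide; real methods of this kind typically tolerate partial profile agreement.

```python
from itertools import combinations

# Hypothetical presence data (not from the talk): genome -> proteins found in it.
genomes = {
    "A": {1, 2, 4},
    "B": {2, 3},
    "C": {3},
    "D": {1, 4},
}


def profile(protein):
    """Set of genomes in which the protein appears."""
    return frozenset(g for g, members in genomes.items() if protein in members)


all_proteins = sorted(set().union(*genomes.values()))

# Predict an interaction for every pair with an identical, non-empty profile.
predicted = [
    (p, q)
    for p, q in combinations(all_proteins, 2)
    if profile(p) and profile(p) == profile(q)
]
print(predicted)
```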
26 Protein structure methods. Predict: proteins 1 and 4 interact. Because: their docking parts fit best.
27 Domain methods. Known: proteins A and B interact via domains 1 and 2. Predict: proteins C and D interact. Because: C has a part similar to domain 1, and D has a part similar to domain 2.
28 Domain-domain interaction (DDI)
29 Network analysis methods. Predict: proteins A and C interact. Because: adding the edge A-C would form a clique.
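One way to read the clique argument is as follows: a missing edge is predicted when its two endpoints share neighbours that are all connected to each other, so that adding the edge closes a clique. The small network and the `min_shared` threshold are assumptions for illustration, not data or parameters from the talk.

```python
from itertools import combinations

# Hypothetical interaction network in which only the edge A-C is missing.
edges = {("A", "B"), ("A", "D"), ("B", "C"), ("B", "D"), ("C", "D")}

adj = {}
for u, v in edges:
    adj.setdefault(u, set()).add(v)
    adj.setdefault(v, set()).add(u)


def completes_clique(u, v, min_shared=2):
    """True if u and v share at least min_shared neighbours that are pairwise adjacent."""
    shared = adj[u] & adj[v]
    if len(shared) < min_shared:
        return False
    return all(b in adj[a] for a, b in combinations(sorted(shared), 2))


# Test every non-adjacent pair of proteins.
predicted = [
    (u, v)
    for u, v in combinations(sorted(adj), 2)
    if v not in adj[u] and completes_clique(u, v)
]
print(predicted)
```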
30 Primary protein structure methods: sequence-based, no evolutionary information required; machine learning.
31 Section: A new algorithm
32 A new algorithm. Goal: a more accurate and faster tool for PPI prediction. Implemented in: C++, OpenMP.
37 Idea: PPIs are vital. Since domains are responsible for interactions, domains are preserved during evolution. Preserved domains can therefore be found by finding sequence similarity.
38 Consecutive seed: hit and extend in BLAST
Figure: a seed hit between two sequences, extended (ext.) in both directions:
TTGACATGTCTCTAGTGAGTCTCTAT
TTACTATGTCTCTAGTGAGTCTCATG
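A simplified sketch of hit-and-extend with a consecutive seed of length k. It extends only through exact matches, whereas BLAST scores the extension and tolerates mismatches, so this is an illustration of the idea rather than BLAST's procedure. The two sequences are the ones that appear to be shown in the slide's figure; the function name and the seed length of 8 are assumptions.

```python
def hit_and_extend(s, t, k):
    """Return (start_s, start_t, length) of the longest exact match grown from a k-mer hit."""
    # Index every k-mer of t by its start positions.
    index = {}
    for j in range(len(t) - k + 1):
        index.setdefault(t[j:j + k], []).append(j)

    best = None
    for i in range(len(s) - k + 1):
        for j in index.get(s[i:i + k], []):
            # Extend to the left while the characters agree.
            lo_s, lo_t = i, j
            while lo_s > 0 and lo_t > 0 and s[lo_s - 1] == t[lo_t - 1]:
                lo_s -= 1
                lo_t -= 1
            # Extend to the right while the characters agree.
            hi_s, hi_t = i + k, j + k
            while hi_s < len(s) and hi_t < len(t) and s[hi_s] == t[hi_t]:
                hi_s += 1
                hi_t += 1
            if best is None or hi_s - lo_s > best[2]:
                best = (lo_s, lo_t, hi_s - lo_s)
    return best


s = "TTGACATGTCTCTAGTGAGTCTCTAT"
t = "TTACTATGTCTCTAGTGAGTCTCATG"
start_s, start_t, length = hit_and_extend(s, t, k=8)
print(start_s, start_t, s[start_s:start_s + length])
```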
41 Spaced seed. Consecutive seed: not sensitive enough. Spaced seed: 1*1*1**11 (1 = position must match, * = don't care)
A N T R A N T R A
A N T Q A B Q R A
1 * 1 * 1 * * 1 1
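The example on this slide can be checked with a short sketch: a spaced seed reports a hit when the two windows agree at every position marked 1, and positions marked * are ignored. The function names are illustrative. A consecutive seed of the same weight (11111) is included for comparison.

```python
def seed_matches(seed, a, b):
    """True if windows a and b agree at every '1' position of the seed."""
    return all(x == y for c, x, y in zip(seed, a, b) if c == "1")


def find_hits(seed, s, t):
    """All (i, j) such that the seed matches s at offset i and t at offset j."""
    span = len(seed)
    return [
        (i, j)
        for i in range(len(s) - span + 1)
        for j in range(len(t) - span + 1)
        if seed_matches(seed, s[i:i + span], t[j:j + span])
    ]


s = "ANTRANTRA"
t = "ANTQABQRA"

spaced_hits = find_hits("1*1*1**11", s, t)   # spaced seed from the slide
consecutive_hits = find_hits("11111", s, t)  # consecutive seed, same weight
print(spaced_hits, consecutive_hits)
```

On this pair the spaced seed finds the similarity while the consecutive seed of equal weight finds nothing, which is the sensitivity difference the slide points at.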
42 Consecutive seed vs spaced seed. Figure: a consecutive seed and a spaced seed compared on the same sequence. BLAST hits are more clustered than those of the spaced seed.
43 Multiple spaced seeds: 8 spaced seeds of weight 5, generated by SpEED (Ilie, 2011)
seed 1 = 11*111
seed 2 = 111**1*1
seed 3 = 11***1*11
seed 4 = 1*1***111
seed 5 = 1*1*1**11
seed 6 = 11*****1*11
seed 7 = 11*1**1***1
seed 8 = 11**1****11
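A sketch of how several spaced seeds might be used together: for each seed, the characters at its 1 positions form a key, one sequence is indexed by key, and the other is looked up. The eight seeds are those listed on the slide; the indexing scheme and function names are assumptions, since the talk does not describe its encoding in detail.

```python
SEEDS = [
    "11*111", "111**1*1", "11***1*11", "1*1***111",
    "1*1*1**11", "11*****1*11", "11*1**1***1", "11**1****11",
]


def seed_key(seed, s, i):
    """Characters of s at the '1' positions of the seed placed at offset i."""
    return "".join(s[i + k] for k, c in enumerate(seed) if c == "1")


def multi_seed_hits(seeds, s, t):
    """Sorted (seed_number, i, j) triples where a seed matches s at i and t at j."""
    hits = set()
    for n, seed in enumerate(seeds, start=1):
        span = len(seed)
        index = {}
        for j in range(len(t) - span + 1):
            index.setdefault(seed_key(seed, t, j), []).append(j)
        for i in range(len(s) - span + 1):
            for j in index.get(seed_key(seed, s, i), []):
                hits.add((n, i, j))
    return sorted(hits)


weights = [seed.count("1") for seed in SEEDS]
hits = multi_seed_hits(SEEDS, "ANTRANTRA", "ANTQABQRA")
print(weights, hits)
```

Seeds longer than either sequence simply contribute no windows, so short inputs need no special handling.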
48 Steps
Encoding seeds and proteins
Searching domains as similarities
Scoring domain pairs: $S(D_i, D_j) = \log_2\left(\frac{d_{ij}}{d_i d_j}\right)$
Scoring protein pairs: $S(p, q) = \max\{S(D_i, D_j) \mid D_i \text{ intersects } p,\ D_j \text{ intersects } q\}$
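A sketch of the two scoring steps under stated assumptions. The domain-pair formula is read here as the logarithm of a ratio, d_ij over the product d_i d_j; the transcript does not define these quantities, so the code treats d_ij as the frequency of the domain pair among known interactions and d_i, d_j as the frequencies of the individual domains. The domain names and numbers are hypothetical.

```python
import math


def domain_pair_score(d_ij, d_i, d_j):
    """S(D_i, D_j) = log2(d_ij / (d_i * d_j)), with the ratio reading assumed."""
    return math.log2(d_ij / (d_i * d_j))


def protein_pair_score(domains_p, domains_q, pair_scores):
    """S(p, q): best score over domain pairs with one domain in p and one in q."""
    candidates = [
        pair_scores[frozenset((a, b))]
        for a in domains_p
        for b in domains_q
        if frozenset((a, b)) in pair_scores
    ]
    return max(candidates, default=float("-inf"))


# Hypothetical frequencies, for illustration only.
pair_scores = {
    frozenset(("D1", "D2")): domain_pair_score(0.5, 0.25, 0.5),
    frozenset(("D1", "D3")): domain_pair_score(0.125, 0.25, 0.5),
}

score = protein_pair_score({"D1"}, {"D2", "D3"}, pair_scores)
print(pair_scores, score)
```

Returning negative infinity when no scored domain pair exists keeps such protein pairs at the bottom of any ranking; that default is a design choice of this sketch.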
49 Section: Evaluation
53 Competing programs & dataset. Four methods from Park's evaluation paper (Park, 2009): M1: Martin et al.; M2: Pitre et al.; M3: Shen et al.; M4: Guo et al. Dataset: yeast. Evaluation: cross validation.
54 Cross validation
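A minimal sketch of k-fold cross validation over a list of labelled protein pairs. The number of folds is not stated in the transcript, so `k` is left as a parameter; the function name, the fixed shuffle seed and the toy data are assumptions.

```python
import random


def k_fold_splits(items, k, seed=0):
    """Yield (train, test) lists; each item is used for testing exactly once."""
    items = list(items)
    random.Random(seed).shuffle(items)       # reproducible shuffle
    folds = [items[i::k] for i in range(k)]  # k nearly equal folds
    for i in range(k):
        test = folds[i]
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield train, test


pairs = list(range(10))  # stand-ins for protein pairs
splits = list(k_fold_splits(pairs, k=5))
for train, test in splits:
    print(len(train), len(test))
```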
56 Receiver operating characteristic (ROC) curve
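A sketch of how the points of a ROC curve can be computed from prediction scores: each distinct score is used as a threshold, and sensitivity and specificity are measured at that threshold. The scores and labels are made up for illustration, and the points are returned as (specificity, sensitivity) to match the axes named on the result slides.

```python
def roc_points(scores, labels):
    """(specificity, sensitivity) at each distinct threshold, highest threshold first."""
    pos = sum(1 for y in labels if y)
    neg = len(labels) - pos
    if pos == 0 or neg == 0:
        raise ValueError("need at least one positive and one negative example")
    points = []
    for threshold in sorted(set(scores), reverse=True):
        # A pair is predicted as interacting when its score reaches the threshold.
        tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y)
        fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and not y)
        points.append((1 - fp / neg, tp / pos))
    return points


scores = [0.9, 0.8, 0.6, 0.4]  # hypothetical prediction scores
labels = [1, 0, 1, 0]          # 1 = known interaction, 0 = non-interaction
points = roc_points(scores, labels)
print(points)
```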
57 Results, with the other four methods. Figure: sensitivity against specificity for M1, M2, M3, M4 and our algorithm.
58 Results, with the consensus. Figure: sensitivity against specificity; legend: M1, M4, consensus, our algorithm.
59 Section: Conclusion
63 PPIs are important, and we predict them with a proposed sequence-based algorithm. Our method is better than the competing methods for most of the important values. Our method also surpasses the consensus of the other methods. Further work: better identification of domains; improving the entire ROC curve; speeding up the program.
64 Thank you!