A New Algorithm for Protein-Protein Interaction Prediction

1 A New Algorithm for Protein-Protein Interaction Prediction Yiwei Li The University of Western Ontario Supervisor: Dr. Lucian Ilie

2 Overview
  1 Background
  2 Previous approaches
      Experimental approaches
      Computational approaches
  3 A new algorithm
  4 Evaluation
  5 Conclusion

3 Background

4 DNA

5 Transcription and translation

6 Protein

7 Protein-protein interaction (PPI)

9 Protein-protein interaction (PPI)
  Applications
    - Predicting protein functions
    - Development of novel drugs
    - Drug targeting and side effects
  Research
    - Knowing PPIs is important
    - Many PPIs are still unknown

10 What we want to do
  Input: protein sequences p_seq, known interactions k_int
  Output: unknown interactions uk_int
  PPI_predict(p_seq, k_int) {
      read and process p_seq and k_int;
      output uk_int;
  }

11 FASTA format
>YPR161C
TTGACMTTGACMTTGGGRRGNGGGGTGACM
ACMTTGACTTGACMTGGGGGGQGGGACTTG
QSDAVMTTGACMTTGGGGTGGGDGAQSDAV
ACMTTGACMTTGACMGGMGGDGGRGACMTT
TTGACMTTGA
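
A minimal C++ sketch of reading FASTA input like the record above. The names `FastaRecord`, `read_fasta` and `read_fasta_string` are hypothetical and not taken from the presented tool; the sketch only assumes that a line starting with `>` opens a record and that the lines after it hold the sequence.

```cpp
#include <cassert>
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// One FASTA record: the header text after '>' and the concatenated residues.
struct FastaRecord {
    std::string id;
    std::string sequence;
};

// Reads all records from a stream. A line starting with '>' opens a new
// record; every other non-empty line is appended to the current sequence.
std::vector<FastaRecord> read_fasta(std::istream& in) {
    std::vector<FastaRecord> records;
    std::string line;
    while (std::getline(in, line)) {
        if (!line.empty() && line.back() == '\r') line.pop_back();  // tolerate CRLF
        if (line.empty()) continue;
        if (line[0] == '>') {
            records.push_back({line.substr(1), ""});
        } else if (!records.empty()) {
            records.back().sequence += line;  // sequence may span several lines
        }
    }
    return records;
}

// Convenience wrapper for FASTA text that is already in memory.
std::vector<FastaRecord> read_fasta_string(const std::string& text) {
    std::istringstream in(text);
    return read_fasta(in);
}
```

Calling `read_fasta_string` on the record shown on this slide would give a single record with `id` equal to `YPR161C` and the five sequence lines joined into one string.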

12 Previous approaches

13 Previous approaches: Experimental approaches

16 Yeast two-hybrid (Y2H)
  Process:
    1. Attach the DNA-binding domain (DBD) to one protein and the activation domain (AD) to the other
    2. Detect the reporter

20 Tandem affinity purification (TAP)
  Process:
    1. Attach a tag
    2. Tandem wash
    3. Detect remaining proteins

22 Drawbacks
  - Time and labour consuming
  - The false positive and false negative rates are high

23 Previous approaches: Computational approaches

24 Genomic methods
  Predict: proteins 1 and 2 interact
  Because: they are close to each other in genomes A, B and D

25 Evolutionary relationship methods
  Predict: proteins 1 and 4 interact
  Because: they both appear in genomes A and D

26 Protein structure methods
  Predict: proteins 1 and 4 interact
  Because: their docking parts fit best

27 Domain methods
  Known: proteins A and B interact via domains 1 and 2
  Predict: proteins C and D interact
  Because: C has a part similar to domain 1, and D has a part similar to domain 2

28 Domain-domain interaction (DDI)

29 Network analysis methods
  Predict: proteins A and C interact
  Because: adding the edge A-C would form a clique

30 Primary protein structure methods
  - Sequence based, no evolutionary information required
  - Machine learning

31 A new algorithm

32 A new algorithm
  Goal: a more accurate and faster tool for PPI prediction
  Implemented in: C++, OpenMP

37 Idea
  - PPIs are vital
  - Since domains are responsible for interactions
  - Domains are preserved during evolution
  - Finding sequence similarity

38 Consecutive seed
  [Figure: hit and extend in BLAST. A hit between two sequences is found with a consecutive seed and then extended in both directions.]

41 Spaced seed
  Consecutive seed: not sensitive enough
  Spaced seed: *1*1**11
  Example hit (1 = the position must match, * = don't care):
    A N T R A N T R A
    A N T Q A B Q R A
    1 * 1 * 1 * * 1 1
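
The hit test for a single spaced seed can be sketched as below. This is an illustration only: `seed_hits` is a hypothetical name, and the sketch assumes the usual convention that `1` marks a position that must match and `*` marks a position that is ignored.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Returns true if `seed` produces a hit between `a` starting at offset i
// and `b` starting at offset j: every '1' position must hold equal
// characters, while '*' positions are not compared.
bool seed_hits(const std::string& seed,
               const std::string& a, std::size_t i,
               const std::string& b, std::size_t j) {
    assert(!seed.empty());
    // The seed window must fit inside both sequences.
    if (i + seed.size() > a.size() || j + seed.size() > b.size()) return false;
    for (std::size_t k = 0; k < seed.size(); ++k) {
        if (seed[k] == '1' && a[i + k] != b[j + k]) return false;
    }
    return true;
}
```

With the two sequences from the example, `seed_hits("1*1*1**11", "ANTRANTRA", 0, "ANTQABQRA", 0)` is true, while the consecutive seed `11111` of the same weight fails at offset 0 because the fourth characters differ (R against Q).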

42 Consecutive seed vs spaced seed
  [Figure: hit positions of a consecutive seed compared with those of a spaced seed]
  BLAST hits are more clustered than those of the spaced seed.

43 Multiple spaced seeds
  8 spaced seeds of weight 5, generated by SpEED (Ilie, 2011)
    seed 1 = 11*111
    seed 2 = 111**1*1
    seed 3 = 11***1*11
    seed 4 = 1*1***111
    seed 5 = 1*1*1**11
    seed 6 = 11*****1*11
    seed 7 = 11*1**1***1
    seed 8 = 11**1****11
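
One common way to use several seeds at once is to index one sequence by the characters under each seed's `1` positions and then look up the windows of the other sequence. The sketch below follows that idea; it is not the implementation presented in the talk. `Hit`, `seed_key` and `find_hits` are hypothetical names, and plain string keys in a hash map stand in for whatever encoding the tool uses.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <unordered_map>
#include <vector>

// A hit of seed number `seed` between a[i...] and b[j...].
struct Hit {
    std::size_t seed, i, j;
};

// Collects the characters of s that lie under the '1' positions of the
// seed when the seed is placed at offset pos. The caller guarantees that
// the seed fits inside s at that offset.
std::string seed_key(const std::string& seed, const std::string& s, std::size_t pos) {
    std::string key;
    for (std::size_t k = 0; k < seed.size(); ++k) {
        if (seed[k] == '1') key.push_back(s[pos + k]);
    }
    return key;
}

// Finds all hits of all seeds between a and b. For each seed, the windows
// of a are stored in a hash map keyed by seed_key, then each window of b
// is looked up in that map.
std::vector<Hit> find_hits(const std::vector<std::string>& seeds,
                           const std::string& a, const std::string& b) {
    std::vector<Hit> hits;
    for (std::size_t s = 0; s < seeds.size(); ++s) {
        const std::string& seed = seeds[s];
        assert(!seed.empty());
        std::unordered_map<std::string, std::vector<std::size_t>> index;
        for (std::size_t i = 0; i + seed.size() <= a.size(); ++i) {
            index[seed_key(seed, a, i)].push_back(i);
        }
        for (std::size_t j = 0; j + seed.size() <= b.size(); ++j) {
            auto it = index.find(seed_key(seed, b, j));
            if (it == index.end()) continue;
            for (std::size_t i : it->second) hits.push_back({s, i, j});
        }
    }
    return hits;
}
```

Each seed catches similarities that the others may miss, which is the motivation for using a set of seeds rather than a single one.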

48 Steps
  1. Encoding seeds and proteins
  2. Searching for domains as sequence similarities
  3. Scoring domain pairs: $S(D_i, D_j) = \log_2 \frac{d_{ij}}{d_i d_j}$
  4. Scoring protein pairs: $S(p, q) = \max\{ S(D_i, D_j) \mid D_i \text{ intersects } p,\ D_j \text{ intersects } q \}$
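
The two scoring steps can be sketched in C++ as follows. The sketch assumes the domain-pair formula reads log2(d_ij / (d_i * d_j)); the slide does not define d_ij, d_i and d_j, so they are treated here simply as positive numbers supplied by the caller. The function names and the table layout `score[i][j]` are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <limits>
#include <vector>

// S(D_i, D_j) = log2(d_ij / (d_i * d_j)), assuming all three quantities
// are positive so that the logarithm is defined.
double domain_pair_score(double d_ij, double d_i, double d_j) {
    assert(d_ij > 0.0 && d_i > 0.0 && d_j > 0.0);
    return std::log2(d_ij / (d_i * d_j));
}

// S(p, q) = max of S(D_i, D_j) over domains D_i intersecting p and
// domains D_j intersecting q.
//   domains_p, domains_q: indices of the domains intersecting p and q
//   score[i][j]:          precomputed S(D_i, D_j)
// Returns negative infinity when either protein has no domain.
double protein_pair_score(const std::vector<int>& domains_p,
                          const std::vector<int>& domains_q,
                          const std::vector<std::vector<double>>& score) {
    double best = -std::numeric_limits<double>::infinity();
    for (int i : domains_p) {
        for (int j : domains_q) {
            best = std::max(best, score[i][j]);
        }
    }
    return best;
}
```

Taking the maximum means that a single strongly scoring domain pair is enough to give the protein pair a high score.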

49 Evaluation

53 Competing programs & dataset
  Four methods from Park's evaluation paper (Park, 2009):
    M1: Martin et al.
    M2: Pitre et al.
    M3: Shen et al.
    M4: Guo et al.
  Dataset: Yeast
  Cross validation

54 Cross validation

55 Receiver Operating Characteristics (ROC) curve

56 Receiver Operating Characteristics (ROC) curve
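
A point on a ROC curve is obtained by fixing a score threshold and counting how the predictions compare with the known labels. The sketch below shows that computation under the assumption that a protein pair is predicted to interact when its score is at least the threshold; `RocPoint` and `roc_point` are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct RocPoint {
    double threshold;
    double sensitivity;  // TP / (TP + FN)
    double specificity;  // TN / (TN + FP)
};

// Computes sensitivity and specificity for one threshold.
//   scores[k]:    predicted score of protein pair k
//   interacts[k]: true if pair k is a known interaction
RocPoint roc_point(const std::vector<double>& scores,
                   const std::vector<bool>& interacts, double threshold) {
    assert(scores.size() == interacts.size());
    std::size_t tp = 0, fn = 0, tn = 0, fp = 0;
    for (std::size_t k = 0; k < scores.size(); ++k) {
        const bool predicted = scores[k] >= threshold;
        if (interacts[k]) {
            if (predicted) ++tp; else ++fn;
        } else {
            if (predicted) ++fp; else ++tn;
        }
    }
    // Guard against empty classes to avoid dividing by zero.
    const double sens = (tp + fn) ? static_cast<double>(tp) / (tp + fn) : 0.0;
    const double spec = (tn + fp) ? static_cast<double>(tn) / (tn + fp) : 0.0;
    return {threshold, sens, spec};
}
```

Evaluating `roc_point` over a range of thresholds and plotting sensitivity against specificity traces out the curve.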

57 Results, compared with the other four methods
  [ROC plot: sensitivity against specificity for M1, M2, M3, M4 and our algorithm]

58 Results, compared with the consensus
  [ROC plot: sensitivity against specificity for M1, M4, the consensus and our algorithm]

59 Conclusion

63 PPIs are important; we predict them with a proposed sequence-based algorithm.
   Our method is better than the competing methods for most of the important values.
   Our method surpasses the consensus of the other methods.
   Further work: better identification of domains; improving the entire ROC curve; speeding up the program.

64 Thank you!