Supervised Machine Learning Algorithms for Protein Structure Classification. Pooja Jain, February 10, 2009, IMA Seminar


1 Supervised Machine Learning Algorithms for Protein Structure Classification
  Pooja Jain
  February 10, 2009, IMA Seminar

2 Overview
  What is Structural Classification & Why?
  SCOP
  Objective
  Methodology
  Machine Learning
  Results and Discussion
  Conclusion and Future Work

3 Protein Structure Classification
  To glean information about:
  Structural Similarity
  Evolutionary Relationships
    Convergent Evolution: same structure, different origin
    Divergent Evolution: different structure, same/similar origin
  Cellular Function
    Homologous: same structure, same or similar function (common ancestry)
    Analogous: same structure, different function (unknown ancestry)

4 Structural Classification
  High-throughput structure determination
  A major challenge is the organisation of protein structure
  * Science, 2008, doi: /science
  Current Opinion in Biotechnology, 2001, doi: /s (00)
  ** Journal of Biomolecular NMR, 2004, doi: /a:

5 Structural Classification Of Proteins (SCOP)
  Database of protein structural classification
  Hierarchical classification for 34,500 Protein Data Bank entries
  Seven different structural levels for classification:
    Class, Fold, Super Family, Family, Domain, Species, Protein
  http://scop.mrc-lmb.cam.ac.uk/scop/
  A. G. Murzin, S. E. Brenner, T. Hubbard and C. Chothia, 1995, J. Mol. Biol. 247

6 A Classification Sub-Tree from SCOP

7 Objective
  To help maintain the existing classification scheme(s)
  Identify, for a query protein structure, the structural level it shares with already classified protein(s)
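The objective can be illustrated with a minimal sketch, assuming each protein is labelled with a SCOP concise classification string (sccs) such as "a.1.1.2", which encodes class, fold, superfamily and family. The function name and the restriction to these four levels are assumptions made for illustration; the talk itself does not show how the shared level was computed.

```python
# Hypothetical helper, not taken from the talk.
# An sccs string such as "a.1.1.2" reads class.fold.superfamily.family.
SCCS_LEVELS = ["Class", "Fold", "Superfamily", "Family"]


def deepest_common_level(sccs_a, sccs_b):
    """Return the deepest SCOP level shared by two sccs strings, or None."""
    shared = None
    parts_a = sccs_a.split(".")
    parts_b = sccs_b.split(".")
    for level, part_a, part_b in zip(SCCS_LEVELS, parts_a, parts_b):
        if part_a != part_b:
            break  # the hierarchy diverges here, so deeper levels cannot match
        shared = level
    return shared


print(deepest_common_level("a.1.1.2", "a.1.1.3"))  # Superfamily
print(deepest_common_level("a.1.1.2", "b.1.1.2"))  # None
```

The comparison stops at the first mismatch because a SCOP family is only meaningful inside its superfamily, fold and class.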

8 Methodology
  One-dimensional representation of protein structural and sequence-only features
  Characterise the secondary structure elements (SSEs) in the protein

9 One-Dimensional Structural Features
  Calculated using DSSP assignments (Definition of Secondary Structure of Proteins), given a set of 3D coordinates
  Distance (ρ)
  Orientation (θ)
  Solvent Accessibility (δ)
  Length (η)
  Type (κ): helix or strand
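Two of these features, type (κ) and length (η), can be read directly off a per-residue DSSP assignment string. The sketch below is an assumption about how that might look, not the author's code: it treats only the DSSP codes "H" (helix) and "E" (strand) as SSEs and ignores every other code. Distance, orientation and solvent accessibility need 3D coordinates and are not covered here.

```python
from itertools import groupby


def sse_segments(dssp_string):
    """Collapse a per-residue DSSP string into (type, start, length) tuples.

    Only 'H' (helix) and 'E' (strand) runs are kept; this simplification
    is an assumption made for illustration.
    """
    segments = []
    position = 0
    for code, run in groupby(dssp_string):
        length = len(list(run))
        if code == "H":
            segments.append(("helix", position, length))
        elif code == "E":
            segments.append(("strand", position, length))
        position += length  # advance past this run whether or not it was kept
    return segments


# A made-up assignment string containing three SSEs.
example = "--" + "H" * 6 + "TT-" + "E" * 5 + "--" + "H" * 4
print(sse_segments(example))
# [('helix', 2, 6), ('strand', 11, 5), ('helix', 18, 4)]
```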

10 One-Dimensional Sequence Features
  Sequence Similarity
  Sequence Length

11 Data Set
  Proteins from the ASTRAL Compendium with three SSEs
  65,341 pairs of such proteins
  Removed pairs with > 35% sequence identity
  Paired profiles for 11,630 pairs of proteins and the deepest common structural level
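The 35% sequence identity cut-off can be sketched as follows. The talk does not say how identity was computed, so this example assumes each pair is already aligned (gaps written as "-") and counts identity over columns where neither sequence has a gap. The function names and the toy sequences are hypothetical.

```python
def percent_identity(aligned_a, aligned_b):
    """Percent identity over aligned columns that contain no gap."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    columns = [(a, b) for a, b in zip(aligned_a, aligned_b)
               if a != "-" and b != "-"]
    if not columns:
        return 0.0
    matches = sum(1 for a, b in columns if a == b)
    return 100.0 * matches / len(columns)


def keep_pair(aligned_a, aligned_b, threshold=35.0):
    """Keep a pair only if its identity does not exceed the threshold."""
    return percent_identity(aligned_a, aligned_b) <= threshold


# Toy aligned pairs, invented for illustration.
pairs = [("ACD-FG", "ACE-FG"), ("ACDEFG", "WYKLMN")]
kept = [pair for pair in pairs if keep_pair(*pair)]
print(kept)  # [('ACDEFG', 'WYKLMN')]
```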

12 Overall Approach
  Inputs: a protein of known classification and a protein of unknown classification (query)
  For each protein: constituent SSEs, structural features, sequence-only descriptors
  Both proteins are combined into a paired profile of descriptors
  The paired profile is passed to a supervised ML algorithm

13 Machine Learning
  Two-stage procedure:
    Evaluation of 15 ML algorithms as base learners
    Evaluation of the best base learners as meta learners
  Ten-fold cross-validation
  Boosted and bagged meta learning
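Ten-fold cross-validation can be sketched in plain Python as below. This is an illustrative assumption, not the toolkit used in the talk: the fold assignment, the seed and the majority-class stand-in learner are all hypothetical, and a real run would plug in one of the base or meta learners instead.

```python
import random
from collections import Counter


def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train, test) index lists so each sample is tested exactly once."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, folds[i]


def majority_class_learner(train_x, train_y):
    """Stand-in learner: always predicts the most frequent training label."""
    most_common = Counter(train_y).most_common(1)[0][0]
    return lambda x: most_common


def cross_validated_accuracy(xs, ys, learner, k=10, seed=0):
    """Fraction of samples predicted correctly across all k held-out folds."""
    correct = 0
    for train, test in k_fold_indices(len(xs), k, seed):
        predict = learner([xs[i] for i in train], [ys[i] for i in train])
        correct += sum(1 for i in test if predict(xs[i]) == ys[i])
    return correct / len(xs)


xs = list(range(20))
ys = ["fold"] * 15 + ["family"] * 5
print(cross_validated_accuracy(xs, ys, majority_class_learner))  # 0.75
```

Any function that takes training data and returns a predictor can replace `majority_class_learner`, which is how a bagged or boosted ensemble would be evaluated under the same splits.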

14 Results: Base Learners
  Category                  Base Learner             % Accuracy
  Naïve Bayes               Naïve Bayes              62.5
                            Bayes Net                84.0
  Neural Network            Simple Logistic          78.9
                            Multilayer Perceptron    90.6
                            RBFNet                   70.5
  Decision Trees            Decision Tree            51.0
                            J48
                            NBTree                   93.6
                            REPTree                  90.8
                            SCART                    92.7
  Rule Learners             JRip                     93.6
                            PART                     93.6
                            OneR                     69.2
  Support Vector Machines   Polynomial kernel        78.9
                            RBF kernel               72.0

15 Results: Base Learners
  [Figure: chart comparing the base learners; axis labels not recoverable]
  $g\text{-means} = \sqrt{\frac{TP}{TP + FN} \cdot \frac{TN}{TN + FP}}$
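The g-means score on this slide combines TP/(TP + FN) and TN/(TN + FP). A minimal sketch follows, assuming the usual definition as the geometric mean (square root of the product) of sensitivity and specificity. The confusion-matrix counts are invented for illustration.

```python
import math


def g_means(tp, fn, tn, fp):
    """Geometric mean of sensitivity TP/(TP+FN) and specificity TN/(TN+FP)."""
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("need at least one positive and one negative example")
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return math.sqrt(sensitivity * specificity)


# Invented counts: sensitivity 0.9, specificity 0.8.
print(round(g_means(tp=90, fn=10, tn=80, fp=20), 4))  # 0.8485
```

Unlike plain accuracy, this score drops towards zero when either class is predicted poorly, which matters when the structural levels are unevenly represented.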

16 Results: Meta Learners
  Columns: Base Learner, Bagged Meta Learner (% accuracy), Boosted Meta Learner (% accuracy)
  Base learners compared: BayesNet, MLP, JRip, PART, J48, NBTree, RandomForest, Poly-K

17 [Figure: Training Time (Minutes) for Different Boosted Meta Learners]

18 Statistically Accurate Meta Learner
  [Figure: comparison of meta learners, p = 0.05; axis labels not recoverable]
  $f\text{-measure} = \frac{(1 + \alpha)(\text{precision} \cdot \text{recall})}{\alpha \cdot \text{precision} + \text{recall}}$
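The f-measure on this slide can be sketched as below, assuming the common weighted form (1 + α)·precision·recall / (α·precision + recall). With the default α = 1 it reduces to the harmonic mean of precision and recall. The input values are invented for illustration.

```python
def f_measure(precision, recall, alpha=1.0):
    """Weighted f-measure; alpha = 1 gives the harmonic mean of the two."""
    denominator = alpha * precision + recall
    if denominator == 0:
        return 0.0  # both precision and recall are zero
    return (1 + alpha) * precision * recall / denominator


# Invented values for illustration.
print(round(f_measure(0.8, 0.6), 4))  # 0.6857
```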

19 Conclusion
  Boosting in general increases accuracy
  Boosted Random Forests were significantly accurate (97%)
  NBTree gave comparable performance (96%) but is computationally expensive, with long training times
  Random Forests were best at predicting the Super Family level
  Applicable to automating structure classification

20 Future Work
  Structural classification using Random Forests

21 Acknowledgements
  Jonathan Hirst
  Jon Garibaldi
  BIOPTRAIN project
  Marie Curie Action MEST CT under FP6 EU