Illuminating Genetic Networks with Random Forest


1 Illuminating Genetic Networks with Random Forest. Andreas Beyer, University of Cologne

2 Outline: Random Forest; Applications; QTL mapping; Epistasis (analyzing model structure)

3 Random Forest: How does it work?

4 Random Forest. [Figure: samples with a response Y and predictors X.]

5 Random Forest. [Figure: decision tree splitting on predictor 1, predictor 2, and predictor 3; leaves labeled low / class 1 and high / class 2.] Leo Breiman, 2001

7 RF uses CART: Classification And Regression Trees (Breiman et al., 1984)

8 Splitting Rules. Classification: Gini impurity. Minimize: $I_G = 1 - \sum_{k=1}^{m} f_k^2$, where $f_k$ = fraction of items labeled $k$ and $m$ = number of possible values.
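
A minimal sketch of the classification criterion, taking Gini impurity as one minus the sum of squared class fractions. The function names are illustrative, and weighting the two child nodes by their size is the usual CART convention, assumed here because the slide does not spell it out.

```python
from collections import Counter


def gini_impurity(labels):
    """Gini impurity: 1 minus the sum of squared class fractions."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = Counter(labels)
    return 1.0 - sum((c / n) ** 2 for c in counts.values())


def split_impurity(left, right):
    """Size-weighted Gini impurity of the two child nodes (assumed convention)."""
    n = len(left) + len(right)
    return (len(left) * gini_impurity(left) + len(right) * gini_impurity(right)) / n


# A pure node has impurity 0; an evenly mixed two-class node has impurity 0.5.
pure = gini_impurity(["class 1"] * 4)
mixed = gini_impurity(["class 1", "class 1", "class 2", "class 2"])
perfect_split = split_impurity(["class 1", "class 1"], ["class 2", "class 2"])
```

A tree-growing routine would evaluate `split_impurity` for each candidate split and keep the one with the lowest value.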

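9 Splitting Rules. Regression: RSS. Minimize: $\mathrm{RSS} = \sum_{y_i \in Y_l} (y_i - \bar{Y}_l)^2 + \sum_{y_i \in Y_r} (y_i - \bar{Y}_r)^2$ (left node + right node), where $y_i$ = $i$-th item; $Y_l$, $Y_r$ = items in left (right) node; $n_l$, $n_r$ = number of items in left (right) node; $\bar{Y}_l = \frac{1}{n_l} \sum_{y_i \in Y_l} y_i$, $\bar{Y}_r = \frac{1}{n_r} \sum_{y_i \in Y_r} y_i$ = average of left (right) items.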
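
The regression criterion can be sketched as an exhaustive search over thresholds on a single predictor, scoring each candidate by the residual sum of squares of the left node plus that of the right node. The names `rss` and `best_split` and the toy data are assumptions made for illustration.

```python
def rss(values):
    """Residual sum of squares of the values around their own mean."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def best_split(x, y):
    """Return (threshold, RSS) of the split x <= threshold with the lowest total RSS."""
    best = None
    # The largest value is skipped: splitting there would leave the right node empty.
    for threshold in sorted(set(x))[:-1]:
        left = [yi for xi, yi in zip(x, y) if xi <= threshold]
        right = [yi for xi, yi in zip(x, y) if xi > threshold]
        total = rss(left) + rss(right)
        if best is None or total < best[1]:
            best = (threshold, total)
    return best


# Toy data: the response jumps between x = 3 and x = 4.
x = [1, 2, 3, 4, 5, 6]
y = [1.0, 1.2, 0.8, 5.0, 5.2, 4.8]
threshold, total = best_split(x, y)
```

On this toy data the search settles on the threshold 3, which separates the low responses from the high ones.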

10 Decision trees are nice to interpret, but generalize very poorly (large variance)

11 Random Forest. [Figure: response Y and predictors X; predictors are drawn by random sampling, samples by bootstrap.]

12 Random Forest: Grow many trees! Average predictions across all trees.
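
The two sources of randomness (bootstrapped samples, randomly sampled predictors) and the final averaging can be sketched as follows. This is a deliberately reduced illustration, not Breiman's full algorithm: each "tree" is a single-split stump, and predictors are sampled once per tree rather than at every split. All names and the toy data are assumptions.

```python
import random


def fit_stump(X, y, features):
    """Best single split (feature, threshold, left mean, right mean) by RSS."""
    best = None
    for j in features:
        for t in sorted({row[j] for row in X})[:-1]:
            left = [yi for row, yi in zip(X, y) if row[j] <= t]
            right = [yi for row, yi in zip(X, y) if row[j] > t]
            ml, mr = sum(left) / len(left), sum(right) / len(right)
            score = sum((v - ml) ** 2 for v in left) + sum((v - mr) ** 2 for v in right)
            if best is None or score < best[0]:
                best = (score, j, t, ml, mr)
    if best is None:
        # No valid split in this bootstrap sample: always predict the mean.
        m = sum(y) / len(y)
        return (0, float("inf"), m, m)
    return best[1:]


def fit_forest(X, y, n_trees=200, n_features=1, seed=0):
    rng = random.Random(seed)
    n, p = len(X), len(X[0])
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]   # bootstrap the samples
        feats = rng.sample(range(p), n_features)     # randomly sample predictors
        forest.append(fit_stump([X[i] for i in idx], [y[i] for i in idx], feats))
    return forest


def predict(forest, row):
    """Average the predictions of all trees."""
    preds = [ml if row[j] <= t else mr for j, t, ml, mr in forest]
    return sum(preds) / len(preds)


# Toy data: predictor 0 drives the response, predictor 1 is uninformative.
X = [[0, 0], [0, 1], [0, 0], [0, 1], [1, 0], [1, 1], [1, 0], [1, 1]]
y = [1.0, 1.1, 0.9, 1.0, 5.0, 5.1, 4.9, 5.0]
forest = fit_forest(X, y)
low, high = predict(forest, [0, 0]), predict(forest, [1, 0])
```

Because roughly half of the stumps only see the uninformative predictor, the averaged predictions are pulled toward the overall mean; fully grown trees with per-split sampling would not show this effect to the same degree.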

13 Benefits: works very well in practice; very broadly applicable; intuitive algorithm; robust, no overfitting; no assumptions about data; virtually no tuning needed; very easily parallelizable; accounts for complex interactions between features.

14 Drawbacks: difficult interpretation. What is the underlying model? Almost impossible to capture analytically.

15 Predicting Protein-Protein Interactions (human, yeast). Elefsinioti et al., Molec. Cell. Prot.; Sarac et al., Bioinformatics

16 Proximity measure: score similarities (or differences) of samples. [Figure: pairs of samples across three trees.] Very similar: together in 3/3 trees. Quite similar: together in 2/3. Different: together in 0/3.
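
A sketch of the proximity measure as the fraction of trees in which two samples land in the same leaf. The input format (one list of leaf identifiers per tree) and the function name are assumptions; the toy leaf assignments reproduce the 3/3, 2/3, and 0/3 cases from the slide.

```python
def proximity(leaf_ids):
    """leaf_ids[t][i] is the leaf reached by sample i in tree t.

    Returns an n x n matrix holding, for each pair of samples,
    the fraction of trees in which both end up in the same leaf.
    """
    n_trees = len(leaf_ids)
    n = len(leaf_ids[0])
    counts = [[0] * n for _ in range(n)]
    for leaves in leaf_ids:
        for i in range(n):
            for j in range(n):
                if leaves[i] == leaves[j]:
                    counts[i][j] += 1
    return [[c / n_trees for c in row] for row in counts]


# Three trees, four samples (leaf labels are arbitrary within each tree).
leaf_ids = [
    [0, 0, 0, 1],
    [0, 0, 1, 2],
    [5, 5, 5, 6],
]
prox = proximity(leaf_ids)
# Samples 0 and 1 share a leaf in 3/3 trees, 0 and 2 in 2/3, 0 and 3 in 0/3.
```

One minus this proximity can serve as a dissimilarity for clustering.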

17 Weighted Clustering. 1. Learn model to predict outcome (Features → RF → Outcome). 2. Use feature importance as weights (Proximity measure → PAM clustering → Clusters). Michaelson, Trump, et al., BMC Genomics

18 QTL Analysis

19 Genetic Association
ACCGTCCGACACGTTTGGACAAGTACGCTGCAACACACCCGTACCAATTTTGG
ACCGACCGACACGTTTGGACAAGTACGTTGCAACACACCCGTACCAATTTTGG
ACCGACCGACACGTTTGGACAAGTACGCTGCAACACACCCGTACCAAAATTGG
ACCGTCCCACACGTTTGGTCAAGTACGCTGCAACACACCCGTACCAATTTTGG
ACCGACCCACACGTTTGGTCAAATACGTTGCAACACACCCGTACCAATTTTGG
ACCGTCCGACACGTTTGGTCAAATACGTTGCAACACACCCGTACCAATTTTGG
ACCGTCCGACACGTTTGGTCAAGTACGCTGCAACACACCCGTACCAATTTTGG

21 Published Genome-Wide Associations through 2015. Published GWA at $p \le 5 \times 10^{-8}$ for 17 trait categories. NHGRI GWA Catalog

22 Quantitative Trait Loci (QTL). Sub-type of GWAS. Must have several causal loci (complex trait). Why? Standard approach: t-test. [Figure: trait values compared between allele A and allele a.]
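
The standard single-marker approach can be sketched as a t-statistic per marker, comparing trait values of samples carrying allele A with those carrying allele a. The slide only says "t-test"; the Welch (unequal-variance) form used here is an assumption, and the marker names and trait values are made up for illustration. Converting the statistic to a p-value is left out because it needs a t-distribution, which the standard library does not provide.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(trait, genotype):
    """Welch t-statistic comparing trait values of allele 'A' vs allele 'a'."""
    upper = [y for y, g in zip(trait, genotype) if g == "A"]
    lower = [y for y, g in zip(trait, genotype) if g == "a"]
    se = sqrt(variance(upper) / len(upper) + variance(lower) / len(lower))
    return (mean(upper) - mean(lower)) / se


# Hypothetical trait values and genotypes for six individuals.
trait = [10.1, 9.8, 10.3, 7.9, 8.2, 8.0]
markers = {
    "marker_1": ["A", "A", "A", "a", "a", "a"],  # tracks the trait
    "marker_2": ["A", "a", "A", "a", "A", "a"],  # unrelated to the trait
}
scores = {name: welch_t(trait, genotype) for name, genotype in markers.items()}
```

Each marker is tested in isolation here, so the scan cannot account for several loci acting together.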

23 RF for QTL mapping. Feature matrix: genetic markers → RF → trait (e.g. body size). Feature importance = importance of marker.

24 Pathway Consistency. Enrichment of gene pairs in same pathway. [Figure: enrichment curves, including Random Forest.] Michaelson et al., BMC Genomics

25 RF-based QTL Mapping. Michaelson et al., BMC Syst. Biol.: comparison using real data; Ackermann et al., PLoS ONE: comparison using simulated data (DREAM); Picotti, Clément-Ziza et al., Nature: extracting epistatic interactions; Ackermann et al., PLoS Genetics: multiple cell types/conditions; Clément-Ziza et al., Molec. Systems Biol.: non-coding genes, antisense transcription; Stephan et al., Nat. Commun.: population substructure; Valenzano et al., Cell: mapping traits in fish.

26 Epistasis with RF: Getting a grip on RF structure. Jake Michaelson, Mathieu Clément-Ziza, Jan Großbach, Corinna Schmalohr

27 What is Epistasis? Non-additive interaction between markers (predictors). [Figure: three panels plotting trait against genotype (AB, Ab, aB, ab): epistatic, additive, epistatic.]
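
One way to make "non-additive" concrete is an interaction contrast on the four genotype means: the effect of B on the A background minus the effect of B on the a background. Under a purely additive model the two effects are equal and the contrast is zero. This is a generic illustration, not a method taken from the slides; the function name and toy traits are assumptions.

```python
from statistics import mean


def interaction_contrast(trait, marker_a, marker_b):
    """(AB - Ab) - (aB - ab) on the mean trait of each two-locus genotype."""
    cells = {}
    for y, a, b in zip(trait, marker_a, marker_b):
        cells.setdefault((a, b), []).append(y)
    m = {genotype: mean(values) for genotype, values in cells.items()}
    effect_of_b_given_A = m[("A", "B")] - m[("A", "b")]
    effect_of_b_given_a = m[("a", "B")] - m[("a", "b")]
    return effect_of_b_given_A - effect_of_b_given_a


# Two individuals per genotype, in the order AB, Ab, aB, ab.
marker_a = ["A", "A", "A", "A", "a", "a", "a", "a"]
marker_b = ["B", "B", "b", "b", "B", "B", "b", "b"]
additive_trait = [3.0, 3.0, 2.0, 2.0, 1.0, 1.0, 0.0, 0.0]   # A adds 2, B adds 1
epistatic_trait = [3.0, 3.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]  # only AB is elevated

additive_contrast = interaction_contrast(additive_trait, marker_a, marker_b)
epistatic_contrast = interaction_contrast(epistatic_trait, marker_a, marker_b)
```

The additive trait gives a contrast of zero, while the trait that is elevated only for AB gives a contrast of 3.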

28 Problem with Random Forest: to find interactions between variables, we need to know the model structure!

29 Finding Epistasis with Decision Trees. [Figure: two trees, one epistatic and one additive, each splitting first on A/a and then on B/b in both branches.] Compare slopes.

30 Algorithm: 1. Learn decision trees. 2. Compute slopes (differences) of trait values at splits. 3. Collect slopes for left and right sides (there will be many trees). 4. Compare distributions of slopes. Are they different? [Figure: slope distributions for the left and right sides; distributions that differ indicate epistasis, distributions that coincide indicate no epistasis.]
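
The idea behind these steps can be sketched with a strongly simplified stand-in: bootstrap resamples take the place of trees, every resample is "split" on marker A, and the slope of marker B (difference in mean trait between its two alleles) is recorded on the left and right side. The actual method reads slopes from the splits of the grown trees; the resampling shortcut, the `slope_shift` summary, and the simulated data are all assumptions made for illustration.

```python
import random
from statistics import mean, stdev


def slope(trait, marker):
    """Difference in mean trait between allele 1 and allele 0 of a marker."""
    hi = [y for y, g in zip(trait, marker) if g == 1]
    lo = [y for y, g in zip(trait, marker) if g == 0]
    if not hi or not lo:
        return None
    return mean(hi) - mean(lo)


def collect_slopes(trait, marker_a, marker_b, n_resamples=200, seed=0):
    """Slopes of marker B on each side of a split on marker A, over resamples."""
    rng = random.Random(seed)
    n = len(trait)
    left, right = [], []
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]  # one resample per "tree"
        for side, slopes in ((0, left), (1, right)):
            sub = [i for i in idx if marker_a[i] == side]
            s = slope([trait[i] for i in sub], [marker_b[i] for i in sub])
            if s is not None:
                slopes.append(s)
    return left, right


def slope_shift(left, right):
    """Gap between mean slopes in units of their average spread (hypothetical summary)."""
    return abs(mean(left) - mean(right)) / ((stdev(left) + stdev(right)) / 2)


# Simulated data: 80 individuals, two binary markers, Gaussian noise.
rng = random.Random(1)
marker_a = [i % 2 for i in range(80)]
marker_b = [(i // 2) % 2 for i in range(80)]
noise = [rng.gauss(0, 0.3) for _ in range(80)]
additive = [a + b + e for a, b, e in zip(marker_a, marker_b, noise)]
epistatic = [2 * a * b + e for a, b, e in zip(marker_a, marker_b, noise)]

shift_additive = slope_shift(*collect_slopes(additive, marker_a, marker_b))
shift_epistatic = slope_shift(*collect_slopes(epistatic, marker_a, marker_b))
```

For the additive trait the left and right slope distributions overlap, so the shift stays small; for the epistatic trait the slope of B depends on the side of the A split and the two distributions separate clearly.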

31 Validation on real data (Saccharomyces cerevisiae). [Figure: ROC curves (true positive rate vs. false positive rate) and precision-recall curves for RF and ANOVA.] Using Costanzo et al. for validation.

32 Why is RF better? [Figure: panels A, B, C.]

33 Random Forest: extremely versatile; robust; can analyse structure.

34 Acknowledgements. People: Jan Großbach, Johannes Stephan, Mathieu Clément-Ziza, Corinna Schmalohr, Oliver Stegle (EBI), Ruedi Aebersold (ETH), Paola Picotti, Jürg Bähler (UCL), Sam Marguerat, Xavi Masellach, Chris Workman (DTU), Manos Papadakis. Money: SystemsX.ch, The Swiss Initiative in Systems Biology.