Illuminating Genetic Networks with Random Forest
1 Illuminating Genetic Networks with Random Forest. ANDREAS BEYER, University of Cologne
2 Outline: Random Forest; Applications; QTL mapping; Epistasis (analyzing model structure)
3 Random Forest: HOW DOES IT WORK?
4 Random Forest [Figure: data layout with response Y and predictor matrix X, one row per sample]
5 Random Forest [Figure: decision tree splitting on predictor 1, predictor 3, and predictor 2, with leaves labeled low / class 1 and high / class 2] Leo Breiman, 2001
7 RF uses CART: Classification And Regression Trees, Breiman et al. (1984)
8 Splitting Rules. Classification: Gini impurity. Minimize: $I_G = 1 - \sum_{k=1}^{m} f_k^2$, where $f_k$ = fraction of items labeled $k$ and $m$ = number of possible values
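A minimal sketch of the Gini criterion from this slide, in plain Python. The function names `gini_impurity` and `split_gini` are illustrative, and weighting the two child nodes by their size is an assumption (it is the usual CART convention, but the slide only says "minimize").

```python
from collections import Counter

def gini_impurity(labels):
    """Gini impurity 1 - sum_k f_k^2, where f_k is the fraction of items labeled k."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = Counter(labels)
    return 1.0 - sum((c / n) ** 2 for c in counts.values())

def split_gini(left, right):
    """Size-weighted Gini impurity of the two child nodes (assumed weighting)."""
    n = len(left) + len(right)
    return len(left) / n * gini_impurity(left) + len(right) / n * gini_impurity(right)

pure = gini_impurity(["a", "a", "a", "a"])        # 0.0: only one class present
mixed = gini_impurity(["a", "a", "b", "b"])       # 0.5: two equally frequent classes
good_split = split_gini(["a", "a"], ["b", "b"])   # 0.0: both children are pure
bad_split = split_gini(["a", "b"], ["a", "b"])    # 0.5: the split separates nothing
```

A tree-growing routine would evaluate `split_gini` for every candidate split and keep the one with the lowest value.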
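9 Splitting Rules. Regression: RSS. Minimize: $\mathrm{RSS} = \sum_{y_i \in Y_l} (y_i - \bar{y}_l)^2 + \sum_{y_i \in Y_r} (y_i - \bar{y}_r)^2$ (first sum: left node; second sum: right node), where $y_i$ = $i$-th item; $Y_l$, $Y_r$ = items in left (right) node; $n_l$, $n_r$ = number of items in left (right) node; $\bar{y}_l = \frac{1}{n_l} \sum_{y_i \in Y_l} y_i$, $\bar{y}_r = \frac{1}{n_r} \sum_{y_i \in Y_r} y_i$ = average of left (right) items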
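The RSS rule can be illustrated with a small exhaustive search over thresholds on a single predictor. This is only a sketch: `rss` and `best_split` are hypothetical helper names, and the toy data are made up for illustration.

```python
def rss(values):
    """Residual sum of squares of the values around their own mean."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)

def best_split(x, y):
    """Return (threshold, rss) for the split x <= threshold that minimizes
    RSS(left node) + RSS(right node)."""
    best = (None, float("inf"))
    # Skip the largest value: it would leave the right node empty.
    for t in sorted(set(x))[:-1]:
        left = [yi for xi, yi in zip(x, y) if xi <= t]
        right = [yi for xi, yi in zip(x, y) if xi > t]
        total = rss(left) + rss(right)
        if total < best[1]:
            best = (t, total)
    return best

# Toy data: the response jumps between x = 3 and x = 4.
x = [1, 2, 3, 4, 5, 6]
y = [1.0, 1.2, 0.8, 5.0, 5.2, 4.8]
threshold, total_rss = best_split(x, y)
```

On this toy data the search picks the threshold at the jump, because both child nodes are then nearly constant.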
10 Decision trees are nice to interpret, but generalize very poorly (large variance)
11 Random Forest [Figure: data layout with response Y and predictor matrix X; samples (rows) are bootstrapped and predictors (columns) are randomly sampled]
12 Random Forest Grow many trees! Average predictions across all trees
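The idea of bootstrapping samples, randomly sampling predictors, growing many trees, and averaging can be sketched in a few lines. This is a deliberately simplified stand-in, not Breiman's full algorithm: each "tree" here is a one-split stump on a single randomly chosen predictor, whereas a real Random Forest grows deep trees and samples a subset of predictors at every split. All function names and the toy data are illustrative.

```python
import random

def fit_stump(X, y, feature):
    """Fit a one-split regression tree on one feature by minimizing RSS."""
    def rss(vals):
        m = sum(vals) / len(vals)
        return sum((v - m) ** 2 for v in vals)
    best = None
    for t in sorted({row[feature] for row in X})[:-1]:
        left = [yi for row, yi in zip(X, y) if row[feature] <= t]
        right = [yi for row, yi in zip(X, y) if row[feature] > t]
        score = rss(left) + rss(right)
        if best is None or score < best[0]:
            best = (score, t, sum(left) / len(left), sum(right) / len(right))
    if best is None:  # feature is constant in this bootstrap sample
        mean = sum(y) / len(y)
        return (feature, float("inf"), mean, mean)
    return (feature, best[1], best[2], best[3])

def predict_stump(stump, row):
    feature, t, left_mean, right_mean = stump
    return left_mean if row[feature] <= t else right_mean

def fit_forest(X, y, n_trees=200, seed=0):
    rng = random.Random(seed)
    n, p = len(X), len(X[0])
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]  # bootstrap the samples
        feature = rng.randrange(p)                  # random sampling of predictors
        forest.append(fit_stump([X[i] for i in idx], [y[i] for i in idx], feature))
    return forest

def predict_forest(forest, row):
    """Average the predictions across all trees."""
    return sum(predict_stump(s, row) for s in forest) / len(forest)

# Toy data: the response depends on predictor 0 only.
X = [[0, 1], [0, 0], [1, 1], [1, 0], [0, 1], [1, 0], [0, 0], [1, 1]]
y = [1.0, 1.1, 3.0, 3.1, 0.9, 2.9, 1.0, 3.0]
forest = fit_forest(X, y)
low = predict_forest(forest, [0, 0])
high = predict_forest(forest, [1, 0])
```

Because roughly half of the stumps split on the uninformative predictor, the averaged predictions are pulled toward the overall mean, but `high` still exceeds `low`.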
13 Benefits: Works very well in practice; Very broadly applicable; Intuitive algorithm; Robust, no overfitting; No assumptions about data; Virtually no tuning needed; Very easily parallelizable; Accounts for complex interactions between features
14 Drawbacks: Difficult interpretation; What is the underlying model?; Almost impossible to capture analytically
15 Predicting Protein-Protein Interactions: Human, Yeast. Elefsinioti et al., Molec. Cell. Prot.; Sarac et al., Bioinformatics
16 Proximity measure: Score similarities (or differences) of samples. [Figure: pairs of samples across 3 trees] very similar (together in 3/3); quite similar (together in 2/3); different (together in 0/3)
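The proximity of two samples can be computed as the fraction of trees in which they end up together. A small sketch, assuming "together" means "in the same leaf" (the usual Random Forest convention) and using made-up leaf assignments; `proximity` and `leaf_ids` are illustrative names.

```python
def proximity(leaf_ids, i, j):
    """Fraction of trees in which samples i and j land in the same leaf.

    leaf_ids[t][s] is the leaf that tree t assigns to sample s.
    """
    together = sum(1 for tree in leaf_ids if tree[i] == tree[j])
    return together / len(leaf_ids)

# Hypothetical leaf assignments for 4 samples in 3 trees.
leaf_ids = [
    [0, 0, 1, 2],  # tree 1
    [3, 3, 3, 5],  # tree 2
    [1, 1, 1, 0],  # tree 3
]
very_similar = proximity(leaf_ids, 0, 1)   # together in 3/3 trees
quite_similar = proximity(leaf_ids, 0, 2)  # together in 2/3 trees
different = proximity(leaf_ids, 0, 3)      # together in 0/3 trees
```

One minus this proximity can serve as a distance for a clustering method that accepts a precomputed dissimilarity matrix.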
17 Weighted Clustering: 1. Learn model to predict outcome (Features -> RF -> Outcome). 2. Use feature importance as weights (Proximity measure -> PAM clustering -> Clusters). Michaelson, Trump, et al., BMC Genomics
18 QTL Analysis
19 Genetic Association
ACCGTCCGACACGTTTGGACAAGTACGCTGCAACACACCCGTACCAATTTTGG
ACCGACCGACACGTTTGGACAAGTACGTTGCAACACACCCGTACCAATTTTGG
ACCGACCGACACGTTTGGACAAGTACGCTGCAACACACCCGTACCAAAATTGG
ACCGTCCCACACGTTTGGTCAAGTACGCTGCAACACACCCGTACCAATTTTGG
ACCGACCCACACGTTTGGTCAAATACGTTGCAACACACCCGTACCAATTTTGG
ACCGTCCGACACGTTTGGTCAAATACGTTGCAACACACCCGTACCAATTTTGG
ACCGTCCGACACGTTTGGTCAAGTACGCTGCAACACACCCGTACCAATTTTGG
21 Published Genome-Wide Associations through 2015. Published GWA at p ≤ 5×10⁻⁸ for 17 trait categories. NHGRI GWA Catalog
22 Quantitative Trait Loci (QTL): Sub-type of GWAS. Must have several causal loci (complex trait). Why? Standard approach: t-test. [Figure: trait values by allele, A vs. a]
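The standard single-marker approach can be sketched as one t statistic per marker, comparing trait values of individuals carrying allele A with those carrying allele a. The slide does not say which t-test variant is used; Welch's unequal-variance form is assumed here, and the genotypes, trait values, and function names are made up for illustration.

```python
import math
import statistics

def welch_t(group_a, group_b):
    """Welch's t statistic for the difference in mean trait value between two groups."""
    mean_a, mean_b = statistics.mean(group_a), statistics.mean(group_b)
    var_a, var_b = statistics.variance(group_a), statistics.variance(group_b)
    se = math.sqrt(var_a / len(group_a) + var_b / len(group_b))
    return (mean_a - mean_b) / se

def scan_markers(genotypes, trait):
    """Return one t statistic per marker; genotypes[s][m] is 'A' or 'a'."""
    stats = []
    for m in range(len(genotypes[0])):
        upper = [t for g, t in zip(genotypes, trait) if g[m] == "A"]
        lower = [t for g, t in zip(genotypes, trait) if g[m] == "a"]
        stats.append(welch_t(upper, lower))
    return stats

# Toy data: 8 individuals, 2 markers; only marker 0 affects the trait.
genotypes = ["AA", "Aa", "AA", "Aa", "aA", "aa", "aA", "aa"]
trait = [5.1, 4.9, 5.3, 5.0, 3.1, 2.9, 3.0, 3.2]
t_stats = scan_markers(genotypes, trait)
```

Each marker is tested on its own here, which is why this approach cannot account for several causal loci acting together.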
23 RF for QTL mapping: Feature Matrix: Genetic Markers -> RF -> Trait (e.g. body size). Feature importance = importance of marker
24 Pathway Consistency: Enrichment of gene pairs in same pathway. [Figure: enrichment comparison including Random Forest] Michaelson et al., BMC Genomics
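One common way to turn a fitted model into per-marker importance scores is permutation importance: shuffle one marker column and measure how much the prediction error grows. The slide does not specify which importance measure is used, so this is only an illustrative choice. In the sketch below, `predict` is a hypothetical stand-in for a trained forest, and the importance is computed on the training data rather than on out-of-bag samples, which is a simplification.

```python
import random

def mse(predict, X, y):
    """Mean squared error of a prediction function on data (X, y)."""
    return sum((predict(row) - yi) ** 2 for row, yi in zip(X, y)) / len(y)

def permutation_importance(predict, X, y, n_repeats=20, seed=0):
    """Average increase in MSE after shuffling each marker column in turn."""
    rng = random.Random(seed)
    baseline = mse(predict, X, y)
    importances = []
    for m in range(len(X[0])):
        increase = 0.0
        for _ in range(n_repeats):
            column = [row[m] for row in X]
            rng.shuffle(column)
            X_perm = [row[:m] + [v] + row[m + 1:] for row, v in zip(X, column)]
            increase += mse(predict, X_perm, y) - baseline
        importances.append(increase / n_repeats)
    return importances

# Toy data: marker 0 drives the trait, marker 1 is irrelevant.
X = [[1, 0], [1, 1], [1, 0], [1, 1], [0, 0], [0, 1], [0, 0], [0, 1]]
y = [3.1, 2.9, 3.0, 3.2, 1.1, 0.9, 1.0, 1.2]

def predict(row):
    # Hypothetical stand-in for a trained Random Forest.
    return 3.0 if row[0] == 1 else 1.0

importances = permutation_importance(predict, X, y)
```

Markers the model never uses get an importance of zero, while shuffling a causal marker degrades the predictions.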
25 RF-based QTL Mapping: Michaelson et al., BMC Syst. Biol.: Comparison using real data; Ackermann et al., PLoS ONE: Comparison using simulated data (DREAM); Picotti, Clément-Ziza et al., Nature: Extracting epistatic interactions; Ackermann et al., PLoS Genetics: Multiple cell types/conditions; Clément-Ziza et al., Molec. Systems Biol.: Non-coding genes, antisense transcription; Stephan et al., Nat. Commun.: Population substructure; Valenzano et al., Cell: Mapping traits in fish
26 Epistasis with RF: GETTING A GRIP ON RF STRUCTURE. JAKE MICHAELSON, MATHIEU CLÉMENT-ZIZA, JAN GROßBACH, CORINNA SCHMALOHR
27 What is Epistasis? Non-additive interaction between markers (predictors). [Figure: three panels of trait value by genotype combination of markers A/a and B/b, labeled epistatic, additive, epistatic]
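The difference between additive and epistatic effects can be made concrete with an interaction contrast on the four genotype means: if the effect of B is the same regardless of the allele at A, the contrast is zero. A small sketch with made-up trait means; `interaction_contrast` is an illustrative name.

```python
def interaction_contrast(means):
    """(AB - Ab) - (aB - ab): zero when B has the same effect for both alleles of A."""
    return (means["AB"] - means["Ab"]) - (means["aB"] - means["ab"])

# Additive: A adds 2 and B adds 1, independently of each other.
additive = {"AB": 4.0, "Ab": 3.0, "aB": 2.0, "ab": 1.0}
# Epistatic: the trait changes only when A and B occur together.
epistatic = {"AB": 4.0, "Ab": 1.0, "aB": 1.0, "ab": 1.0}

additive_contrast = interaction_contrast(additive)    # 0.0
epistatic_contrast = interaction_contrast(epistatic)  # 3.0
```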
28 Problem with Random Forest: to find interactions between variables, we need to know the model structure!
29 Finding Epistasis with Decision Trees [Figure: trees that split on marker A/a and then on marker B/b in each branch, for an epistatic and an additive case] Compare slopes
30 Algorithm: 1. Learn decision trees. 2. Compute slopes (differences) of trait values at splits. 3. Collect slopes for left and right sides (there will be many trees). 4. Compare distributions of slopes. Are they different? [Figure: slope distributions for left and right sides; separated under epistasis, overlapping under no epistasis]
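The four steps can be sketched under strong simplifying assumptions. Instead of inspecting real forest trees, each "tree" below is emulated by a bootstrap sample that is split on marker A first and then on marker B within each branch; the slope is the difference in mean trait between B and b carriers. The final comparison uses a simple standardized mean difference, because the slide does not say which test is used. This is not the authors' implementation, and all names and simulated data are hypothetical.

```python
import random
import statistics

def make_data(epistatic, seed=1, n_per_genotype=10):
    """Simulate (allele_a, allele_b, trait) triples with Gaussian noise."""
    rng = random.Random(seed)
    data = []
    for a in "Aa":
        for b in "Bb":
            for _ in range(n_per_genotype):
                ea = 1.0 if a == "A" else 0.0
                eb = 1.0 if b == "B" else 0.0
                trait = 2.0 * ea * eb if epistatic else ea + eb
                data.append((a, b, trait + rng.gauss(0, 0.2)))
    return data

def slope_of_b(samples):
    """Step 2: difference in mean trait between B and b carriers within one node."""
    with_b = [t for _, b, t in samples if b == "B"]
    without_b = [t for _, b, t in samples if b == "b"]
    if not with_b or not without_b:
        return None
    return sum(with_b) / len(with_b) - sum(without_b) / len(without_b)

def collect_slopes(data, n_trees=200, seed=0):
    """Steps 1 and 3: one emulated tree per bootstrap sample; collect slopes per side."""
    rng = random.Random(seed)
    left, right = [], []
    for _ in range(n_trees):
        boot = [rng.choice(data) for _ in data]
        s_left = slope_of_b([s for s in boot if s[0] == "A"])   # left side: A
        s_right = slope_of_b([s for s in boot if s[0] == "a"])  # right side: a
        if s_left is not None and s_right is not None:
            left.append(s_left)
            right.append(s_right)
    return left, right

def slope_difference(left, right):
    """Step 4: standardized difference of the two slope distributions.

    A descriptive score, not a calibrated test; assumes the slopes vary
    across resamples so that the spread is nonzero.
    """
    spread = (statistics.pvariance(left) + statistics.pvariance(right)) ** 0.5
    return (statistics.mean(left) - statistics.mean(right)) / spread

score_epistatic = slope_difference(*collect_slopes(make_data(epistatic=True)))
score_additive = slope_difference(*collect_slopes(make_data(epistatic=False)))
```

With epistasis the slope of B depends on the allele at A, so the two slope distributions separate and the score is large; in the additive case they overlap and the score stays near zero.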
31 Validation on real data (Saccharomyces cerevisiae) [Figure: ROC curve (True Positive Rate vs. False Positive Rate) and Precision vs. Recall curve, comparing RF and ANOVA] Using Costanzo et al for validation
32 Why is RF better? [Figure: panels A, B, C]
33 Random Forest: Extremely versatile; Robust; Can analyse structure
34 Acknowledgements. People: Jan Großbach, Johannes Stephan, Mathieu Clément-Ziza, Corinna Schmalohr, Oliver Stegle (EBI), Ruedi Aebersold (ETH), Paola Picotti, Jürg Bähler (UCL), Sam Marguerat, Xavi Masellach, Chris Workman (DTU), Manos Papadakis. Money: SystemsX.ch (The Swiss Initiative in Systems Biology)