Lecture 6: Decision Tree, Random Forest, and Boosting
Tuo Zhao
Schools of ISyE and CSE, Georgia Tech
Decision Trees
Decision Trees
Decision trees have a long history in machine learning.
The first popular algorithm dates back to 1979.
Very popular in many real-world problems.
Intuitive to understand.
Easy to build.
History
1961: EPAM (Elementary Perceiver and Memorizer), a cognitive simulation model of human concept learning.
1966: CLS, an early algorithm for decision tree construction.
1979: ID3, based on information theory.
1993: C4.5, improved over ID3.
Decision trees also have a history in statistics as CART (Classification and Regression Trees).
Motivation
How do people make decisions?
  Consider a variety of factors.
  Follow a logical path of checks.
Should I eat at this restaurant?
  If there is no wait: Yes.
  If there is a short wait and I am hungry: Yes.
  Else: No.
Decision Flowchart
(Figure: flowchart for the restaurant decision above.)
Example: Should We Play Tennis?
If temperature is not hot: Play tennis.
If outlook is overcast: Play tennis.
Otherwise: Don't play tennis.
Decision Tree
(Figure: the play-tennis decision tree.)
Structure
A decision tree consists of:
  Nodes: tests for variables.
  Branches: results of tests.
  Leaves: classifications.
Function Class
What functions can decision trees model?
Non-linear: a very powerful function class.
A decision tree can encode any Boolean function.
Proof: given a truth table for a function, construct a path in the tree for each row of the table; given a row as input, follow that path to the desired leaf (output).
Problem: exponentially large trees!
Smaller Trees
Can we produce smaller decision trees for functions?
Yes (possible), when there are a few key decision factors.
Counterexamples:
  Parity function: return 1 on even inputs, 0 on odd inputs.
  Majority function: return 1 if more than half of the inputs are 1.
Bias-variance tradeoff:
  Bias: representation power of decision trees.
  Variance: requires a sample size exponential in the depth.
Fitting Decision Trees
What Makes a Good Tree?
A small tree:
  Occam's razor: simpler is better.
  Avoids over-fitting.
A decision tree may be human readable, but it need not use human logic!
How do we build small trees that accurately capture the data?
Learning an optimal decision tree is computationally intractable.
Greedy Algorithm
We can get good trees with simple greedy algorithms; adjustments are usually to fix problems caused by greedy selection.
Recursive procedure:
  Select the best variable and generate child nodes, one for each possible value.
  Partition the samples using the possible values, and assign these subsets of samples to the child nodes.
  Repeat for each child node until all samples associated with a node are either all positive or all negative.
Variable Selection
The best variable for a partition is the most informative variable: select the variable that is most informative about the labels.
Classification error as the criterion?
Example: P(Y = 1 | X_1 = 1) = 0.75 vs. P(Y = 1 | X_2 = 1) = 0.55. Which to choose?
Information Theory
The quantification of information, founded by Claude Shannon.
Simple concepts:
  Entropy: H(X) = - Σ_x P(X = x) log P(X = x)
  Conditional entropy: H(Y | X) = Σ_x P(X = x) H(Y | X = x)
  Information gain: IG(Y | X) = H(Y) - H(Y | X)
Select the variable with the highest information gain.
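These three quantities are easy to estimate from label counts. Below is a minimal sketch in Python/NumPy (the helper names entropy, conditional_entropy, and information_gain are our own, not from the lecture), using empirical frequencies for the probabilities:

```python
import numpy as np

def entropy(y):
    """Empirical entropy H(Y) in bits, estimated from label frequencies."""
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return float(-np.sum(p * np.log2(p)))

def conditional_entropy(x, y):
    """H(Y | X) = sum over x of P(X = x) * H(Y | X = x)."""
    return float(sum((x == v).mean() * entropy(y[x == v]) for v in np.unique(x)))

def information_gain(x, y):
    """IG(Y | X) = H(Y) - H(Y | X)."""
    return entropy(y) - conditional_entropy(x, y)
```

At each node, the greedy algorithm would evaluate information_gain for every remaining feature and split on the best one.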
Example: Should We Play Tennis?
H(Tennis) = -3/5 log2(3/5) - 2/5 log2(2/5) = 0.971
H(Tennis | Out. = Sunny) = -2/2 log2(2/2) - 0/2 log2(0/2) = 0
H(Tennis | Out. = Overcast) = -1/1 log2(1/1) - 0/1 log2(0/1) = 0
H(Tennis | Out. = Rainy) = -0/2 log2(0/2) - 2/2 log2(2/2) = 0
H(Tennis | Out.) = 2/5 · 0 + 1/5 · 0 + 2/5 · 0 = 0
(with the convention 0 · log2 0 = 0)
Example: Should We Play Tennis?
IG(Tennis | Out.) = 0.971 - 0 = 0.971
If we knew the Outlook, we'd be able to predict Tennis!
Outlook is the variable to pick for our decision tree.
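As a sanity check, the same numbers fall out of a couple of lines of Python (only the counts from the slide are used: 3 of 5 days play tennis, and the labels are pure within each Outlook value):

```python
import numpy as np

# H(Tennis): 3 of 5 days are "play", 2 are "don't play".
h_tennis = -(3/5) * np.log2(3/5) - (2/5) * np.log2(2/5)      # ~0.971

# Within each Outlook value the labels are pure, so each conditional entropy is 0
# (convention: 0 * log2(0) = 0), hence H(Tennis | Out.) = 2/5*0 + 1/5*0 + 2/5*0 = 0.
h_cond = 0.0

print(h_tennis, h_tennis - h_cond)                            # IG(Tennis | Out.) ~0.971
```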
When to Stop Training?
All data have the same label: return that label.
No examples: return the majority label of all data.
No further splits possible: return the majority label of the passed data.
What if max IG = 0?
Example: No Information Gain?
(Table of Y against X_1 and X_2 omitted.)
Both features give IG = 0.
But once we divide the data, we get perfect classification!
We need a little exploration sometimes (an unstable equilibrium).
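Putting the greedy recursion, the information-gain criterion, and the stopping rules together gives something like the sketch below. This is a minimal ID3-style builder for categorical features, with our own function names and no pruning; note that it stops as soon as max IG = 0, which is exactly the case the example above warns about.

```python
import numpy as np
from collections import Counter

def entropy(y):
    p = np.array(list(Counter(y).values()), dtype=float) / len(y)
    return -np.sum(p * np.log2(p))

def info_gain(x, y):
    return entropy(y) - sum((x == v).mean() * entropy(y[x == v]) for v in np.unique(x))

def build_tree(X, y, features):
    """X: (n, d) array of categorical features; y: (n,) labels; features: list of column indices."""
    majority = Counter(y).most_common(1)[0][0]
    # Stopping rules: pure node, or no features left to split on.
    if len(np.unique(y)) == 1 or not features:
        return majority
    gains = {j: info_gain(X[:, j], y) for j in features}
    best = max(gains, key=gains.get)
    # Stopping rule: no split improves the node (max IG = 0).
    if gains[best] <= 0:
        return majority
    branches = {}
    for v in np.unique(X[:, best]):
        mask = X[:, best] == v
        branches[v] = build_tree(X[mask], y[mask], [j for j in features if j != best])
    return (best, branches)   # internal node: (split feature, value -> subtree)
```

A leaf is represented by a bare label and an internal node by a (feature, branches) pair; prediction just follows the branches until a label is reached.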
Decision Tree with Continuous Features
Decision stump: threshold for binary classification: check X_j ≥ δ_j or X_j < δ_j.
More than two classes: threshold for three classes: check X_j ≥ δ_j, γ_j ≤ X_j < δ_j, or X_j < γ_j.
Or decompose one node into two nodes:
  First node: check X_j ≥ δ_j or X_j < δ_j;
  Second node: if X_j < δ_j, check X_j ≥ γ_j or X_j < γ_j.
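In practice the threshold δ_j is chosen by scanning candidate split points and keeping the one with the best criterion. A minimal sketch (our own helper, scoring splits by information gain for 0/1 labels):

```python
import numpy as np

def entropy(y):
    p = np.bincount(y) / len(y)
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def best_stump_threshold(x, y):
    """Scan midpoints between sorted feature values; return (best threshold, best IG)."""
    best_delta, best_gain = None, 0.0
    xs = np.unique(x)
    for delta in (xs[:-1] + xs[1:]) / 2:                       # candidate thresholds
        left, right = y[x < delta], y[x >= delta]
        h_cond = (len(left) * entropy(left) + len(right) * entropy(right)) / len(y)
        gain = entropy(y) - h_cond
        if gain > best_gain:
            best_delta, best_gain = delta, gain
    return best_delta, best_gain

# Toy usage: the best split separates the two classes at ~1.5 with IG = 1 bit.
x = np.array([0.2, 0.5, 1.1, 1.9, 2.5, 3.0])
y = np.array([0, 0, 0, 1, 1, 1])
print(best_stump_threshold(x, y))
```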
Decision Tree for Spam Classification
(Figure: a large classification tree fit to the SPAM data; splits use features such as ch$, remove, hp, ch!, george, CAPAVE, free, business, edu, and CAPMAX.)
Decision Tree for Spam Classification
(Figure: ROC curve for the pruned tree on the SPAM data.)
Overall error rate on test data: 8.7%.
The ROC curve is obtained by varying the threshold c of the classifier: C(X) = +1 if P̂(+1 | X) > c.
Sensitivity: proportion of true spam identified.
Specificity: proportion of true email identified.
We may want specificity to be high and suffer some spam: Specificity 95% => Sensitivity 79%.
Decision Tree vs. SVM
(Figure: ROC curves for TREE vs. SVM on the SPAM data. SVM error: 6.7%; TREE error: 8.7%.)
Comparing ROC curves on the test data is a good way to compare classifiers; SVM dominates TREE here.
SVM outperforms the decision tree.
AUC: Area Under the Curve.
Bagging and Random Forest
Toy Example: No Noise
(Figure: scatter plot of the two classes in (X_1, X_2).)
Deterministic problem; the noise comes from the sampling distribution of X.
Use a training sample of size 200.
Here the Bayes error is 0%.
Nonlinearly separable data.
Optimal decision boundary: X_1^2 + X_2^2 = 1.
Toy Example: Decision Tree
(Figures: the fitted classification tree and its decision boundary on the toy data; error rate 0.073.)
When the data are in 10 dimensions, classification trees produce a rather noisy and inaccurate rule Ĉ(X), with error rates around 30%.
Sample size: 200.
7 branching nodes; 6 layers.
Classification error: 7.3% when d = 2; > 30% when d = 10.
Model Averaging
Decision trees can be simple, but often produce noisy (bushy) or weak (stunted) classifiers.
Bagging (Breiman, 1996): Fit many large trees to bootstrap-resampled versions of the training data, and classify by majority vote.
Boosting (Freund & Schapire, 1996): Fit many large or small trees to reweighted versions of the training data, and classify by weighted majority vote.
Random Forests (Breiman, 1999): A fancier version of bagging.
In general: Boosting > Random Forests > Bagging > Single Tree.
Bagging / Bootstrap Aggregation
Motivation: average a given procedure over many samples to reduce its variance.
Given a classifier C(S, x), based on our training data S, producing a predicted class label at input point x.
To bag C, we draw bootstrap samples S_1, ..., S_B, each of size N, with replacement from the training data. Then
  Ĉ_Bag(x) = Majority Vote {C(S_b, x)}, b = 1, ..., B.
Bagging can dramatically reduce the variance of unstable procedures (like trees), leading to improved prediction.
However, all simple structures in a tree are lost.
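The procedure is only a few lines. A minimal sketch, assuming scikit-learn's DecisionTreeClassifier as the base learner and binary 0/1 labels (the helper names are our own):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def fit_bagged_trees(X, y, B=100, seed=0):
    """Fit B deep trees, each on a bootstrap sample (size N, drawn with replacement)."""
    rng = np.random.default_rng(seed)
    trees = []
    for _ in range(B):
        idx = rng.integers(0, len(X), size=len(X))     # bootstrap sample S_b
        trees.append(DecisionTreeClassifier().fit(X[idx], y[idx]))
    return trees

def bagged_predict(trees, X):
    """Majority vote over the B trees."""
    votes = np.mean([t.predict(X) for t in trees], axis=0)
    return (votes >= 0.5).astype(int)
```

On the toy circle data, averaging the bootstrapped trees smooths the staircase boundary of a single tree, which is what the next two slides illustrate.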
Toy Example: Bagging Trees
(Figure: the original tree and five bootstrap trees.)
2 branching nodes; 2 layers.
5 dependent trees.
Toy Example: Bagging Trees
(Figure: decision boundary of the bagged trees; error rate 0.032.)
Bagging averages many trees and produces smoother decision boundaries.
Classification error: 3.2% (single deeper tree: 7.3%).
Random Forest
Bag features and samples simultaneously:
At each tree split, a random sample of m features is drawn, and only those m features are considered for splitting.
Typically m = √d or log2(d), where d is the number of features.
For each tree grown on a bootstrap sample, the error rate for observations left out of the bootstrap sample is monitored. This is called the out-of-bag error rate.
Random forests try to improve on bagging by de-correlating the trees. Each tree has the same expectation.
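With a standard library this is a few lines. A hedged sketch using scikit-learn (the synthetic dataset is just a stand-in; the lecture's SPAM result uses R's randomForest with 500 trees):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Toy stand-in dataset.
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# 500 trees, m = sqrt(d) features tried at each split, out-of-bag error monitored.
rf = RandomForestClassifier(n_estimators=500, max_features="sqrt",
                            oob_score=True, random_state=0)
rf.fit(X, y)
print("OOB error rate:", 1 - rf.oob_score_)
```

The de-correlation comes entirely from max_features: with max_features=None every tree considers all features at every split, and the procedure reduces to plain bagging.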
Random Forest for Spam Classification
(Figure: ROC curves for TREE, SVM, and Random Forest on the SPAM data. Random Forest error: 5.0%; SVM error: 6.7%; TREE error: 8.7%.)
Random Forest dominates both other methods on the SPAM data: 5.0% error.
Used 500 trees with default settings for the randomForest package in R.
RF outperforms SVM.
Random Forest: Variable Importance
(Figure: relative importance scores for the SPAM features; the most important include !, $, hp, remove, free, CAPAVE, your, and CAPMAX.)
Each bootstrap sample leaves out roughly 1/3 of the observations (the out-of-bag samples).
Permute each variable in turn to check how important it is.
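The permutation idea is easy to sketch directly: shuffle one column at a time on out-of-bag (or validation) data and measure how much accuracy drops. A minimal version, assuming a fitted classifier `model` with a predict method (the names are our own):

```python
import numpy as np

def permutation_importance(model, X_val, y_val, n_repeats=5, seed=0):
    """Mean accuracy drop when each feature is shuffled; larger drop = more important feature."""
    rng = np.random.default_rng(seed)
    base_acc = np.mean(model.predict(X_val) == y_val)
    importance = np.zeros(X_val.shape[1])
    for j in range(X_val.shape[1]):
        drops = []
        for _ in range(n_repeats):
            X_perm = X_val.copy()
            X_perm[:, j] = rng.permutation(X_perm[:, j])   # break the link between feature j and y
            drops.append(base_acc - np.mean(model.predict(X_perm) == y_val))
        importance[j] = np.mean(drops)
    return importance
```

scikit-learn ships an equivalent helper (sklearn.inspection.permutation_importance) if you prefer not to roll your own.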
Boosting
Boosting
(Figure: boosting schematic, with classifiers C_1(x), C_2(x), C_3(x), ..., C_M(x), each fit to a re-weighted version of the training sample.)
Average many trees, each grown to re-weighted versions of the training data (iteratively).
Final classifier is a weighted average of the classifiers:
  C(x) = sign[ Σ_{m=1}^M α_m C_m(x) ]
AdaBoost vs. Bagging
(Figure: test error vs. number of terms for Bagging and AdaBoost.)
2000 points from Nested Spheres in R^10; the Bayes error rate is 0%.
Trees are grown best-first without pruning.
The leftmost point is a single tree.
Each classifier for bagging is a tree with a single node (a decision stump).
AdaBoost
AdaBoost (Freund & Schapire, 1996):
1. Initialize the observation weights w_i = 1/N, i = 1, 2, ..., N.
2. For m = 1 to M repeat steps (a)-(d):
   (a) Fit a classifier C_m(x) to the training data using weights w_i.
   (b) Compute the weighted error of the newest tree:
       err_m = Σ_{i=1}^N w_i I(y_i ≠ C_m(x_i)) / Σ_{i=1}^N w_i.
   (c) Compute α_m = log[(1 - err_m) / err_m].
   (d) Update the weights for i = 1, ..., N:
       w_i ← w_i · exp[α_m I(y_i ≠ C_m(x_i))],
       and renormalize so that the w_i sum to 1.
3. Output C(x) = sign[ Σ_{m=1}^M α_m C_m(x) ].
The w_i are the weights of the samples; err_m is the weighted training error.
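These steps translate directly into code. A hedged sketch with scikit-learn decision stumps as the weak learners (labels are assumed to be in {-1, +1}; the function names are our own):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_fit(X, y, M=200):
    """AdaBoost with depth-1 trees (stumps); y must take values in {-1, +1}."""
    N = len(y)
    w = np.full(N, 1.0 / N)                       # step 1: uniform weights
    stumps, alphas = [], []
    for _ in range(M):                            # step 2
        stump = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=w)   # (a)
        miss = (stump.predict(X) != y).astype(float)
        err = np.sum(w * miss) / np.sum(w)        # (b) weighted error
        err = np.clip(err, 1e-10, 1 - 1e-10)      # guard against log(0) / division by zero
        alpha = np.log((1 - err) / err)           # (c)
        w = w * np.exp(alpha * miss)              # (d) up-weight the misclassified points
        w = w / np.sum(w)                         #     renormalize to sum to 1
        stumps.append(stump)
        alphas.append(alpha)
    return stumps, alphas

def adaboost_predict(stumps, alphas, X):
    """Step 3: sign of the weighted vote."""
    return np.sign(sum(a * s.predict(X) for s, a in zip(stumps, alphas)))
```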
AdaBoost
(Figure: test error vs. boosting iterations, comparing a single stump, a larger single tree, and boosted stumps on the nested-spheres problem.)
A stump is a two-node tree, obtained after a single split.
Boosting stumps works remarkably well on the nested-spheres problem.
The ensemble of classifiers is much more efficient than a simple combination of classifiers.
Overfitting of AdaBoost
(Figure: training and test error vs. number of terms.)
Nested spheres in 10 dimensions; the Bayes error is 0%.
Boosting drives the training error to zero.
Further iterations continue to improve the test error in many examples.
AdaBoost is often observed to be robust to overfitting.
Overfitting of AdaBoost: Noisy Problems
(Figure: training and test error vs. number of terms, with the Bayes error marked.)
Nested Gaussians in 10 dimensions; the Bayes error is 25%.
Boosting with stumps.
Optimal Bayes classification error: 25%.
Here the test error does increase, but quite slowly.