Lecture 6: Decision Tree, Random Forest, and Boosting

Similar documents
Decision Tree Learning. Richard McAllister. Outline. Overview. Tree Construction. Case Study: Determinants of House Price. February 4, / 31

Today. Last time. Lecture 5: Discrimination (cont) Jane Fridlyand. Oct 13, 2005

Copyright 2013, SAS Institute Inc. All rights reserved.

CS6716 Pattern Recognition

Ensemble Modeling. Toronto Data Mining Forum November 2017 Helen Ngo

Prediction and Interpretation for Machine Learning Regression Methods

Supervised Learning Using Artificial Prediction Markets

Using Decision Tree to predict repeat customers

Decision Trees And Random Forests A Visual Introduction For Beginners

Ranking Potential Customers based on GroupEnsemble method

Startup Machine Learning: Bootstrapping a fraud detection system. Michael Manapat

Data Science in a pricing process

Applications of Machine Learning to Predict Yelp Ratings

Airline Itinerary Choice Modeling Using Machine Learning

Introduction to Random Forests for Gene Expression Data. Utah State University Spring 2014 STAT 5570: Statistical Bioinformatics Notes 3.

References. Introduction to Random Forests for Gene Expression Data. Machine Learning. Gene Profiling / Selection

Progress Report: Predicting Which Recommended Content Users Click Stanley Jacob, Lingjie Kong

Random Forests. Parametrization and Dynamic Induction

A Brief History. Bootstrapping. Bagging. Boosting (Schapire 1989) Adaboost (Schapire 1995)

Salford Predictive Modeler. Powerful machine learning software for developing predictive, descriptive, and analytical models.

Tree Depth in a Forest

3 Ways to Improve Your Targeted Marketing with Analytics

Proposal for ISyE6416 Project

Machine Learning Logistic Regression Hamid R. Rabiee Spring 2015

Churn Prediction for Game Industry Based on Cohort Classification Ensemble

Forecasting Intermittent Demand Patterns with Time Series and Machine Learning Methodologies

Random Forests. Parametrization, Tree Selection and Dynamic Induction

PREDICTION OF CONCRETE MIX COMPRESSIVE STRENGTH USING STATISTICAL LEARNING MODELS

Linear model to forecast sales from past data of Rossmann drug Store

SPM 8.2. Salford Predictive Modeler

Copyr i g ht 2012, SAS Ins titut e Inc. All rights res er ve d. ENTERPRISE MINER: ANALYTICAL MODEL DEVELOPMENT

Logistic Regression and Decision Trees

WE consider the general ranking problem, where a computer

CS229 Project Report Using Newspaper Sentiments to Predict Stock Movements Hao Yee Chan Anthony Chow

Automatic Ranking by Extended Binary Classification

Comparative Study of Gaussian and Z- shaped curve membership function for Fuzzy Classification Approach

Indicator Development for Forest Governance. IFRI Presentation

Decision Support Systems

Rank hotels on Expedia.com to maximize purchases

Model Selection, Evaluation, Diagnosis

Sales prediction in online banking

Metaheuristics. Approximate. Metaheuristics used for. Math programming LP, IP, NLP, DP. Heuristics

Predicting Reddit Post Popularity Via Initial Commentary by Andrei Terentiev and Alanna Tempest

ADVANCED DATA ANALYTICS

Methodological challenges of Big Data for official statistics

Application research on traffic modal choice based on decision tree algorithm Zhenghong Peng 1,a, Xin Luan 2,b

From Ordinal Ranking to Binary Classification

From Ordinal Ranking to Binary Classification

BUSINESS DATA MINING (IDS 572) Please include the names of all team-members in your write up and in the name of the file.

Customer Relationship Management in marketing programs: A machine learning approach for decision. Fernanda Alcantara

New restaurants fail at a surprisingly

Insights from the Wikipedia Contest

Text Categorization. Hongning Wang

Text Categorization. Hongning Wang

Accelerating battery development via early prediction of cell lifetime

Utilizing Optimization Techniques to Enhance Cost and Schedule Risk Analysis

BOOSTED TREES FOR ECOLOGICAL MODELING AND PREDICTION

Machine learning in neuroscience

Predicting Restaurants Rating And Popularity Based On Yelp Dataset

Active Learning for Logistic Regression

Applying Regression Techniques For Predictive Analytics Paviya George Chemparathy

ECONOMIC MODELLING & MACHINE LEARNING

From Ordinal Ranking to Binary Classification

Detecting gene gene interactions that underlie human diseases

Detecting gene gene interactions that underlie human diseases

Predicting ratings of peer-generated content with personalized metrics

In silico prediction of novel therapeutic targets using gene disease association data

Machine Learning Models for Sales Time Series Forecasting

From Ordinal Ranking to Binary Classification

Strength in numbers? Modelling the impact of businesses on each other

When to Book: Predicting Flight Pricing

Enabling Foresight. Skills for Predictive Analytics

Prediction from Blog Data

Predictive Analytics

Advances in Machine Learning for Credit Card Fraud Detection

Dynamic Advisor-Based Ensemble (dynabe): Case Study in Stock Trend Prediction of Critical Metal Companies

Study on Customer Loyalty Prediction Based on RF Algorithm

Project Category: Life Sciences SUNet ID: bentyeh

Big Data. Methodological issues in using Big Data for Official Statistics

Accurate Campaign Targeting Using Classification Algorithms

Determining NDMA Formation During Disinfection Using Treatment Parameters Introduction Water disinfection was one of the biggest turning points for

E-Commerce Sales Prediction Using Listing Keywords

Three lessons learned from building a production machine learning system. Michael Manapat

Stock Price Prediction with Daily News

Predicting prokaryotic incubation times from genomic features Maeva Fincker - Final report

C-14 FINDING THE RIGHT SYNERGY FROM GLMS AND MACHINE LEARNING. CAS Annual Meeting November 7-10

FINAL PROJECT REPORT IME672. Group Number 6

Acknowledgments. Data Mining. Examples. Overview. Colleagues

Chapter 3 Top Scoring Pair Decision Tree for Gene Expression Data Analysis

Predicting Corporate Influence Cascades In Health Care Communities

Application of Decision Trees in Mining High-Value Credit Card Customers

CSE 255 Lecture 3. Data Mining and Predictive Analytics. Supervised learning Classification

Intro Logistic Regression Gradient Descent + SGD

Credit Card Marketing Classification Trees

Predictive Modeling Using SAS Visual Statistics: Beyond the Prediction

Predictive Modeling using SAS. Principles and Best Practices CAROLYN OLSEN & DANIEL FUHRMANN

Advanced Analytics through the credit cycle

Airbnb Price Estimation. Hoormazd Rezaei SUNet ID: hoormazd. Project Category: General Machine Learning gitlab.com/hoorir/cs229-project.

Watts App: An Energy Analytics and Demand-Response Advisor Tool

Gene Expression Classification: Decision Trees vs. SVMs

Transcription:

Tuo Zhao, Schools of ISyE and CSE, Georgia Tech


Decision Trees

- Decision trees have a long history in machine learning.
- The first popular algorithm dates back to 1979.
- Very popular in many real-world problems: intuitive to understand and easy to build.

History
- EPAM (Elementary Perceiver and Memorizer), 1961: a cognitive simulation model of human concept learning.
- 1966: CLS, an early algorithm for decision tree construction.
- 1979: ID3, based on information theory.
- 1993: C4.5, an improvement over ID3.
- Decision trees also have a history in statistics as CART (Classification and Regression Trees).

Motivation
How do people make decisions?
- Consider a variety of factors.
- Follow a logical path of checks.
Example: should I eat at this restaurant?
- If there is no wait: yes.
- If there is a short wait and I am hungry: yes.
- Otherwise: no.

Decision Flowchart (figure)

Example: Should We Play Tennis?
- If the temperature is not hot: play tennis.
- If the outlook is overcast: play tennis.
- Otherwise: don't play tennis.

Decision Tree (figure)

Structure
A decision tree consists of:
- Nodes: tests on variables.
- Branches: the results of those tests.
- Leaves: the classifications.

Function Class
What functions can decision trees model?
- Non-linear functions: a very powerful function class.
- A decision tree can encode any Boolean function.
Proof: given a truth table for the function, construct a path in the tree for each row of the table; given a row as input, follow that path to the desired leaf (output).
Problem: this construction produces exponentially large trees!

Smaller Trees
Can we produce smaller decision trees for such functions?
- Yes, this is possible when a few key decision factors drive the output.
- Counterexamples:
  - Parity function: return 1 on even inputs, 0 on odd inputs.
  - Majority function: return 1 if more than half of the inputs are 1.
Bias-variance tradeoff:
- Bias: the representation power of decision trees.
- Variance: deep trees require a sample size exponential in the depth.

Fitting Decision Trees

What Makes a Good Tree?
- A small tree: Occam's razor says simpler is better, and small trees avoid over-fitting.
- A decision tree may be human readable, yet not use human logic!
- How do we build small trees that accurately capture the data?
- Learning an optimal decision tree is computationally intractable.

Greedy Algorithm
We can get good trees with simple greedy algorithms; adjustments are usually made to fix problems caused by greedy selection.
Recursive procedure (see the sketch below):
- Select the best variable and generate one child node for each of its possible values;
- Partition the samples by those values and assign each subset to the corresponding child node;
- Repeat for each child node until all samples associated with a node are either all positive or all negative.
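A minimal Python sketch of this recursion, assuming binary labels and categorical features stored as dicts; the names build_tree, majority_label, and the pluggable score_fn are illustrative choices, not from the slides. The scoring function (e.g., information gain, defined on the next slide) is supplied by the caller, and the tree is returned as nested dicts.

```python
# Sketch of the greedy, recursive construction described above (illustrative only).
from collections import Counter


def majority_label(labels):
    """Return the most common label in a list of labels."""
    return Counter(labels).most_common(1)[0][0]


def build_tree(samples, labels, variables, score_fn):
    """Recursively grow a decision tree represented as nested dicts.

    samples   -- list of dicts mapping variable name -> categorical value
    labels    -- list of labels, aligned with samples
    variables -- names of variables still available for splitting
    score_fn  -- score_fn(variable, samples, labels) -> float (higher is better)
    """
    # Stop when the node is pure or no variables remain to split on.
    if len(set(labels)) == 1 or not variables:
        return {"leaf": majority_label(labels)}

    # Greedy step: pick the single best variable according to score_fn.
    best = max(variables, key=lambda v: score_fn(v, samples, labels))

    # Partition the samples by the values of the chosen variable and recurse.
    children = {}
    for value in set(s[best] for s in samples):
        subset = [(s, y) for s, y in zip(samples, labels) if s[best] == value]
        sub_samples, sub_labels = map(list, zip(*subset))
        remaining = [v for v in variables if v != best]
        children[value] = build_tree(sub_samples, sub_labels, remaining, score_fn)
    return {"split_on": best, "children": children,
            "default": majority_label(labels)}
```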

Variable Selection
- The best variable for the partition is the most informative variable: select the variable that is most informative about the labels.
- Should we use classification error? Example: P(Y = 1 | X_1 = 1) = 0.75 vs. P(Y = 1 | X_2 = 1) = 0.55. Which should we choose?

Information Theory
- The quantification of information, founded by Claude Shannon.
- Simple concepts:
  - Entropy: H(X) = - Σ_x P(X = x) log P(X = x)
  - Conditional entropy: H(Y | X) = Σ_x P(X = x) H(Y | X = x)
  - Information gain: IG(Y | X) = H(Y) - H(Y | X)
- Select the variable with the highest information gain.
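These quantities translate directly into code. A small sketch using empirical frequencies and base-2 logarithms (the function names are illustrative):

```python
# Entropy, conditional entropy, and information gain estimated from data.
import math
from collections import Counter


def entropy(labels):
    """H(Y) = -sum_y P(Y=y) log2 P(Y=y), estimated from a list of labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def conditional_entropy(feature_values, labels):
    """H(Y | X) = sum_x P(X=x) H(Y | X=x)."""
    n = len(labels)
    h = 0.0
    for x in set(feature_values):
        subset = [y for v, y in zip(feature_values, labels) if v == x]
        h += (len(subset) / n) * entropy(subset)
    return h


def information_gain(feature_values, labels):
    """IG(Y | X) = H(Y) - H(Y | X)."""
    return entropy(labels) - conditional_entropy(feature_values, labels)
```

If desired, this criterion plugs into the build_tree sketch above via score_fn=lambda v, S, y: information_gain([s[v] for s in S], y).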

Example: Should We Play Tennis?
(Base-2 logarithms, with the convention 0 · log2(0) = 0.)
H(Tennis) = - 3/5 log2(3/5) - 2/5 log2(2/5) ≈ 0.97
H(Tennis | Out. = Sunny) = - 2/2 log2(2/2) - 0/2 log2(0/2) = 0
H(Tennis | Out. = Overcast) = - 1/1 log2(1/1) - 0/1 log2(0/1) = 0
H(Tennis | Out. = Rainy) = - 0/2 log2(0/2) - 2/2 log2(2/2) = 0
H(Tennis | Out.) = 2/5 · 0 + 1/5 · 0 + 2/5 · 0 = 0

Example: Should We Play Tennis?
IG(Tennis | Out.) = 0.97 - 0 = 0.97
If we knew the Outlook, we'd be able to predict Tennis!
Outlook is the variable to pick for our decision tree.
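A quick numerical check of these values, reusing the entropy, conditional_entropy, and information_gain sketches above. The five (Outlook, Play) pairs are an assumption reconstructed from the counts implied by the slide, not the original table:

```python
# Numerical check of the worked example (assumed 5-row toy table).
outlook = ["Sunny", "Sunny", "Overcast", "Rainy", "Rainy"]
play    = ["Yes",   "Yes",   "Yes",      "No",    "No"]

print(entropy(play))                        # ~0.971 -> H(Tennis)
print(conditional_entropy(outlook, play))   # 0.0    -> H(Tennis | Out.)
print(information_gain(outlook, play))      # ~0.971 -> IG(Tennis | Out.)
```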

When to Stop Training?
- All data have the same label: return that label.
- No examples remain: return the majority label of all the data.
- No further splits are possible: return the majority label of the data passed to the node.
- What if the maximum information gain is 0?

Example: No Information Gain?
(The slide's truth table of Y, X_1, X_2 is not reproduced here; the classic XOR-like pattern behaves this way.)
- Both features give 0 information gain at the root.
- Yet once we divide the data, we get perfect classification!
- We sometimes need a little exploration (an unstable equilibrium).

Decision Tree with Continuous Features: Decision Stump
- Threshold for binary classification: check X_j ≥ δ_j or X_j < δ_j.
- More than two classes, e.g., a threshold rule for three classes: check X_j ≥ δ_j, γ_j ≤ X_j < δ_j, or X_j < γ_j.
- Equivalently, decompose one node into two nodes:
  - First node: check X_j ≥ δ_j or X_j < δ_j;
  - Second node: if X_j < δ_j, check X_j ≥ γ_j or X_j < γ_j.
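A minimal sketch of fitting a single stump on one continuous feature by scanning candidate thresholds; fit_stump and the toy arrays are illustrative, and labels are assumed to be in {-1, +1}:

```python
# Decision stump for one continuous feature X_j and labels in {-1, +1}:
# scan candidate thresholds and keep the (threshold, polarity) with lowest error.
import numpy as np


def fit_stump(x, y):
    """Return (threshold, polarity, error) minimizing the training error of
    predict(x) = polarity * sign(x - threshold)."""
    best = (None, 1, np.inf)
    xs = np.sort(np.unique(x))
    # Candidate thresholds: midpoints between consecutive sorted values.
    candidates = (xs[:-1] + xs[1:]) / 2 if len(xs) > 1 else xs
    for thr in candidates:
        for polarity in (+1, -1):
            pred = polarity * np.sign(x - thr)
            err = np.mean(pred != y)
            if err < best[2]:
                best = (thr, polarity, err)
    return best


x = np.array([0.3, 1.2, 2.5, 3.1, 4.8])
y = np.array([-1, -1, +1, +1, +1])
print(fit_stump(x, y))   # threshold 1.85, polarity +1, error 0.0
```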

Decision Tree for Spam Classification
(Figure from Trevor Hastie, Stanford University: a classification tree grown on the SPAM data, splitting on features such as ch$, remove, hp, ch!, george, CAPAVE, free, business, and edu.)

Decision Tree for Spam Classification
(ROC curve for the pruned tree on the SPAM data; figure from Trevor Hastie, Stanford University. TREE error: 8.7%.)
- Overall error rate on the test data: 8.7%.
- The ROC curve is obtained by varying the threshold c of the classifier: C(X) = +1 if P̂(+1 | X) > c.
- Sensitivity: the proportion of true spam identified.
- Specificity: the proportion of true email identified.
- We may want specificity to be high and suffer some spam: at Specificity = 95%, Sensitivity = 79%.
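As a rough illustration of how such a curve is produced by sweeping the threshold c over P̂(+1 | X), here is a hedged sklearn sketch; synthetic data stand in for the SPAM set, so the numbers will not match the slide:

```python
# Sketch: ROC curve and AUC for a decision tree by thresholding P_hat(+1 | x).
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import roc_curve, roc_auc_score

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=0)

tree = DecisionTreeClassifier(max_leaf_nodes=20, random_state=0).fit(X_tr, y_tr)
scores = tree.predict_proba(X_te)[:, 1]          # estimated P(+1 | x)

fpr, tpr, thresholds = roc_curve(y_te, scores)   # sweep the threshold c
print("sensitivity (TPR):", tpr)
print("specificity (1 - FPR):", 1 - fpr)
print("AUC:", roc_auc_score(y_te, scores))
```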

Decision Tree vs. SVM
(ROC curves for TREE vs. SVM on the SPAM data; figure from Trevor Hastie, Stanford University.)
- SVM error: 6.7%; TREE error: 8.7%.
- Comparing ROC curves on the test data is a good way to compare classifiers; here the SVM dominates the tree.
- The SVM outperforms the decision tree. AUC: Area Under the Curve.

Bagging and Random Forest

Toy Example: No Noise
(Scatter plot of the two classes in (X_1, X_2); figure from Trevor Hastie, Stanford University.)
- Deterministic problem; the noise comes from the sampling distribution of X.
- Use a training sample of size 200. Here the Bayes error is 0%.
- Nonlinearly separable data; the optimal decision boundary is X_1^2 + X_2^2 = 1.

Toy Example: Decision Tree
(Classification tree and the resulting decision boundary on the toy data; figure from Trevor Hastie, Stanford University. Error rate: 0.073.)
- Sample size: 200.
- 7 branching nodes; 6 layers.
- Classification error: 7.3% when d = 2; above 30% when d = 10.

Model Averaging
- Decision trees can be simple, but they often produce noisy (bushy) or weak (stunted) classifiers.
- Bagging (Breiman, 1996): fit many large trees to bootstrap-resampled versions of the training data, and classify by majority vote.
- Boosting (Freund & Schapire, 1996): fit many large or small trees to reweighted versions of the training data, and classify by weighted majority vote.
- Random Forests (Breiman, 1999): a fancier version of bagging.
- In general: Boosting > Random Forests > Bagging > Single Tree.

Bagging / Bootstrap Aggregation
- Motivation: average a given procedure over many samples to reduce its variance.
- We are given a classifier C(S, x), trained on our training data S, producing a predicted class label at input point x.
- To bag C, we draw bootstrap samples S^1, ..., S^B, each of size N, with replacement from the training data. Then
  Ĉ_Bag(x) = Majority Vote { C(S^b, x) }_{b=1}^{B}.
- Bagging can dramatically reduce the variance of unstable procedures (like trees), leading to improved prediction. However, the simple structure of a single tree is lost.
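The Ĉ_Bag formula maps onto a short loop: draw B bootstrap samples, fit one tree to each, and take a majority vote at prediction time. A hedged sketch using sklearn's DecisionTreeClassifier as the base learner and a synthetic two-class dataset:

```python
# Bagging sketch: B bootstrap samples -> B trees -> majority vote.
import numpy as np
from sklearn.datasets import make_moons
from sklearn.tree import DecisionTreeClassifier

X, y = make_moons(n_samples=200, noise=0.3, random_state=0)   # toy data, labels 0/1
rng = np.random.default_rng(0)
B, N = 50, len(X)

trees = []
for b in range(B):
    idx = rng.integers(0, N, size=N)   # bootstrap sample S^b (drawn with replacement)
    trees.append(DecisionTreeClassifier(random_state=b).fit(X[idx], y[idx]))


def bagged_predict(X_new):
    """Majority vote over the B bootstrap trees."""
    votes = np.stack([t.predict(X_new) for t in trees])   # shape (B, n_points)
    return (votes.mean(axis=0) > 0.5).astype(int)          # majority for 0/1 labels


print(bagged_predict(X[:5]), y[:5])
```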

Toy Example: Bagging Trees
(The original tree and five bootstrap trees on the toy data; figure from Trevor Hastie, Stanford University.)
- Each tree has 2 branching nodes and 2 layers.
- The 5 bootstrap trees are dependent (they are grown from overlapping resamples of the same data).

Toy Example: Bagging Trees
(Decision boundary of the bagged classifier; figure from Trevor Hastie, Stanford University. Error rate: 0.032.)
- Bagging averages many trees and produces smoother decision boundaries.
- Classification error: 3.2% (vs. 7.3% for a single deeper tree).

Random Forest
Bag features and samples simultaneously:
- At each tree split, a random sample of m features is drawn, and only those m features are considered for splitting. Typically m = √d or log2(d), where d is the number of features.
- For each tree grown on a bootstrap sample, the error rate for observations left out of that bootstrap sample is monitored; this is called the out-of-bag (OOB) error rate.
- Random forests try to improve on bagging by de-correlating the trees. Each tree has the same expectation.
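In practice, the random feature subsets and the out-of-bag error come for free in standard libraries. A hedged sklearn sketch on synthetic data (m = √d via max_features="sqrt"):

```python
# Random forest sketch: m = sqrt(d) features per split, OOB error monitored.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=1000, n_features=30, random_state=0)

rf = RandomForestClassifier(
    n_estimators=500,        # number of trees, each grown on a bootstrap sample
    max_features="sqrt",     # m = sqrt(d) features considered at each split
    oob_score=True,          # monitor the out-of-bag error rate
    random_state=0,
).fit(X, y)

print("OOB error rate:", 1 - rf.oob_score_)
```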

Random Forest for Spam Classification
(ROC curves for TREE, SVM, and Random Forest on the SPAM data; figure from Trevor Hastie, Stanford University.)
- Random Forest error: 5.0%; SVM error: 6.7%; TREE error: 8.7%.
- The random forest dominates both other methods on the SPAM data.
- 500 trees were used, with the default settings of the randomForest package in R.
- RF outperforms the SVM.

Random Forest: Variable Importance
(Variable importance scores for the SPAM data; figure from Trevor Hastie, Stanford University. The most important variables include !, $, hp, remove, free, CAPAVE, your, CAPMAX, george, and CAPTOT.)
- Each bootstrap sample covers roughly 2/3 of the training samples; the remaining ~1/3 are out-of-bag.
- Permute each variable in turn and check how much performance degrades, i.e., how important that variable is.
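Permutation importance can be sketched by shuffling one feature at a time on held-out data and measuring the drop in accuracy (sklearn also ships permutation_importance, but the explicit loop shows the idea); the data here are synthetic placeholders:

```python
# Permutation-importance sketch: shuffle one feature at a time and record the
# drop in held-out accuracy. Illustrative only; synthetic data.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=10, n_informative=3,
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)
baseline = rf.score(X_te, y_te)

rng = np.random.default_rng(0)
for j in range(X.shape[1]):
    X_perm = X_te.copy()
    X_perm[:, j] = rng.permutation(X_perm[:, j])   # break the link between X_j and y
    drop = baseline - rf.score(X_perm, y_te)       # importance of feature j
    print(f"feature {j}: importance ~ {drop:.3f}")
```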

Boosting

Boosting
(Schematic from Trevor Hastie, Stanford University: the training sample produces C_1(x); successively re-weighted samples produce C_2(x), C_3(x), ..., C_M(x).)
- Average many trees, each grown on a re-weighted version of the training data (iteratively).
- The final classifier is a weighted average of the individual classifiers:
  C(x) = sign( Σ_{m=1}^{M} α_m C_m(x) ).

AdaBoost vs. Bagging
(Test error of bagging and AdaBoost with two-node trees versus the number of terms; figure from Trevor Hastie, Stanford University.)
- 2000 points from nested spheres in R^10; the Bayes error rate is 0%.
- Trees are grown best-first without pruning; the leftmost point is a single tree.
- Each base classifier is a tree with a single split (a decision stump).
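A hedged sklearn sketch of the same comparison: bagged stumps versus boosted stumps on a rough stand-in for the nested-spheres data (Gaussian features in R^10 labeled by the squared radius, thresholded at its empirical median; this is an approximation, not the exact construction behind the figure):

```python
# Sketch: bagging vs. AdaBoost with decision stumps on nested-spheres-like data.
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import BaggingClassifier, AdaBoostClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.standard_normal((4000, 10))                  # points in R^10
r2 = np.sum(X**2, axis=1)
y = (r2 > np.median(r2)).astype(int)                 # inner vs. outer "sphere"
X_tr, X_te, y_tr, y_te = train_test_split(X, y, train_size=2000, random_state=0)

stump = DecisionTreeClassifier(max_depth=1)
bag = BaggingClassifier(stump, n_estimators=400, random_state=0).fit(X_tr, y_tr)
ada = AdaBoostClassifier(stump, n_estimators=400, random_state=0).fit(X_tr, y_tr)

print("Bagged stumps test error: ", 1 - bag.score(X_te, y_te))
print("Boosted stumps test error:", 1 - ada.score(X_te, y_te))
```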

AdaBoost
AdaBoost (Freund & Schapire, 1996):
1. Initialize the observation weights w_i = 1/N, i = 1, 2, ..., N.
2. For m = 1 to M, repeat steps (a)-(d):
   (a) Fit a classifier C_m(x) to the training data using the weights w_i.
   (b) Compute the weighted error of the newest tree:
       err_m = Σ_{i=1}^{N} w_i I(y_i ≠ C_m(x_i)) / Σ_{i=1}^{N} w_i.
   (c) Compute α_m = log[(1 - err_m) / err_m].
   (d) Update the weights for i = 1, ..., N:
       w_i ← w_i · exp[α_m I(y_i ≠ C_m(x_i))],
       and renormalize so that the w_i sum to 1.
3. Output C(x) = sign( Σ_{m=1}^{M} α_m C_m(x) ).
The w_i are the sample weights; err_m is the weighted training error. (See the sketch below.)
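The algorithm maps almost line-for-line onto code. A sketch with decision stumps as the base classifiers C_m (sklearn's DecisionTreeClassifier with max_depth=1, fit with sample weights); labels are assumed to be in {-1, +1}, and the clipping of err_m is a numerical safeguard not in the slide:

```python
# AdaBoost sketch following the steps above: weights w_i, weighted error err_m,
# coefficient alpha_m, weight update, and the sign of the weighted sum.
import numpy as np
from sklearn.tree import DecisionTreeClassifier


def adaboost(X, y, M=100):
    """y must be in {-1, +1}. Returns the fitted classifiers and their alphas."""
    N = len(y)
    w = np.full(N, 1.0 / N)                                        # step 1
    classifiers, alphas = [], []
    for m in range(M):                                             # step 2
        C_m = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=w)  # (a)
        miss = (C_m.predict(X) != y)
        err_m = np.sum(w * miss) / np.sum(w)                       # (b)
        err_m = np.clip(err_m, 1e-10, 1 - 1e-10)                   # numerical guard
        alpha_m = np.log((1 - err_m) / err_m)                      # (c)
        w = w * np.exp(alpha_m * miss)                             # (d)
        w = w / np.sum(w)                                          # renormalize
        classifiers.append(C_m)
        alphas.append(alpha_m)
    return classifiers, alphas


def predict(classifiers, alphas, X):
    """Step 3: C(x) = sign( sum_m alpha_m C_m(x) )."""
    agg = sum(a * c.predict(X) for a, c in zip(alphas, classifiers))
    return np.sign(agg)
```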

Boosting Stumps
(Test error of boosting stumps versus the number of boosting iterations, with reference lines for a single stump and a single large tree; figure from Trevor Hastie, Stanford University.)
- A stump is a two-node tree, obtained after a single split.
- Boosting stumps works remarkably well on the nested-spheres problem.
- The ensemble of classifiers is much more efficient than a simple combination of classifiers.

Overfitting of AdaBoost
(Training and test error versus the number of terms; figure from Trevor Hastie, Stanford University.)
- Nested spheres in 10 dimensions; the Bayes error is 0%.
- Boosting drives the training error to zero.
- Further iterations continue to improve the test error in many examples.
- AdaBoost is often observed to be robust to overfitting.

Overfitting of AdaBoost: Noisy Problems
(Training and test error versus the number of terms, with the Bayes error marked; figure from Trevor Hastie, Stanford University.)
- Nested Gaussians in 10 dimensions; the optimal Bayes classification error is 25%.
- Boosting with stumps: here the test error does increase, but quite slowly.