Machine Learning: Logistic Regression. Hamid R. Rabiee, Spring 2015. http://ce.sharif.edu/courses/93-94/2/ce717-1/
Agenda: Probabilistic classification; Introduction to logistic regression; Binary logistic regression; Logistic regression: decision surface; Logistic regression: ML estimation; Logistic regression: gradient descent; Logistic regression: multi-class; Logistic regression: regularization; Logistic regression vs. Naïve Bayes
Probabilistic Classification
Generative probabilistic classification (previous lecture). Motivation: assume a distribution for each class and find the parameters of those distributions. Cons: we need to assume distributions, and we need to fit many parameters.
Discriminative approach: logistic regression (focus of today). Motivation: like least squares, but pass the linear score through the logistic function, y(x) = σ(wᵀx), and classify based on whether y(x) > 0.5 or not. Technique: gradient descent.
Introduction to Logistic regression
Logistic regression represents the probability of category i by passing a linear function of the input variables through the logistic function. The name comes from the logit transformation, which is exactly this linear function: logit(p) = ln(p / (1 − p)) = w₀ + Σᵢ wᵢXᵢ.
Binary logistic regression
Logistic Regression assumes a parametric form for the distribution P(Y | X), then directly estimates its parameters from the training data. The parametric model assumed by Logistic Regression in the case where Y is boolean is:
P(Y = 1 | X) = 1 / (1 + exp(−(w₀ + Σᵢ wᵢXᵢ)))   (1)
P(Y = 0 | X) = exp(−(w₀ + Σᵢ wᵢXᵢ)) / (1 + exp(−(w₀ + Σᵢ wᵢXᵢ)))   (2)
Notice that equation (2) follows directly from equation (1), because the sum of these two probabilities must equal 1.
Binary logistic regression
We only need one set of parameters (w₀, w₁, …, wₙ), since P(Y = 0 | X) is determined by P(Y = 1 | X). The model is built on the sigmoid (logistic) function σ(z) = 1 / (1 + e⁻ᶻ), so that P(Y = 1 | X) = σ(w₀ + Σᵢ wᵢXᵢ).
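The parametric form above can be sketched in a few lines of code. This is a minimal illustration, not the lecture's own implementation; the names `sigmoid` and `p_y1` are chosen here for clarity.

```python
import math

def sigmoid(z):
    """Logistic (sigmoid) function: squashes any real score into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def p_y1(w0, w, x):
    """P(Y = 1 | x) under the logistic model: sigmoid of the linear score
    w0 + sum_i w_i * x_i, matching equation (1) on the previous slide."""
    z = w0 + sum(wi * xi for wi, xi in zip(w, x))
    return sigmoid(z)
```

Note that `1 - p_y1(...)` gives P(Y = 0 | x), which is why one set of parameters suffices.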
Logistic regression vs. Linear regression (adapted from slides of John Whitehead)
Logistic regression: Decision surface
Given learned weights w and an input x, a decision surface is a set where f(x; w) = P(Y = 1 | x) is constant. Since σ is monotonic, P(Y = 1 | x) = 0.5 exactly when w₀ + Σᵢ wᵢxᵢ = 0, so decision surfaces are linear functions (hyperplanes) of x. Decision making on Y: predict Y = 1 iff w₀ + Σᵢ wᵢxᵢ > 0, equivalently P(Y = 1 | x) > 0.5.
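The decision rule reduces to checking the sign of the linear score, with no need to evaluate the sigmoid at all. A minimal sketch (the function name `predict` is an illustrative choice):

```python
def predict(w0, w, x):
    """Predict Y = 1 iff the linear score w0 + sum_i w_i * x_i is positive,
    which is exactly the condition P(Y = 1 | x) > 0.5."""
    z = w0 + sum(wi * xi for wi, xi in zip(w, x))
    return 1 if z > 0 else 0
```

For example, with w0 = −1.5 and w = [1.0] the decision boundary is the hyperplane x = 1.5.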
Computing the likelihood in detail
We can re-express the log of the conditional likelihood as:
l(w) = Σₗ [ yˡ ln P(yˡ = 1 | xˡ, w) + (1 − yˡ) ln P(yˡ = 0 | xˡ, w) ]
     = Σₗ [ yˡ ln ( P(yˡ = 1 | xˡ, w) / P(yˡ = 0 | xˡ, w) ) + ln P(yˡ = 0 | xˡ, w) ]
     = Σₗ [ yˡ (w₀ + Σᵢ wᵢxᵢˡ) − ln(1 + exp(w₀ + Σᵢ wᵢxᵢˡ)) ]
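The final form of l(w) above translates directly into code. A small sketch (illustrative names, not from the slides):

```python
import math

def log_likelihood(w0, w, X, y):
    """Conditional log likelihood of binary logistic regression:
    l(w) = sum_l [ y^l * z^l - ln(1 + exp(z^l)) ],
    where z^l = w0 + sum_i w_i * x_i^l is the linear score for example l."""
    total = 0.0
    for x_l, y_l in zip(X, y):
        z = w0 + sum(wi * xi for wi, xi in zip(w, x_l))
        total += y_l * z - math.log(1.0 + math.exp(z))
    return total
```

With all-zero weights every example contributes −ln 2, as expected since the model then assigns probability 0.5 to both labels.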
Logistic regression: ML estimation
The conditional log likelihood l(w) is concave in w. (Recall: a function is concave if every chord lies below its graph, and convex if every chord lies above it.) There is no closed-form solution for the maximizing w, so we optimize iteratively.
Optimizing a concave/convex function
The maximum of a concave function f is the minimum of the convex function −f. Accordingly: gradient ascent for concave functions, gradient descent for convex ones.
Gradient ascent / Gradient descent
For a function f(w) and learning rate η > 0:
If f is concave, the gradient ascent rule is w ← w + η ∇f(w).
If f is convex, the gradient descent rule is w ← w − η ∇f(w).
Logistic regression: Gradient descent
For the conditional log likelihood the gradient is ∂l(w)/∂wᵢ = Σₗ xᵢˡ (yˡ − P̂(yˡ = 1 | xˡ, w)), giving the ascent update wᵢ ← wᵢ + η Σₗ xᵢˡ (yˡ − P̂(yˡ = 1 | xˡ, w)). Iteratively updating the weights in this fashion increases the likelihood each round, so we eventually reach the maximum. We are near the maximum when the changes in the weights are small; thus we can stop when the sum of the absolute values of the weight differences falls below some small threshold.
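The full training procedure, including the stopping rule based on the sum of absolute weight changes, can be sketched as below. This is a minimal batch implementation under assumed defaults (learning rate `eta`, tolerance `tol`, iteration cap `max_iters` are all illustrative choices, not values from the lecture):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_logistic(X, y, eta=0.1, tol=1e-6, max_iters=10000):
    """Batch gradient ascent on the conditional log likelihood.
    Stops when the sum of absolute weight changes falls below tol
    (or after max_iters rounds, e.g. on separable data where weights grow)."""
    n = len(X[0])
    w0, w = 0.0, [0.0] * n
    for _ in range(max_iters):
        # Per-example errors y^l - P_hat(y^l = 1 | x^l, w).
        errs = [y_l - sigmoid(w0 + sum(wi * xi for wi, xi in zip(w, x_l)))
                for x_l, y_l in zip(X, y)]
        # Gradient: sum over examples of error times feature value.
        dw0 = sum(errs)
        dw = [sum(e * x_l[i] for e, x_l in zip(errs, X)) for i in range(n)]
        w0_new = w0 + eta * dw0
        w_new = [wi + eta * dwi for wi, dwi in zip(w, dw)]
        change = abs(w0_new - w0) + sum(abs(a - b) for a, b in zip(w_new, w))
        w0, w = w0_new, w_new
        if change < tol:
            break
    return w0, w
```

On a 1-D toy set with labels 0 below the point 1.5 and 1 above it, the learned scores come out negative on the left and positive on the right.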
Logistic regression: multi-class
In the two-class case we used the logistic sigmoid. For multiclass, we work with the soft-max function instead: P(Y = k | x) = exp(wₖᵀx) / Σⱼ exp(wⱼᵀx). Aka softmax.
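The soft-max formula can be sketched directly. Subtracting the maximum score first is a standard numerical-stability trick (it cancels in the ratio, so the result is unchanged); it is an implementation detail added here, not something stated on the slide:

```python
import math

def softmax(scores):
    """Soft-max: exp(z_k) / sum_j exp(z_j) over a list of class scores z_j.
    Shifting by the max score avoids overflow without changing the result."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]
```

With two classes and scores (z, 0), soft-max reduces to the logistic sigmoid of z, which is why the binary model is a special case.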
Logistic Regression: Regularization
Overfitting the training data is a problem that can arise in Logistic Regression, especially when the data is very high-dimensional and sparse. One approach to reducing overfitting is regularization, in which we create a modified, penalized log likelihood function that penalizes large values of w:
w ← argmax_w Σₗ ln P(yˡ | xˡ, w) − (λ/2) ‖w‖²
The derivative of this penalized log likelihood function is similar to our earlier derivative, with one additional penalty term:
∂l(w)/∂wᵢ = Σₗ xᵢˡ (yˡ − P̂(yˡ = 1 | xˡ, w)) − λwᵢ
which gives us the modified update rule
wᵢ ← wᵢ + η Σₗ xᵢˡ (yˡ − P̂(yˡ = 1 | xˡ, w)) − η λ wᵢ
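A single step of the penalized update can be sketched as follows. Leaving the bias w₀ unpenalized is a common convention assumed here, not something the slide specifies; `eta` and `lam` are illustrative defaults:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def regularized_step(w0, w, X, y, eta=0.1, lam=1.0):
    """One gradient step on the penalized log likelihood:
    w_i <- w_i + eta * sum_l x_i^l * (y^l - P_hat) - eta * lam * w_i.
    The extra -eta*lam*w_i term shrinks the weights toward zero."""
    errs = [y_l - sigmoid(w0 + sum(wi * xi for wi, xi in zip(w, x_l)))
            for x_l, y_l in zip(X, y)]
    w0_new = w0 + eta * sum(errs)  # bias left unpenalized (assumption)
    w_new = [wi + eta * sum(e * x_l[i] for e, x_l in zip(errs, X)) - eta * lam * wi
             for i, wi in enumerate(w)]
    return w0_new, w_new
```

When the data provides no gradient for a weight (e.g. its feature is zero everywhere), the penalty term alone shrinks that weight each step, which is the regularizing effect.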
Logistic Regression vs. Naïve Bayes
In general, NB and LR make different assumptions. NB: features independent given the class, i.e. an assumption on P(X | Y). LR: assumes a functional form for P(Y | X), no assumption on P(X | Y). LR is a linear classifier: its decision rule is a hyperplane. LR is optimized by conditional likelihood: there is no closed-form solution, but the objective is concave, so gradient ascent reaches the global optimum.
Logistic Regression vs. Naïve Bayes
Consider Y and the Xᵢ boolean, X = <X₁ … Xₙ>. Number of parameters: NB: 2n + 1 (one P(Xᵢ | Y) estimate per feature per class value, plus the class prior); LR: n + 1 (one weight per feature, plus the bias). Estimation method: NB parameter estimates are uncoupled (each estimated independently in closed form); LR parameter estimates are coupled (all weights fit jointly).
Logistic Regression vs. Gaussian Naive Bayes
When the GNB modeling assumptions do not hold, Logistic Regression and GNB typically learn different classifier functions. While Logistic Regression is consistent with the Naïve Bayes assumption that the input features Xᵢ are conditionally independent given Y, it is not rigidly tied to this assumption as Naive Bayes is. GNB parameter estimates converge toward their asymptotic values in order log(n) examples, where n is the dimension of X. Logistic Regression parameter estimates converge more slowly, requiring order n examples.
Summary
Logistic Regression learns the conditional probability distribution P(y | x). It is a local search: it begins with an initial weight vector and modifies it iteratively to maximize an objective function. The objective function is the conditional log likelihood of the data, so the algorithm seeks the probability distribution P(y | x) that is most likely given the data.
Any questions? End of Lecture 9. Thank you! Spring 2015, http://ce.sharif.edu/courses/93-94/2/ce717-1/