Intro Logistic Regression Gradient Descent + SGD


Case Study 1: Estimating Click Probabilities
Intro Logistic Regression Gradient Descent + SGD
Machine Learning for Big Data, CSE547/STAT548, University of Washington
Sham Kakade
March 29, 2016

Ad Placement Strategies
Companies bid on ad prices.
Which ad wins? (many simplifications here)
Naively:
But:
Instead:

Key Task: Estimating Click Probabilities
What is the probability that user i will click on ad j?
Not important just for ads: optimize search results, suggest news articles, recommend products.
Methods are much more general, useful for classification, regression, and density estimation.

Learning Problem for Click Prediction
Prediction task:
Features:
Data:
Batch:
Online:
Many approaches (e.g., logistic regression, SVMs, naïve Bayes, decision trees, boosting, ...).
Focus on logistic regression; it captures the main concepts, and the ideas generalize to other approaches.

Logistic Regression
Learn P(Y|X) directly.
Assume a particular functional form: a sigmoid applied to a linear function of the data.
Logistic function (or sigmoid): 1 / (1 + exp(-z))
Features can be discrete or continuous!
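As a minimal sketch of the model on this slide, the snippet below evaluates the logistic function and the resulting click probability. The names `sigmoid` and `predict_proba`, and the sign convention P(Y=1 | x, w) = sigmoid(w0 + w · x), are assumptions made for illustration; the transcript does not preserve the slide's exact formula.

```python
import math


def sigmoid(z):
    """Logistic function 1 / (1 + exp(-z)), branched to avoid overflow."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)  # z < 0, so exp(z) cannot overflow
    return ez / (1.0 + ez)


def predict_proba(w0, w, x):
    """P(Y=1 | x, w) under the assumed convention sigmoid(w0 + w . x)."""
    return sigmoid(w0 + sum(wi * xi for wi, xi in zip(w, x)))


# Hypothetical weights and features, chosen only to exercise the functions.
print(predict_proba(-1.0, [2.0, 0.5], [1.0, 0.0]))
```

Features enter only through the weighted sum, which is why discrete (0/1) and continuous features can be mixed freely.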

Very convenient!
The log-odds ln[P(Y=1|X) / P(Y=0|X)] are a linear function of the features, so comparing them with 0 implies a linear classification rule!
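The chain of implications behind this slide can be written out as follows. This is a sketch under the assumed convention that P(Y=1 | x, w) is the sigmoid of w_0 + Σ_i w_i x_i; the slide's own equations are not preserved in the transcript.

```latex
\begin{align*}
P(Y=1 \mid x, w) &= \frac{1}{1 + \exp\!\big(-(w_0 + \sum_i w_i x_i)\big)}, \\
P(Y=0 \mid x, w) &= 1 - P(Y=1 \mid x, w)
                  = \frac{1}{1 + \exp\!\big(w_0 + \sum_i w_i x_i\big)}, \\
\frac{P(Y=1 \mid x, w)}{P(Y=0 \mid x, w)} &= \exp\!\Big(w_0 + \sum_i w_i x_i\Big), \\
\ln \frac{P(Y=1 \mid x, w)}{P(Y=0 \mid x, w)} &= w_0 + \sum_i w_i x_i .
\end{align*}
```

Predicting Y=1 exactly when w_0 + Σ_i w_i x_i > 0 is therefore the same as thresholding the predicted probability at 1/2.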

Digression: Logistic regression more generally
Logistic regression in the more general case, where Y ∈ {y_1, ..., y_R}:
for k < R:
for k = R (normalization, so no weights for this class):
Features can be discrete or continuous!
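The two cases named on this slide are commonly written as below. The formulas are a reconstruction of the standard multiclass form, not a copy of the slide; the symbols w_{k0}, w_{ki}, and d (number of features) are assumed notation.

```latex
\begin{align*}
P(Y = y_k \mid x, w) &=
  \frac{\exp\!\big(w_{k0} + \sum_{i=1}^{d} w_{ki} x_i\big)}
       {1 + \sum_{j=1}^{R-1} \exp\!\big(w_{j0} + \sum_{i=1}^{d} w_{ji} x_i\big)}
  && \text{for } k < R, \\
P(Y = y_R \mid x, w) &=
  \frac{1}{1 + \sum_{j=1}^{R-1} \exp\!\big(w_{j0} + \sum_{i=1}^{d} w_{ji} x_i\big)}
  && \text{for } k = R .
\end{align*}
```

With R = 2 this reduces to the binary model: one weight vector, and the second class absorbs the normalization.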

Loss function: Conditional Likelihood
Have a bunch of iid data of the form:
Discriminative (logistic regression) loss function: Conditional Data Likelihood

Expressing Conditional Log Likelihood
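The expression itself did not survive extraction. A sketch of the usual derivation follows, assuming N iid examples (x^j, y^j) with y^j ∈ {0, 1}, feature index i, and the convention that P(Y=1 | x, w) is the sigmoid of w_0 + Σ_i w_i x_i; this notation is an assumption.

```latex
\begin{align*}
l(w) &= \ln \prod_{j=1}^{N} P(y^j \mid x^j, w) \\
     &= \sum_{j=1}^{N} \Big[ y^j \ln P(Y=1 \mid x^j, w)
        + (1 - y^j) \ln P(Y=0 \mid x^j, w) \Big] \\
     &= \sum_{j=1}^{N} \Big[ y^j \Big(w_0 + \sum_i w_i x_i^j\Big)
        - \ln\!\Big(1 + \exp\!\Big(w_0 + \sum_i w_i x_i^j\Big)\Big) \Big].
\end{align*}
```

The last line uses ln P(Y=1 | x, w) = z − ln(1 + e^z) and ln P(Y=0 | x, w) = −ln(1 + e^z), with z = w_0 + Σ_i w_i x_i.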

Maximizing Conditional Log Likelihood
Good news: l(w) is a concave function of w, so there are no local optima problems.
Bad news: there is no closed-form solution to maximize l(w).
Good news: concave functions are easy to optimize.

Optimizing a concave function: Gradient ascent
The conditional log likelihood for logistic regression is concave, so find the optimum with gradient ascent.
Gradient:
Update rule:
Step size, η > 0
Gradient ascent is the simplest of optimization approaches; e.g., conjugate gradient ascent is much better (see reading).

Gradient Ascent for LR
Gradient ascent algorithm: iterate until change < ϵ
For i = 1, ..., d:
repeat
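The loop on this slide can be sketched in plain Python as below. This is an illustrative implementation, not the lecture's code: the function names, the default step size `eta`, the tolerance `tol`, and the choice to stop on the largest weight change are all assumptions. The gradient used is Σ_j x_i^j [y^j − P(Y=1 | x^j, w)], which follows from the conditional log likelihood.

```python
import math


def sigmoid(z):
    """Numerically safe logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def log_likelihood(w0, w, X, y):
    """Conditional log likelihood: sum_j [y_j * z_j - ln(1 + exp(z_j))]."""
    total = 0.0
    for xj, yj in zip(X, y):
        z = w0 + sum(wi * xi for wi, xi in zip(w, xj))
        # Stable evaluation of ln(1 + exp(z)).
        softplus = max(z, 0.0) + math.log1p(math.exp(-abs(z)))
        total += yj * z - softplus
    return total


def gradient_ascent(X, y, eta=0.1, tol=1e-6, max_iter=10000):
    """Batch gradient ascent; every step touches all N examples."""
    d = len(X[0])
    w0, w = 0.0, [0.0] * d
    for _ in range(max_iter):
        # Residuals y_j - P(Y=1 | x_j, w) at the current weights.
        resid = [yj - sigmoid(w0 + sum(wi * xi for wi, xi in zip(w, xj)))
                 for xj, yj in zip(X, y)]
        g0 = sum(resid)  # gradient with respect to the bias w0
        g = [sum(r * xj[i] for r, xj in zip(resid, X)) for i in range(d)]
        w0 += eta * g0
        w = [wi + eta * gi for wi, gi in zip(w, g)]
        # Stop once the largest weight change falls below the tolerance.
        if eta * max([abs(g0)] + [abs(gi) for gi in g]) < tol:
            break
    return w0, w
```

A fixed step size is used here for simplicity; it has to be small enough relative to the curvature of l(w), otherwise the iterates can oscillate.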

Regularized Conditional Log Likelihood
If data are linearly separable, weights go to infinity. This leads to overfitting.
Penalize large weights: add a regularization penalty, e.g., L2:
Practical note about w_0:

Standard vs. Regularized Updates
Maximum conditional likelihood estimate:
Regularized maximum conditional likelihood estimate:
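The two update rules are missing from the transcript. A sketch of their usual form is given below, assuming step size η, an L2 penalty written as (λ/2) Σ_i w_i², and examples indexed by j; the exact scaling of the penalty on the original slide is not known.

```latex
\begin{align*}
\text{Standard:}\quad
w_i^{(t+1)} &\leftarrow w_i^{(t)}
  + \eta \sum_{j} x_i^j \Big[ y^j - P\big(Y=1 \mid x^j, w^{(t)}\big) \Big], \\
\text{Regularized:}\quad
w_i^{(t+1)} &\leftarrow w_i^{(t)}
  + \eta \Big\{ -\lambda\, w_i^{(t)}
  + \sum_{j} x_i^j \Big[ y^j - P\big(Y=1 \mid x^j, w^{(t)}\big) \Big] \Big\}.
\end{align*}
```

The extra term −λ w_i is the gradient of the penalty −(λ/2) Σ_i w_i²; it shrinks each weight toward zero at every step.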

Stopping criterion
Regularized logistic regression is strongly concave: the negative second derivative is bounded away from zero:
Strong concavity (convexity) is super helpful!!
For example, for strongly concave l(w):
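The inequalities on this slide are not preserved. One standard consequence of strong concavity that yields a stopping rule is sketched below, assuming l is λ-strongly concave (as it is with an L2 penalty (λ/2)‖w‖²) and writing w* for the maximizer; whether this is the exact bound shown in the lecture is an assumption.

```latex
\begin{align*}
-\nabla^2 l(w) \succeq \lambda I
\;&\Longrightarrow\;
l(w') \le l(w) + \nabla l(w)^{\top}(w' - w) - \frac{\lambda}{2}\,\|w' - w\|^2
\quad \text{for all } w, w', \\
&\Longrightarrow\;
l(w^*) - l(w) \le \frac{1}{2\lambda}\,\|\nabla l(w)\|^2 .
\end{align*}
```

The second line follows by maximizing the right-hand side over w'. In practice this means that stopping once ‖∇l(w)‖² ≤ 2λϵ guarantees l(w*) − l(w) ≤ ϵ, using only quantities that are available during optimization.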

Convergence rates for gradient descent/ascent
Number of iterations to get to accuracy ϵ:
If func l(w) is Lipschitz: O(1/ϵ²)
If gradient of func is Lipschitz: O(1/ϵ)
If func is strongly convex: O(ln(1/ϵ))

Challenge 1: Complexity of computing gradients
What's the cost of a gradient update step for LR???

Challenge 2: Data is streaming
Assumption thus far: batch data.
But click prediction is a streaming data task:
User enters query, and ad must be selected: observe x^j, and must predict y^j.
User either clicks or doesn't click on ad: label y^j is revealed afterwards.
Google gets a reward if user clicks on ad.
Weights must be updated for next time:

Learning Problems as Expectations
Minimizing loss in training data:
Given dataset: sampled iid from some distribution p(x) on features.
Loss function, e.g., hinge loss, logistic loss, ...
We often minimize loss in training data:
However, we should really minimize expected loss on all data:
So, we are approximating the integral by the average on the training data.
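The approximation described on this slide can be written as below. The symbols ℓ(w, x) for the per-example loss and N for the number of training examples are assumed notation, since the slide's formulas are missing.

```latex
\begin{align*}
\underbrace{\frac{1}{N} \sum_{j=1}^{N} \ell(w, x^j)}_{\text{training loss}}
\;\approx\;
\underbrace{\mathbb{E}_{x}\big[\ell(w, x)\big]
  = \int p(x)\, \ell(w, x)\, dx}_{\text{expected loss}},
\qquad x^1, \dots, x^N \overset{\text{iid}}{\sim} p(x).
\end{align*}
```

The training average is a Monte Carlo estimate of the integral, which is what later justifies estimating the gradient from a few samples, or even one.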

Gradient Ascent in Terms of Expectations
True objective function:
Taking the gradient:
True gradient ascent rule:
How do we estimate expected gradient?

SGD: Stochastic Gradient Ascent (or Descent)
True gradient:
Sample-based approximation:
What if we estimate gradient with just one sample???
Unbiased estimate of gradient.
Very noisy!
Called stochastic gradient ascent (or descent), among many other names.
VERY useful in practice!!!

Stochastic Gradient Ascent: General Case
Given a stochastic function of parameters:
Want to find maximum.
Start from w^(0).
Repeat until convergence:
Get a sample data point x^t.
Update parameters:
Works in the online learning setting!
Complexity of each gradient step is constant in number of examples!
In general, step size changes with iterations.
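The general procedure above can be sketched as a short loop. Everything concrete here is an assumption chosen for illustration: the function name `stochastic_gradient_ascent`, the toy objective E[−(w − x)²/2] (whose maximizer is the mean of x), the Gaussian sampler, and the step size 1/t.

```python
import random


def stochastic_gradient_ascent(grad, sample, w0, steps, step_size):
    """Generic loop: w <- w + eta_t * grad(w, x_t) for sampled points x_t."""
    w = w0
    for t in range(1, steps + 1):
        x_t = sample()                       # get one sample data point
        w = w + step_size(t) * grad(w, x_t)  # noisy gradient step
    return w


# Toy problem (hypothetical): maximize E[-(w - x)^2 / 2] with x ~ N(3, 1).
rng = random.Random(0)
draws = []


def sample():
    x = rng.gauss(3.0, 1.0)
    draws.append(x)  # kept only so the result can be checked afterwards
    return x


w_hat = stochastic_gradient_ascent(
    grad=lambda w, x: x - w,      # gradient of -(w - x)^2 / 2 in w
    sample=sample,
    w0=0.0,
    steps=2000,
    step_size=lambda t: 1.0 / t,  # step size decays with the iteration
)
print(w_hat)
```

With step size 1/t on this particular objective, the iterate is exactly the running average of the samples seen so far, which makes the effect of a decaying step size easy to see: each new noisy gradient moves the estimate less than the one before.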

Stochastic Gradient Ascent for Logistic Regression
Logistic loss as a stochastic function:
Batch gradient ascent updates:
Stochastic gradient ascent updates:
Online setting:
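A minimal sketch of the online setting for logistic regression follows: predict, observe the label, then update from that single example. The helper `sgd_step`, the synthetic click stream, the "true" weights used to generate it, the 1/sqrt(t) step size, and leaving the bias unregularized are all assumptions for illustration, not details taken from the lecture.

```python
import math
import random


def sigmoid(z):
    """Numerically safe logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def sgd_step(w0, w, x, y, eta, lam=0.0):
    """One stochastic gradient ascent step on a single example (x, y)."""
    p = sigmoid(w0 + sum(wi * xi for wi, xi in zip(w, x)))  # prediction first
    err = y - p                                             # label revealed
    new_w0 = w0 + eta * err  # bias left unregularized (an assumption)
    new_w = [wi + eta * (-lam * wi + xi * err) for wi, xi in zip(w, x)]
    return new_w0, new_w, p


# Synthetic stream (hypothetical data): clicks drawn from a known model.
rng = random.Random(0)
true_w0, true_w = -1.0, [2.0, -1.0]
w0, w = 0.0, [0.0, 0.0]
for t in range(1, 20001):
    x = [rng.uniform(-1.0, 1.0), rng.uniform(-1.0, 1.0)]
    p_click = sigmoid(true_w0 + sum(a * b for a, b in zip(true_w, x)))
    y = 1 if rng.random() < p_click else 0
    w0, w, _ = sgd_step(w0, w, x, y, eta=1.0 / math.sqrt(t))

print(w0, w)
```

Each step costs time proportional to the number of features and does not depend on how many examples have been seen, which is what makes the update usable on a stream.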

Convergence Rate of SGD
Theorem (see Nemirovski et al. '09 from readings):
Let f be a strongly convex stochastic function.
Assume gradient of f is Lipschitz continuous and bounded.
Then, for step sizes:
The expected loss decreases as O(1/t):

Convergence Rates for Gradient Descent/Ascent vs. SGD
Number of iterations to get to accuracy ϵ:
Gradient descent: if func is strongly convex, O(ln(1/ϵ)) iterations.
Stochastic gradient descent: if func is strongly convex, O(1/ϵ) iterations.
Seems exponentially worse, but much more subtle:
Total running time, e.g., for logistic regression:
Gradient descent:
SGD:
SGD can win when we have a lot of data.
See readings for more details.
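The running-time entries are blank in the transcript. A back-of-the-envelope version is sketched below, assuming N training examples with d dense features, so that one example's gradient costs O(d) and a full batch gradient costs O(Nd); these per-step costs are assumptions, combined with the iteration counts stated above.

```latex
\begin{align*}
\text{Gradient descent:}\quad
  & O(Nd) \text{ per iteration} \times O\big(\ln(1/\epsilon)\big) \text{ iterations}
  = O\big(Nd \,\ln(1/\epsilon)\big), \\
\text{SGD:}\quad
  & O(d) \text{ per iteration} \times O(1/\epsilon) \text{ iterations}
  = O\big(d/\epsilon\big).
\end{align*}
```

Under these assumptions the SGD total does not grow with N, so it is the smaller of the two once N ln(1/ϵ) exceeds roughly 1/ϵ, that is, when the dataset is large relative to the accuracy required.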

What you should know about Logistic Regression (LR) and Click Prediction
Click prediction problem: estimate probability of clicking; can be modeled as logistic regression.
Logistic regression model: linear model.
Gradient ascent to optimize conditional likelihood.
Overfitting + regularization.
Regularized optimization.
Convergence rates and stopping criterion.
Stochastic gradient ascent for large/streaming data.
Convergence rates of SGD.