PPDSparse: A Parallel Primal and Dual Sparse Method to Extreme Classification

Size: px

Start display at page:

Download "PPDSparse: A Parallel Primal and Dual Sparse Method to Extreme Classification"

Brendan Taylor
6 years ago
Views:

1 PPDSparse: A Parallel Primal and Dual Sparse Method to Extreme Classification Ian E.H. Yen 1, Xiangru Huang 2, Wei Dai 1, Pradeep Ravikumar 1, Inderjit S. Dhillon 2 and Eric Xing 1 1 Carnegie Mellon University. 2 University of Texas at Austin KDD, 2017 Ian E.H. Yen, Xiangru Huang, Wei Dai, Pradeep PPDSparse: Ravikumar, A Parallel Inderjit Primal S. Dhillon and and Dual Eric Sparse Xing Method (shortinst) to Extreme KDD, Classification / 17

2 Outline 1 Problem Setting Extreme Classification Related Works 2 Algorithm Separable Loss Algorithm Diagram 3 Theory Analysis of primal and dual sparsity 4 Experimental Results Ian E.H. Yen, Xiangru Huang, Wei Dai, Pradeep PPDSparse: Ravikumar, A Parallel Inderjit Primal S. Dhillon and and Dual Eric Sparse Xing Method (shortinst) to Extreme KDD, Classification / 17

3 Outline 1 Problem Setting Extreme Classification Related Works 2 Algorithm Separable Loss Algorithm Diagram 3 Theory Analysis of primal and dual sparsity 4 Experimental Results Ian E.H. Yen, Xiangru Huang, Wei Dai, Pradeep PPDSparse: Ravikumar, A Parallel Inderjit Primal S. Dhillon and and Dual Eric Sparse Xing Method (shortinst) to Extreme KDD, Classification / 17

4 Extreme Classification Goal: Learn a function h(x) : R D R K from D input features to K output scores that is consistent with labels y {0, 1} K. an E.H. Yen, Xiangru Huang, Wei Dai, Pradeep PPDSparse: Ravikumar, A Parallel Inderjit Primal S. Dhillon and and Dual Eric Sparse Xing Method (shortinst) to Extreme KDD, Classification / 17

5 Extreme Classification Goal: Learn a function h(x) : R D R K from D input features to K output scores that is consistent with labels y {0, 1} K. K is large (e.g ). an E.H. Yen, Xiangru Huang, Wei Dai, Pradeep PPDSparse: Ravikumar, A Parallel Inderjit Primal S. Dhillon and and Dual Eric Sparse Xing Method (shortinst) to Extreme KDD, Classification / 17

6 Extreme Classification Goal: Learn a function h(x) : R D R K from D input features to K output scores that is consistent with labels y {0, 1} K. K is large (e.g ). Average number of positive labels (per sample) k p = 1 N N i=1 ( k y ik). Multiclass: k p = 1; Multilabel: k p K. Average number of positive samples (per class) n p = 1 K K k=1 nk p = 1 K K k=1 ( i y ik). n p = Nk p /K N an E.H. Yen, Xiangru Huang, Wei Dai, Pradeep PPDSparse: Ravikumar, A Parallel Inderjit Primal S. Dhillon and and Dual Eric Sparse Xing Method (shortinst) to Extreme KDD, Classification / 17

7 Extreme Classification We consider Linear Classifier: h(x) := W T x where W R D K. an E.H. Yen, Xiangru Huang, Wei Dai, Pradeep PPDSparse: Ravikumar, A Parallel Inderjit Primal S. Dhillon and and Dual Eric Sparse Xing Method (shortinst) to Extreme KDD, Classification / 17

Extreme Classification We consider Linear Classifier: h(x) := W T x where W R D K. Challenge: When K is large, training of simple linear model requires O(NDK) cost. Ian E.H.

8 Extreme Classification We consider Linear Classifier: h(x) := W T x where W R D K. Challenge: When K is large, training of simple linear model requires O(NDK) cost. Ian E.H. Yen, Xiangru Huang, Wei Dai, Pradeep PPDSparse: Ravikumar, A Parallel Inderjit Primal S. Dhillon and and Dual Eric Sparse Xing Method (shortinst) to Extreme KDD, Classification / 17

9 Outline 1 Problem Setting Extreme Classification Related Works 2 Algorithm Separable Loss Algorithm Diagram 3 Theory Analysis of primal and dual sparsity 4 Experimental Results Ian E.H. Yen, Xiangru Huang, Wei Dai, Pradeep PPDSparse: Ravikumar, A Parallel Inderjit Primal S. Dhillon and and Dual Eric Sparse Xing Method (shortinst) to Extreme KDD, Classification / 17

10 Approaches Approach 1 Structural, i.e. Low-rank or Tree-hierarchy Good accuracy when assumption holds. Lower accuracy when assumptions not hold. Ian E.H. Yen, Xiangru Huang, Wei Dai, Pradeep PPDSparse: Ravikumar, A Parallel Inderjit Primal S. Dhillon and and Dual Eric Sparse Xing Method (shortinst) to Extreme KDD, Classification / 17

11 Approaches Approach 1 Structural, i.e. Low-rank or Tree-hierarchy Good accuracy when assumption holds. Lower accuracy when assumptions not hold. Approach 2 Parallelized one-vs-all Good accuracy, slow, parallelizable. Need days on largest dataset with 100 cores. Ian E.H. Yen, Xiangru Huang, Wei Dai, Pradeep PPDSparse: Ravikumar, A Parallel Inderjit Primal S. Dhillon and and Dual Eric Sparse Xing Method (shortinst) to Extreme KDD, Classification / 17

12 Approaches Approach 1 Structural, i.e. Low-rank or Tree-hierarchy Good accuracy when assumption holds. Lower accuracy when assumptions not hold. Approach 2 Parallelized one-vs-all Good accuracy, slow, parallelizable. Need days on largest dataset with 100 cores. Approach 3 Primal-Dual Sparse Good accuracy, fast, not parallelizable, memory issue O(DK). Need days on largest dataset. Ian E.H. Yen, Xiangru Huang, Wei Dai, Pradeep PPDSparse: Ravikumar, A Parallel Inderjit Primal S. Dhillon and and Dual Eric Sparse Xing Method (shortinst) to Extreme KDD, Classification / 17

13 Approaches Approach 1 Structural, i.e. Low-rank or Tree-hierarchy Good accuracy when assumption holds. Lower accuracy when assumptions not hold. Approach 2 Parallelized one-vs-all Good accuracy, slow, parallelizable. Need days on largest dataset with 100 cores. Approach 3 Primal-Dual Sparse Good accuracy, fast, not parallelizable, memory issue O(DK). Need days on largest dataset. This paper Parallel PD-Sparse Good accuracy, fast, parallelizable. Need only < 30 min on largest dataset with 100 cores. Ian E.H. Yen, Xiangru Huang, Wei Dai, Pradeep PPDSparse: Ravikumar, A Parallel Inderjit Primal S. Dhillon and and Dual Eric Sparse Xing Method (shortinst) to Extreme KDD, Classification / 17

14 Outline 1 Problem Setting Extreme Classification Related Works 2 Algorithm Separable Loss Algorithm Diagram 3 Theory Analysis of primal and dual sparsity 4 Experimental Results Ian E.H. Yen, Xiangru Huang, Wei Dai, Pradeep PPDSparse: Ravikumar, A Parallel Inderjit Primal S. Dhillon and and Dual Eric Sparse Xing Method (shortinst) to Extreme KDD, Classification / 17

15 We consider the classwise-separable hinge loss L(z, y) := K l(z k, y k ) = k=1 K max(1 y k z k, 0) k=1 Minimizing a separable loss is equivalent to One-versus-all: ( N K K N ) min l(w T W R D K k x i, y ik ) = l(w T k x i, y ik ) i=1 k=1 k=1 i=1 Ian E.H. Yen, Xiangru Huang, Wei Dai, Pradeep PPDSparse: Ravikumar, A Parallel Inderjit Primal S. Dhillon and and Dual Eric Sparse Xing Method (shortinst) to Extreme KDD, Classification / 17

16 We consider the classwise-separable hinge loss L(z, y) := K l(z k, y k ) = k=1 K max(1 y k z k, 0) k=1 Minimizing a separable loss is equivalent to One-versus-all: ( N K K N ) min l(w T W R D K k x i, y ik ) = l(w T k x i, y ik ) i=1 k=1 To obtain sparse iterates, we add l 1 -penalty on W and add bias per class w 0k. The dual problem of the l 1 -l 2 -regularized problem is: k=1 i=1 min α k R N G(α k ) := 1 2 w(α k) 2 s.t. w(α k ) = prox λ ( ˆX T α k ), 0 α ik 1. N i=1 α ik (1) Ian E.H. Yen, Xiangru Huang, Wei Dai, Pradeep PPDSparse: Ravikumar, A Parallel Inderjit Primal S. Dhillon and and Dual Eric Sparse Xing Method (shortinst) to Extreme KDD, Classification / 17

17 Outline 1 Problem Setting Extreme Classification Related Works 2 Algorithm Separable Loss Algorithm Diagram 3 Theory Analysis of primal and dual sparsity 4 Experimental Results Ian E.H. Yen, Xiangru Huang, Wei Dai, Pradeep PPDSparse: Ravikumar, A Parallel Inderjit Primal S. Dhillon and and Dual Eric Sparse Xing Method (shortinst) to Extreme KDD, Classification / 17

Update α by Coordinate Descent within A k. Ian E.H.

18 Primal-Dual-Sparse Active-set Method ( O nnz(w k )nnz(x j ) }{{} Search ) + nnz(α k )nnz(x i ) per iteration. }{{} Update+Maintain Apply Random Sparsification on (already sparse) w k before search. Update α by Coordinate Descent within A k. Ian E.H. Yen, Xiangru Huang, Wei Dai, Pradeep PPDSparse: Ravikumar, A Parallel Inderjit Primal S. Dhillon and and Dual Eric Sparse Xing Method (shortinst) to Extreme KDD, Classification / 17

Parallel & Primal-Dual Sparse Method Due to the separable loss, the optimization can be embarrassingly parallelized with one-time communication.

19 Parallel & Primal-Dual Sparse Method Due to the separable loss, the optimization can be embarrassingly parallelized with one-time communication. The input y k and output w k of each sub-problem are sparse. Can be implemented in a distributed, shared-memory, or two-level parallelization setting. Space: O(nnz(X ) + D). Nearly linear speedup even with thousands of cores. an E.H. Yen, Xiangru Huang, Wei Dai, Pradeep PPDSparse: Ravikumar, A Parallel Inderjit Primal S. Dhillon and and Dual Eric Sparse Xing Method (shortinst) to Extreme KDD, Classification / 17

20 Outline 1 Problem Setting Extreme Classification Related Works 2 Algorithm Separable Loss Algorithm Diagram 3 Theory Analysis of primal and dual sparsity 4 Experimental Results Ian E.H. Yen, Xiangru Huang, Wei Dai, Pradeep PPDSparse: Ravikumar, A Parallel Inderjit Primal S. Dhillon and and Dual Eric Sparse Xing Method (shortinst) to Extreme KDD, Classification / 17

21 Theory: Primal and Dual Sparsity Key Insight: The number of positive samples for each class n p = Nk p K is small. The following results hold if class-wise bias w k0 are added. Ian E.H. Yen, Xiangru Huang, Wei Dai, Pradeep PPDSparse: Ravikumar, A Parallel Inderjit Primal S. Dhillon and and Dual Eric Sparse Xing Method (shortinst) to Extreme KDD, Classification / 17

22 Theory: Primal and Dual Sparsity Key Insight: The number of positive samples for each class n p = Nk p K is small. The following results hold if class-wise bias w k0 are added. Step-1: bound w 1, and optimal α 1 in terms of n p : w k 1 2nk p λ, α k 1 4n k p. Ian E.H. Yen, Xiangru Huang, Wei Dai, Pradeep PPDSparse: Ravikumar, A Parallel Inderjit Primal S. Dhillon and and Dual Eric Sparse Xing Method (shortinst) to Extreme KDD, Classification / 17

23 Theory: Primal and Dual Sparsity Key Insight: The number of positive samples for each class n p = Nk p K is small. The following results hold if class-wise bias w k0 are added. Step-1: bound w 1, and optimal α 1 in terms of n p : w k 1 2nk p λ, α k 1 4n k p. Step-2: bound nnz(w), and nnz(α) in terms of w 1 and α 1 : nnz( w k ) w k 2 1 δ 2, nnz(α t k ) t 4 α k 2 1 ɛ where w is Random-Sparsified version of w with δ-approximation error in G(α), and ɛ is the desired precision of solution. Ian E.H. Yen, Xiangru Huang, Wei Dai, Pradeep PPDSparse: Ravikumar, A Parallel Inderjit Primal S. Dhillon and and Dual Eric Sparse Xing Method (shortinst) to Extreme KDD, Classification / 17

24 Multilabel Classification Ian E.H. Yen, Xiangru Huang, Wei Dai, Pradeep PPDSparse: Ravikumar, A Parallel Inderjit Primal S. Dhillon and and Dual Eric Sparse Xing Method (shortinst) to Extreme KDD, Classification / 17

25 Multiclass Classification Ian E.H. Yen, Xiangru Huang, Wei Dai, Pradeep PPDSparse: Ravikumar, A Parallel Inderjit Primal S. Dhillon and and Dual Eric Sparse Xing Method (shortinst) to Extreme KDD, Classification / 17

26 Thank you Xiangru Huang Ian E.H. Yen, Xiangru Huang, Wei Dai, Pradeep PPDSparse: Ravikumar, A Parallel Inderjit Primal S. Dhillon and and Dual Eric Sparse Xing Method (shortinst) to Extreme KDD, Classification / 17

Progress Report: Predicting Which Recommended Content Users Click Stanley Jacob, Lingjie Kong

Progress Report: Predicting Which Recommended Content Users Click Stanley Jacob, Lingjie Kong Machine learning models can be used to predict which recommended content users will click on a given website.