by Xindong Wu, Kui Yu, Hao Wang, Wei Ding


1 Online Streaming Feature Selection by Xindong Wu, Kui Yu, Hao Wang, Wei Ding

2 Outline
1. Background and Motivation
2. Related Work
3. Notations and Definitions
4. Our Framework for Streaming Feature Selection
5. Online Streaming Feature Selection Algorithms
6. Experimental Results

3 1. Background and Motivation
Traditional feature selection assumes that all features are available and presented to the learner before feature selection takes place.
Streaming feature selection: features are generated dynamically and arrive one at a time, while the number of observations stays constant.
Example 1: Texture-based image segmentation assigns a label to each pixel in a training image according to its texture type. An image can easily contain tens of thousands of labeled pixels, so generating those features is computationally expensive and collecting them all can take a long time (Perkins & Theiler 2003).
Example 2: The size of the feature set is unknown, or even infinite.

4 Challenge with Streaming Features
Do we develop a new way to integrate each new feature as it arrives and begin the computation immediately, or do we spend a long time waiting for all features to be generated and then apply existing algorithms?

5 Contributions
1. A step further on feature relevance, with an explicit treatment of feature redundancy between a feature and a target class;
2. A novel framework based on feature relevance to manage streaming feature selection;
3. Two new online streaming feature selection algorithms, with comparative studies.

6 2. Related Work
Perkins and Theiler (2003): a grafting algorithm for streaming feature selection, based on a stagewise gradient descent approach. Grafting needs the value of the tuning parameter λ to be determined in advance.
Zhou et al. (2005; 2006): two algorithms for streaming feature selection based on streamwise regression, Information-investing and Alpha-investing. Both need prior knowledge about the structure of the feature space to heuristically control the choice of candidate features.

7 3. Notations and Definitions
Let $V$ be the full set of features, $X_i$ denote the $i$th input feature, and $X_{\setminus i}$ represent all input features excluding $X_i$.
Definition 1 (Conditional independence) Two features $X$ and $Y$ are conditionally independent given the set of features $Z$ if and only if $P(X \mid Y, Z) = P(X \mid Z)$, denoted as $Ind(X, Y \mid Z)$. Accordingly, conditional dependence is denoted as $Dep(X, Y \mid Z)$.
Definition 2 (Strong relevance) $X_i$ is strongly relevant to a target $T$ if $P(T \mid X_{\setminus i}) \neq P(T \mid X_{\setminus i}, X_i)$.
Definition 3 (Weak relevance) $X_i$ is weakly relevant to a target $T$ if $X_i$ is not strongly relevant and $\exists S \subset X_{\setminus i} : P(T \mid S) \neq P(T \mid S, X_i)$.
Definition 4 (Irrelevance) $X_i$ is irrelevant to a target $T$ if it is neither strongly nor weakly relevant, that is, if $\forall S \subseteq X_{\setminus i} : P(T \mid S) = P(T \mid S, X_i)$.

8 Notations and Definitions (2)
Definition 5 (Markov blanket) Given a feature $X_i$ and assuming $M_i \subset V$, $M_i$ is a Markov blanket for $X_i$ if and only if $P(V - M_i - \{X_i\}, T \mid X_i, M_i) = P(V - M_i - \{X_i\}, T \mid M_i)$.
Definition 6 (Redundant feature-1) A feature is redundant and should be removed from $V$ (the current set of features) if and only if it is weakly relevant and has a Markov blanket $M_i$ within $V$.
Rewriting Definition 6 gives:
Definition 7 (Redundant feature-2) Given a candidate Markov blanket of a target feature $T$, denoted as $CMB(T)$, and a feature $X \in CMB(T)$, $X$ is redundant to $T$ if and only if $\exists S \subseteq CMB(T) : P(T \mid X, S) = P(T \mid S)$.

9 4. A Framework for Streaming Feature Selection
1. Initialization: best candidate feature set BCF={}, the target feature T.
2. Online relevance analysis:
(1) Generate a new feature X.
(2) Determine whether X is irrelevant to T or not.
a. If X is irrelevant to T, it is disregarded;
b. Otherwise, X is added to BCF.
3. Online redundancy analysis: identify redundant features in the current subset BCF online and remove them by Definition 7.
4. Alternate Steps 2 and 3 until the stopping criteria are satisfied.
5. Output BCF.
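The five framework steps can be sketched as a short loop. This is a minimal illustration, not the authors' implementation: the function name `stream_select`, the two predicate callbacks, and the choice of "stream exhausted" as the stopping criterion are all assumptions made for the example, and the toy predicates stand in for real statistical tests.

```python
from typing import Callable, Hashable, Iterable, List

Feature = Hashable


def stream_select(
    feature_stream: Iterable[Feature],
    is_irrelevant: Callable[[Feature], bool],
    is_redundant: Callable[[Feature, List[Feature]], bool],
) -> List[Feature]:
    """Sketch of the framework; stops when the stream ends (an assumption)."""
    bcf: List[Feature] = []              # Step 1: BCF = {}
    for x in feature_stream:             # Step 2(1): a new feature arrives
        if is_irrelevant(x):             # Step 2(2a): disregard irrelevant X
            continue
        bcf.append(x)                    # Step 2(2b): add X to BCF
        for y in list(bcf):              # Step 3: online redundancy analysis
            others = [f for f in bcf if f != y]
            if is_redundant(y, others):
                bcf.remove(y)
    return bcf                           # Step 5: output BCF


# Toy predicates (hypothetical): "noise*" features are irrelevant, and
# "b_copy" becomes redundant once "b" is present in BCF.
def toy_irrelevant(f: str) -> bool:
    return f.startswith("noise")


def toy_redundant(y: str, others: List[str]) -> bool:
    return y == "b_copy" and "b" in others


selected = stream_select(
    ["a", "noise1", "b_copy", "b", "noise2"], toy_irrelevant, toy_redundant
)
print(selected)
```

Passing the two analyses in as callbacks keeps the loop independent of any particular independence test.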

10 5. Online Streaming Feature Selection Algorithms
OSFS: Online Streaming Feature Selection algorithm
Fast-OSFS: a fast version of OSFS

11 OSFS: Online Streaming Feature Selection
OSFS finds an optimal subset using a two-phase scheme: online relevance analysis (steps 4-12) and online redundancy analysis (steps 13-21). (See the pseudo-code of OSFS on the next page.)
Relevance analysis: discovers strongly and weakly relevant features and adds them to BCF. When a new feature arrives, OSFS assesses whether it is irrelevant to the class label C; if so, it is discarded, otherwise it is added to BCF.
Redundancy analysis: if a new feature enters BCF, this phase dynamically eliminates redundant features within BCF. If there exists a subset of BCF that makes a feature Y and C conditionally independent, Y is removed from BCF.
OSFS alternates the two phases until some stopping criteria are satisfied.
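The two alternating phases can be sketched as follows. This is a hedged sketch rather than the pseudo-code from the slides: the names `osfs`, `has_separating_subset` and `indep` are hypothetical, the relevance check is simplified to a single test with an empty conditioning set, and the independence oracle is a toy stand-in for a statistical test.

```python
from itertools import combinations
from typing import Callable, Hashable, Iterable, List, Sequence

Feature = Hashable
# indep(x, cond) -> True when x and the class label C are judged conditionally
# independent given the features in cond (hypothetical oracle).
IndepTest = Callable[[Feature, Sequence[Feature]], bool]


def has_separating_subset(
    y: Feature, candidates: Sequence[Feature], indep: IndepTest, k: int
) -> bool:
    """True if some subset S of candidates, 1 <= |S| <= k, gives Ind(y, C | S)."""
    for size in range(1, min(k, len(candidates)) + 1):
        for subset in combinations(candidates, size):
            if indep(y, subset):
                return True
    return False


def osfs(stream: Iterable[Feature], indep: IndepTest, k: int = 3) -> List[Feature]:
    bcf: List[Feature] = []
    for x in stream:
        # Relevance analysis (simplified assumption: unconditional test only).
        if indep(x, ()):
            continue
        bcf.append(x)
        # Redundancy analysis: re-examine every feature currently in BCF.
        for y in list(bcf):
            others = [f for f in bcf if f != y]
            if has_separating_subset(y, others, indep, k):
                bcf.remove(y)
    return bcf


# Toy oracle (hypothetical): "n" is independent of C outright, and "c"
# becomes independent of C once "a" is in the conditioning set.
def toy_indep(x: str, cond: Sequence[str]) -> bool:
    return x == "n" or (x == "c" and "a" in cond)


osfs_result = osfs(["c", "n", "a", "b"], toy_indep, k=2)
print(osfs_result)
```

In the toy run, "c" is accepted first and then removed when "a" arrives, which is the behaviour the redundancy phase is meant to provide.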

12 The pseudo-code of OSFS

13 Time complexity of OSFS
The time complexity depends on the number of independence tests.
At time t, assuming $|V|$ features are arriving, the worst-case complexity is $O(|V|\,|BCF|\,k^{|BCF|})$, where k is the maximum size to which a conditioning set may grow.
Assuming $SF \subseteq V$ and $|SF| \ll |V|$, where SF contains all strongly relevant features, the average time complexity is $O(|SF|\,|BCF|\,k^{|BCF|})$ at time t.
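To get a feel for why the number of independence tests dominates the cost, the snippet below counts how many conditioning sets of size at most k can be drawn from a candidate set of n features. It is only an illustration of the growth, not a derivation of the bound stated above; the function name `conditioning_sets` is hypothetical.

```python
from math import comb


def conditioning_sets(n: int, k: int) -> int:
    """Number of subsets of n features whose size is between 0 and k."""
    return sum(comb(n, i) for i in range(0, min(k, n) + 1))


# With 10 candidate features and k = 3: 1 + 10 + 45 + 120 subsets.
count_10_3 = conditioning_sets(10, 3)
print(count_10_3)  # 176
```

Each feature that is re-examined may, in the worst case, be tested against every one of these subsets, so both a larger BCF and a larger k increase the work quickly.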

14 Time complexity of OSFS
The most time-consuming part is the redundancy analysis phase.
When a new feature enters BCF, redundancy analysis re-examines each feature within BCF with respect to its relevance to C.
To further improve selection efficiency, Fast-OSFS is presented on the next page.

15 The Fast-OSFS Algorithm

16 The Fast-OSFS Algorithm
The key difference is that Fast-OSFS divides the redundancy analysis phase into two phases: inner-redundancy analysis and outer-redundancy analysis.
Fast-OSFS alternates only the relevance analysis and the inner-redundancy analysis phases.
In inner-redundancy analysis, Fast-OSFS re-examines only the feature just added to BCF.
In outer-redundancy analysis, it re-examines each feature of BCF, and only when the process of generating features has stopped.
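The split into inner and outer redundancy analysis can be sketched as below. As before, this is an assumption-laden illustration and not the slide's pseudo-code: `fast_osfs`, `has_separating_subset` and `indep` are hypothetical names, the relevance check uses only an empty conditioning set, and the end of the stream is taken as the point at which feature generation stops.

```python
from itertools import combinations
from typing import Callable, Hashable, Iterable, List, Sequence

Feature = Hashable
# indep(x, cond) -> True when x and the class label C are judged conditionally
# independent given the features in cond (hypothetical oracle).
IndepTest = Callable[[Feature, Sequence[Feature]], bool]


def has_separating_subset(
    y: Feature, candidates: Sequence[Feature], indep: IndepTest, k: int
) -> bool:
    """True if some subset S of candidates, 1 <= |S| <= k, gives Ind(y, C | S)."""
    for size in range(1, min(k, len(candidates)) + 1):
        for subset in combinations(candidates, size):
            if indep(y, subset):
                return True
    return False


def fast_osfs(stream: Iterable[Feature], indep: IndepTest, k: int = 3) -> List[Feature]:
    bcf: List[Feature] = []
    for x in stream:
        # Relevance analysis (simplified assumption: unconditional test only).
        if indep(x, ()):
            continue
        # Inner-redundancy analysis: examine only the newly arrived feature.
        if has_separating_subset(x, bcf, indep, k):
            continue
        bcf.append(x)
    # Outer-redundancy analysis: runs once, after feature generation stops.
    for y in list(bcf):
        others = [f for f in bcf if f != y]
        if has_separating_subset(y, others, indep, k):
            bcf.remove(y)
    return bcf


# Toy oracle (hypothetical): "n" is independent of C outright, and "c"
# becomes independent of C once "a" is in the conditioning set.
def toy_indep(x: str, cond: Sequence[str]) -> bool:
    return x == "n" or (x == "c" and "a" in cond)


late = fast_osfs(["c", "n", "a", "b"], toy_indep, k=2)   # "c" removed at the end
early = fast_osfs(["a", "c"], toy_indep, k=2)            # "c" rejected on arrival
print(late, early)
```

The per-arrival work now involves one feature instead of all of BCF, which is where the saving over OSFS comes from; the price is that a feature such as "c" in the first run stays in BCF until the final pass.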

17 Time complexity of Fast-OSFS
The worst-case complexity is $O(|V|\,k^{|BCF|} + |BCF|\,k^{|BCF|})$.
The average complexity is $O(|SF|\,k^{|BCF|} + |BCF|\,k^{|BCF|})$ at time t.

18 6. Experimental Results
Data sets: 8 UCI benchmark databases and 10 challenge databases.
Three classifiers were used: k-NN, J48 and Random Forest (Spider 2010); the best accuracy among them is reported as the result.
Grafting and Alpha-investing were run using their original implementations.
The tuning parameter λ for Grafting was selected using cross-validation.
The parameters of Alpha-investing were left at their default settings, $W_0 = 0.5$ and $\alpha_\Delta = 0.5$.
The conditional independence tests in our implementation are $G^2$ tests, and the parameter alpha is the statistical significance level.
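The slides name the $G^2$ test without giving its form, so the following is a sketch of the standard $G^2$ conditional independence statistic for discrete data, not the authors' implementation. The function names, the degrees-of-freedom convention (observed configurations of the conditioning set) and the small series-based chi-square tail are assumptions made so the example runs on the standard library alone; the tail routine is only meant for moderate statistic values.

```python
import math
from collections import Counter
from typing import Hashable, Sequence, Tuple


def chi2_sf(x: float, dof: int) -> float:
    """Upper tail of the chi-square distribution, computed from the series
    for the regularized lower incomplete gamma function P(dof/2, x/2)."""
    if x <= 0:
        return 1.0
    a, h = dof / 2.0, x / 2.0
    term = total = 1.0 / a
    n = 1
    while term > 1e-15 * total:
        term *= h / (a + n)
        total += term
        n += 1
    lower = total * math.exp(-h + a * math.log(h) - math.lgamma(a))
    return max(0.0, 1.0 - lower)


def g2_test(
    x: Sequence[Hashable],
    y: Sequence[Hashable],
    z: Sequence[Tuple],
    alpha: float = 0.01,
) -> Tuple[float, int, float, bool]:
    """Return (G2, dof, p_value, independent) for Ind(X, Y | Z).

    z holds one tuple of conditioning values per observation; use () for
    an empty conditioning set.
    """
    n_xyz = Counter(zip(x, y, z))
    n_xz = Counter(zip(x, z))
    n_yz = Counter(zip(y, z))
    n_z = Counter(z)
    g2 = 2.0 * sum(
        n * math.log(n * n_z[zv] / (n_xz[(xv, zv)] * n_yz[(yv, zv)]))
        for (xv, yv, zv), n in n_xyz.items()
    )
    # Assumed convention: one block of (|X|-1)(|Y|-1) per observed Z value.
    dof = max(1, (len(set(x)) - 1) * (len(set(y)) - 1) * len(n_z))
    p_value = chi2_sf(g2, dof)
    return g2, dof, p_value, p_value > alpha


# Toy data: Y copies X, so they are dependent marginally but independent
# once we condition on a variable that determines both.
xs = [0, 1] * 20
ys = list(xs)
marginal = g2_test(xs, ys, [()] * 40)
conditional = g2_test(xs, ys, [(v,) for v in xs])
print(marginal[3], conditional[3])
```

Independence is accepted when the p-value exceeds alpha, so a smaller alpha makes the test declare independence more readily.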

19 Results on UCI Benchmark Data Sets

20 The compactness and predictive accuracy of 4 algorithms (alpha=0.01)
The win/tie/loss counts of our methods vs. other methods:
                  OSFS    Fast-OSFS
Grafting          5/1/2   4/0/4
Alpha-investing   7/0/1   5/1/2
Note: Alpha-investing selects all features on the Wdbc data.

21 The compactness and predictive accuracy of 4 algorithms (alpha=0.05)
The win/tie/loss counts of our methods vs. other methods:
                  OSFS    Fast-OSFS
Grafting          3/2/3   4/1/3
Alpha-investing   7/0/1   8/0/0
Note: Alpha-investing selects all features on the Wdbc data.

22 OSFS performance with different alpha values
Fast-OSFS performance with different alpha values

23 A performance analysis with different alpha values
When alpha is increased to 0.05, our two algorithms tend to select more features, but their accuracy changes in different directions: OSFS degrades a little while Fast-OSFS improves a little.
At both alpha = 0.01 and alpha = 0.05, the two algorithms have similar performance in our experiments.

24 Results on Challenge Data Sets

25 The compactness and prediction accuracy (%) of four algorithms (alpha=0.01)
Alpha-investing failed to select any features.
Grafting fails to select any features on the Dorothea and Breastcancer data because it runs out of memory.
The win/tie/loss counts of our methods vs. other methods:
                  OSFS    Fast-OSFS
Grafting          8/0/2   7/0/3
Alpha-investing   8/0/2   6/0/4

26 Running Time Analysis
The time reported is the normalized time: the running time of OSFS on a data set divided by the corresponding running time of Fast-OSFS.
A normalized running time greater than one means that OSFS is slower than Fast-OSFS on the same learning task.

27 Running Time Analysis
On the UCI data sets, Fast-OSFS is at least twice as fast as OSFS.
Since the running time of Fast-OSFS and OSFS is less than one second on most of these data sets, we report only the running times longer than ten seconds, on five data sets, in Figure 8 (left: alpha=0.01; right: alpha=0.05).

28 Discussions: Grafting & Alpha-Investing
Grafting: on low-dimensional data sets it is competitive with our methods; on high-dimensional data sets it is inferior to our methods. Its main drawback is that the tuning parameter λ must be chosen in advance.
Alpha-investing: our algorithms outperform Alpha-investing on most of the 18 data sets. With prior knowledge of the structure of the candidate features, Alpha-investing could achieve good performance. Given such prior knowledge, our framework can also handle the task well.

29 Discussions: OSFS vs. Fast-OSFS
Compactness: Fast-OSFS is competitive with OSFS.
Predictive accuracy (our empirical findings): OSFS outperforms Fast-OSFS on data sets with a very small sample-to-variable ratio, while Fast-OSFS is superior to OSFS on data sets with a large sample.

30 Discussions: False Positives
Two strategies are used to control false positives: multiple comparisons and the parameter k.
The parameter k, the maximum size to which a conditioning set may grow, is a key parameter.
In online redundancy analysis, multiple statistical comparisons filter out redundant features: tests are performed over the subsets of BCF, and the size of the largest subset is k.
Under the assumption that all independence tests are reliable, and with a suitable value of k, false positives will be well controlled.
Accordingly, the experimental results show that our algorithms exhibit little sensitivity to false-positive features.

31 Conclusion
We have proposed a novel framework with two new algorithms to deal with streaming feature selection.
Compared with two state-of-the-art algorithms, Grafting and Alpha-investing, our algorithms have demonstrated greater compactness and better accuracy in supervised learning on databases that contain many irrelevant and redundant features.

32 Future work
In our experiments, we simulated a feature set with an unknown but finite size.
Explore how to dynamically assess the predictive accuracy with a feature set of infinite size, when reaching a certain threshold.
Study the impact of stopping criteria on the OSFS and Fast-OSFS algorithms.
Apply online streaming feature selection to real Mars crater data, where craters are represented by thousands of texture-based features that call for efficient feature selection.
