Use of Support Vector Machines for Supplier Performance Modeling


Roger Dingledine, Richard Lethin
Reputation Technologies, Inc.

Abstract. Support Vector Machines provide a powerful approach to supplier performance prediction. We explain what they are, how to use them, and introduce some research questions that will allow us to improve the performance of SVMs for the SRMS environment.

1 Introduction

Support Vector Machines (SVMs) can be used to model supplier performance in Supplier Reputation Management Systems (SRMS). SVMs can reliably score, predict attributes of, and classify prospective transactions that the user is considering. SVMs are desirable versus more traditional linear classification and regression models because they can use nonlinear combinations of the input attributes. They perform this nonlinear computation only implicitly, working in the dimensionality of the input (attribute) space. Further, they parametrically control the dimensionality of the model to avoid underfitting and overfitting.

Deploying SVMs in early implementations of SRMS is relatively easy: free, open source software implementations of SVMs are available. On the other hand, there are challenges to applying SVMs to our problem. These research questions are listed in Section 6. The solutions to these questions form the basis for the company to develop "secret sauce" around SVM implementation and deployment for the SRMS application.

2 What are SVMs?

SVMs are a machine learning algorithm for performing classification and regression via a hyperplane in a large virtual feature space. For classification, the SVM is given a set of inputs called the training set, and attempts to automatically determine a hyperplane in feature space that separates these inputs into two classes. The hyperplane allows the machine to make an informed classification of a test vector whose true classification is unknown. Based on the assumption that the test vector and the training set are drawn from the same source, the SVM has predictable bounds on getting the classification of the test vector correct. For regression, the SVM similarly uses training vectors, but derives a hyperplane-based function that can estimate a real-valued function.

One of the things that sets SVMs apart from more traditional linear systems is their use of what is known as a kernel function. Kernel functions allow the SVM to classify on features that are nonlinear functions of the training vector attributes. While the classification conceptually takes place in a space of very high dimensionality (the feature space), it only requires computation in the smaller-dimensional space of the training vectors (the attribute space or input space). The other thing that sets SVMs apart is parametric control of the capacity of the SVM (its VC dimension) to avoid underfitting and overfitting.

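As a concrete illustration of the kernel idea (a standard textbook example, not specific to the SRMS): with n attributes, the polynomial kernel

  K(x, z) = (x · z + 1)^d

equals an inner product in the space of all monomials of the attributes up to degree d, a space of dimension C(n + d, d). The machine can therefore separate data using those nonlinear features while only ever computing dot products of n-dimensional attribute vectors.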

Fig. 1. A model of learning from examples: a generator produces inputs x, a supervisor returns outcomes y, and the learning machine observes the pairs (x, y) (the training set) during the learning phase. After training, the machine must on any given x return a value ŷ. The goal is to return a value ŷ that is close to the supervisor's response y. [6, p. 18]

More formally, SVMs are presented with a series of l training vectors x_i ∈ R^n (that is, each training vector has n attributes) and a series of outcomes y_i, one for each training vector. There are two general classes of SVMs: classification and regression. A classification SVM restricts the y_i to binary outcomes (that is, y_i ∈ {−1, +1}), and the SVM derives a function f(x) that can best predict the y_i both for the training vectors x_1, ..., x_l and for some new set of test vectors x_{l+1}, ..., x_{l+k} for which the corresponding y_{l+1}, ..., y_{l+k} are unknown. Typically, the SVM is trained to minimize the number of prediction errors on the training set. Because the machine is forced to generalize, and under the assumption that test vectors are generated from the same functional/random relationship as the training vectors, one can plausibly expect a low number of errors on the test vectors. In other words, with a low number of errors on the training vectors, the predictions are likely to be good.

This document will not reiterate the theory of SVMs. The paper that introduced SVMs is [1]. Vapnik gives a broad overview of learning theory and SVMs in [6, 5]. A more focused introduction is in [3]. The first article in [4] contains a concise and clear introduction. The documentation for LIBSVM [2] is concise. Researchers in SVMs generally use a standard notation, which we adopt here.

2.1 Nonlinear fitting is more powerful

SVMs improve on linear classification algorithms in at least two ways. Firstly, linear classification engines are often limited in the concepts that they can learn and depend for their power on careful feature selection (that is, manual selection of combinations of the inputs into meaningful features). In contrast, SVMs perform, via kernel functions, their classification in a potentially huge-dimensionality space of features that are nonlinear combinations of the input attributes. Secondly, SVMs control their learning capacity relative to the size of the training set via the Structural Risk Minimization (SRM) principle [5, 6]. This capacity control simultaneously forces the machine to reduce both the error stemming from a limited set of samples and the error from possible overfitting of the model to the data. Overfitting occurs when the machine is more complex (higher dimensionality) than the data: the machine simply transfers the training vectors into the structure of the SVM, rather than generalizing patterns from them. SVMs limit overfitting by parametrically setting what's known as the VC dimension of the machine so that it matches the complexity of the data.

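For later reference, the functions such a machine computes take a standard form (a sketch; the weights α_i and the offset b are determined by training, and K is the kernel function discussed above):

  classification:  f(x) = sgn( Σ_{i=1}^{l} α_i y_i K(x_i, x) + b )
  regression:      f(x) = Σ_{i=1}^{l} (α_i − α_i*) K(x_i, x) + b

Only the training vectors with nonzero weight (the support vectors; see the scoring recipe in Section 3.2) contribute to these sums.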

2.2 SVMs automatically find a suitable model

SVMs can be legitimately critiqued as being "black box" learning algorithms that are not tuned specifically to the problem of SRMS. It might in fact be preferable to have a customized statistical model for the SRMS problem. The parts of such a statistical model M would directly correspond to the underlying design, manufacturing, transport, and measurement processes involved in the delivery and assessment of a product from a supplier. M would be a sort of "physical equations of motion" for the supplier relationship, perhaps incorporating a priori knowledge about the feasible range of values for unknown or hidden parameters to be solved from observations.

However, with limited customer deployments, Reputation Technologies probably does not have a strong basis for developing and validating such a model. There is some basis for hypothesizing the structure of a model M, based on world knowledge and knowledge of the business (some of these aspects of the problem can also be incorporated into the interface to the SVM). But even with these assumptions, the problem of developing M is compounded by the need to match the complexity of M to the quantity of sample data available. With a limited transaction history and a complex M, there may not be enough data to make a strong inference of the parameters of M. Another complexity is that it is possible for M to be structured so as to make solving for its parameters what is known as an ill-posed problem [6], with the parameters of M hypersensitive to small changes in the sample data. Finally, even if a model M were constructed a priori, there is no guarantee that it would be any good at providing predictive power.

SVMs address these problems. The job of the SVM is to find a model. Solving the formulation of the SVM results in a model of the data that is good at predicting the outcome. So, in a sense, the solution of the SVM is doing the R&D work needed to develop M. One of the major features of SVMs is that they parametrically control the complexity of the chosen model based on the quantity (and complexity) of the training set. This avoids the potential problem of M overfitting the data: the capacity of the SVM (its VC dimension) is tuned parametrically to match the data. The SVM algorithms are designed so they do not present ill-posed problems.

2.3 OK with modest amounts of data

The SVM is OK with modest amounts of data. It tunes the model complexity to the quantity of data, so the risk of a model being too complex to give sound answers at the outset is reduced.

SVMs are computationally challenging (but not intractable) to solve. With large quantities of sample data, we will need to improve the SVM solver algorithms and implementations to be able to solve them on our client computer systems (memory, cycles). (See Section 6.1 for a more detailed discussion of the scalability requirements.) But as we start out, the quantities of data will be small. We will be able to implement and deploy basic SVM solvers quickly, because they're available off the shelf.

3 Uses of SVMs for the SRMS environment

A classification SVM can be used in the Supplier Reputation Management System (SRMS) in the case where the required prediction is binary. For example, we might want to ask "Is this a good transaction or a bad transaction?", where the training vectors x_i are the attribute vectors in the transaction history database and the y_i are the historical classifications of those transactions as good or bad.

Multi-class classification SVMs similarly attempt to categorize prospective transactions based on training vectors, but instead of simply providing a binary prediction, they attempt to predict which of a series of discrete possibilities is most likely. For instance, a multi-class classification SVM might choose among a fixed set of integer labels.

A regression SVM differs from the classification SVM in that it allows the outcomes y_i to be in R. The machine then attempts to make estimates of the y_i. A regression SVM can be used in the SRMS to predict quantitative outcomes. That is, we might want to ask what the predicted quality on a scale of 1-10 is for the prospective transaction. Again, because the SVM is forced to generalize during training, and assuming that future transactions are generated from the same process as the training transactions, a low rate of errors on the y_i in the training set gives good confidence in predictions for the test set.

More generally, we might use SVMs to address any of the following problems.

3.1 System Context of the SVM

SRMS as part of the corporate process. Three parts of the SRMS: data gathering, analytics, decisioning. Interfaces between the three components; certain information is shared. The Vapnik dictum to avoid solving intermediate problems, versus the client desire to remain in the loop.

3.2 Scoring

The general approach to scoring with an SVM in the SRMS environment is as follows (a minimal code sketch of the recipe appears after the list):

1. Obtain a database of historical transactions along with their outcomes.
2. Choose a set of attributes which you feel are relevant to the situation at hand. These are going to be the inputs to the SVM; they should represent all of the factors that influence a score. See Section 6.10 for more consideration of how to select attributes.
3. Separate out the transactions that are well-formed, that is, not missing any attributes (see Section 6.2 for consideration of this issue). Each transaction must also include an outcome: what the score of a transaction like this ought to be. These well-formed transactions will be used to train the SVM.
4. Based on the attributes which you want to use for the SVM, construct a training set {(x_1, y_1), ..., (x_l, y_l)} consisting of your l well-formed historical transactions. Each x_i is itself a vector of attributes for that transaction, and y_i is the outcome for that transaction. Together, these (x_i, y_i) pairs make up your training set.
5. If desired, manually create some features (as further or replacement input dimensions) by combining attributes; see Section 6.11.
6. Scale and/or normalize the x_i values so that it is easier for the machine to work with and compare them. See Section 6.4 for some ideas on why this is important.
7. Construct a regression SVM, choosing either an ε-Support Vector Regression machine or a ν-Support Vector Regression machine (see Sections 2.4 and 2.5 of [2] respectively). See Section 6.5 for some discussion of how to determine which is better.
8. Decide on a kernel function and appropriate parameters; see Section 6.12 for some help with this.
9. Train the SVM on your training set. This results in a weight α_i for each training vector; intuitively, this α_i corresponds to the amount of information that training vector provides to the overall scoring system: how important it is. If all went well, some (hopefully many) of the training vectors will have α_i = 0, meaning they contain redundant data. The training vectors with non-zero α_i are known as the support vectors. Keep them and their corresponding α_i's.
10. Given a prospective well-formed transaction, you can determine its score by summing the influence of each support vector relative to that transaction. The influence of a support vector is a function of its α_i (higher α_i means more influence) and of the kernel function applied to the prospective transaction and that support vector (see Section 6.12).

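A minimal end-to-end sketch of steps 4-10, assuming the Java bindings (the "libsvm" package) distributed with LIBSVM [2]. The attribute values, scores, class name, and the toRow() helper are hypothetical illustrations, not part of any SRMS schema; the parameter values are placeholders rather than recommendations.

  import libsvm.*;

  public class ScoreModel {
      // Convert one transaction's scaled attributes into libsvm's sparse row format.
      static svm_node[] toRow(double[] attrs) {
          svm_node[] row = new svm_node[attrs.length];
          for (int j = 0; j < attrs.length; j++) {
              row[j] = new svm_node();
              row[j].index = j + 1;      // libsvm attribute indices start at 1
              row[j].value = attrs[j];   // assumed already scaled to [0, 1] (step 6)
          }
          return row;
      }

      public static void main(String[] args) {
          // Hypothetical training data: attribute vectors and historical scores (step 4).
          double[][] attrs = { {0.2, 0.9, 0.1}, {0.8, 0.4, 0.7}, {0.5, 0.5, 0.5} };
          double[] scores  = { 8.0, 3.0, 6.0 };   // outcomes y_i, e.g. quality on a 1-10 scale

          svm_problem prob = new svm_problem();
          prob.l = attrs.length;
          prob.y = scores;
          prob.x = new svm_node[prob.l][];
          for (int i = 0; i < prob.l; i++) prob.x[i] = toRow(attrs[i]);

          svm_parameter param = new svm_parameter();
          param.svm_type = svm_parameter.EPSILON_SVR;  // step 7: epsilon-SV regression
          param.kernel_type = svm_parameter.RBF;       // step 8: kernel choice (Section 6.12)
          param.gamma = 1.0 / 3;                       // placeholder: one over the number of attributes
          param.C = 1.0;                               // capacity parameter (placeholder)
          param.p = 0.1;                               // epsilon in the regression loss (placeholder)
          param.eps = 1e-3;                            // solver stopping tolerance
          param.cache_size = 40;                       // kernel cache in MB

          String err = svm.svm_check_parameter(prob, param);
          if (err != null) throw new IllegalArgumentException(err);

          svm_model model = svm.svm_train(prob, param);            // step 9: training
          double predicted = svm.svm_predict(model, toRow(new double[] {0.6, 0.6, 0.2}));
          System.out.println("predicted score: " + predicted);     // step 10: scoring
      }
  }

The svm_model returned by svm_train holds the support vectors and their weights from step 9, and svm_predict computes the weighted-kernel sum of step 10.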

The SRMS can simultaneously train on different outcomes. For example, the training could be on outcomes for quality, overall satisfaction, delivery, and so on. While conceptually the training is simultaneous, the SRMS would set up a different SVM for each outcome. If the outcome information is already available in the database, the implementation of this is straightforward: the training vectors use the quantities directly. If the information is not already available in the history, it can be generated virtually, by the client implementing a function that they consider to represent their criteria for scaling that outcome metric as a function of other metrics. For example, if particular thresholds of delivery lateness are assigned particular values of delivery score, the SVM can be trained on these input scores. We find it interesting how this explicit construction of training outcomes on the output side of SVMs resembles explicit feature construction on the input side. Note that as the operational objectives of the business change, these output functions can be redefined and the SVM retrained to reflect the new outcome definitions.

3.3 Recommending whether to buy

SVMs can also be trained to make a purchasing decision, in this case whether to buy or not. As in the previous section, the SRMS historical transaction database is used to form the x_i for the training set; this includes selecting attributes, explicitly constructing features, and normalizing all attributes and features to the range of 0.0 to 1.0. However, in this application, the y_i for the training set are set equal to the proper decision for that historical transaction: whether, in retrospect, the purchaser should have bought. Since this is a discrete outcome to be learned, a classification SVM (C-SVC or ν-SVC in LIBSVM) is used. Now, faced with a prospective transaction x_i (the test vector) for which y_i is unknown, the trained SVM can give a prediction ŷ_i for that transaction, i.e., a decision on whether to buy. As mentioned in Section 3.2, if this outcome decision y_i is not directly available in the historical transaction database, it can be constructed virtually.

Framing the use of the SVM to make decision recommendations goes beyond what some clients have indicated they want from the SRMS. One client stated that they don't want the decision made for them, and in fact are skeptical that a machine can make this decision for them. For the SRMS, they want the analytical portion of the SRMS to present data in a way that makes the history clear and supports the human decision on purchase. This customer requirement would tend to rule out the use of SVMs for making the recommendation of whether to buy, but this force is countered by the theoretical dictum from statistical learning theory: "When solving a given problem, try to avoid solving a more general problem as an intermediate step" [6, p. 30, author's emphasis]. Integrating decisioning (making the buy/no-buy recommendation) with the analysis is the preferred course, according to Statistical Learning Theory. (We need to think more about this tension.)

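Relative to the scoring sketch in Section 3.2, only a few lines change (same hypothetical attribute data and ScoreModel.toRow() helper, again assuming the libsvm Java bindings): the outcomes become the retrospective buy/no-buy decisions coded as +1 and −1, and the machine type becomes C-SVC classification.

  double[] decisions = { +1, -1, +1 };            // y_i: should the purchaser have bought?
  prob.y = decisions;                              // instead of the real-valued scores

  param.svm_type = svm_parameter.C_SVC;            // classification rather than epsilon-SVR
  // param.p is unused for C_SVC; kernel, gamma, C, eps, cache_size stay as before

  svm_model buyModel = svm.svm_train(prob, param);
  double recommend = svm.svm_predict(buyModel, ScoreModel.toRow(new double[] {0.6, 0.6, 0.2}));
  // recommend is +1.0 (buy) or -1.0 (do not buy) for the prospective transaction
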
3.4 Predicting other attributes about the prospective transaction

In general, the attributes of a transaction do not all arrive at the same time. Information about shipping arrives after the order is placed; quality measurements are made as the product is initially evaluated and then as it moves through the factory and into the field.

The late-arriving attributes may be predictable from the earlier attributes. This means that the difference between an attribute and an outcome becomes blurred: we can build an SVM to predict outcomes, but we can similarly build an SVM to predict unknown attributes of a prospective transaction.

3.5 Ranking Suppliers and Pairwise Comparison

Pairwise comparison between two suppliers indicates which is more appropriate for that transaction.

3.6 Flagging high-loss-potential situations

Statistical learning theory is framed as a loss-minimization problem, where the loss function is the expectation of loss based on estimating the outcomes of the function on which the learning machine is trained. This is well aligned with the corporate objective of increasing profit, which is simply the inverse of reducing loss, so it should be possible to train not only on score but also to weight by the transaction risk.

3.7 Flagging when a supplier is not behaving as his average

By doing linear regression as well as SVM prediction, widely different answers can indicate a trend or other situation that a human should look at. The trend may indicate either a change in the behavior of the supplier or a deviation of the model.

3.8 Explanation of scores

Because the SVM uses certain training vectors as benchmarks, it can indicate which previous transactions the transaction in question is like. But see Section 6.8.

3.9 Gaming detection

(Including how much gaming risk is detected.)

4 Reliability of SVMs

[To do: insert the equations for the bounds on the expected error from SVM theory; a commonly quoted form is sketched at the end of Section 5.]

5 Advantages of SVMs for SRMS

- A priori model selection is difficult lacking access to actual supplier rating data.
- Nonlinear combinations of the inputs are very plausible explanatory features in the SRMS model. (Basically, any explanation with an "and" in it falls into this category.)
- The statistical-learning-theory formulation of the learning problem as risk minimization is aligned closely with the objectives of the client corporation. This seems to be the reason why many problems that the business faces can be formulated naturally in the framework.
- Theoretically defensible.
- Computationally intense, but the rate of problem submission is not high. (But precomputation or some other engineering approach is going to be needed to provide interactivity.)

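Picking up the note in Section 4 and the risk-minimization item above: in statistical learning theory [6] the quantity to be minimized is the expected risk

  R(α) = ∫ L(y, f(x, α)) dP(x, y),

which must be estimated from the empirical risk on the training set,

  R_emp(α) = (1/l) Σ_{i=1}^{l} L(y_i, f(x_i, α)).

A commonly quoted form of the bound (for classification with the 0/1 loss; quoted here as a sketch rather than the exact statement intended for Section 4) is that, with probability at least 1 − η,

  R(α) ≤ R_emp(α) + sqrt( ( h (ln(2l/h) + 1) − ln(η/4) ) / l ),

where h is the VC dimension of the set of functions the machine can implement. The SRM principle of Section 2.1 amounts to choosing h to balance the two terms on the right-hand side.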

6 Problems and Research Areas

6.1 Scalability

Solving an SVM is a large quadratic optimization problem. With small amounts of data, the SVMs will be solvable with off-the-shelf technology, like LIBSVM, or even MATLAB-based formulations of SVMs. If the beta deployments of an SVM/SRMS work on small amounts of data, we will obviously want to make them scale. The key point is that if they work on small amounts of training data, they can only do better with more data. Scaling SVMs will be hard work, but this is an opportunity for our technical prowess to differentiate us positively from any competition. The opportunities for scaling the SVMs are as follows:

- We should examine the structure of the working SVM and the data to determine if there is a simpler model M that can shortcut the learning procedure. In other words, if the SVM works well, we might be able to develop an alternate representation that bypasses the model-search portion of the SVM task and instead moves directly to parametric extraction.
- Focusing of input attributes: if there are input attributes that are clearly irrelevant, they can be removed.
- Preprocessing to add essential features.
- Pruning the training set: if we have a large number of redundant or information-free data points (that is, their α is 0), then we might be able to remove them from the training set to speed things up.
- Floating point: improving the Java implementation to speed up floating-point computation.
- Parallel computing.

6.2 Data completeness

Section 3.2 describes a transaction as well-formed if it includes values for all of the attributes that we use as input dimensions to the SVM. However, in reality our database of historical transactions will be incomplete: it will have gaps, perhaps over certain timeframes or perhaps with other patterns. If a training vector does not have an associated outcome, then it seems clear that we cannot learn anything from that training point. But how well can we deal with training vectors that include an outcome but do not have all their attributes filled in?

One approach is to attempt to interpolate a value for that attribute, but this interpolation assumes that we already have a model for how the data ought to behave; otherwise we have no way to generate what the value should be. If there are patterns of gaps (e.g., one large set of data is simply missing one attribute and is otherwise complete), we could use a separate SVM that trains and answers queries without considering that attribute. Alternatively, we could build a separate SVM to attempt to predict the values of missing attributes. This may be more trouble than it's worth. The best general approach that we have so far is to throw out incomplete data.

6.3 Data quality

How well can we deal with training sets that include errors? The so-called soft-margin SVMs are designed to handle the case where some of the training points are misclassified, by introducing a slack term into the optimization problem (representing how far on the wrong side of the hyperplane that point is) and trying to minimize the slack for each training point.

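For reference, the standard primal form of the soft-margin machine just described is (a sketch; Φ is the implicit feature map and C controls how heavily slack is penalized):

  minimize over w, b, ξ:   (1/2) ||w||² + C Σ_{i=1}^{l} ξ_i
  subject to:              y_i (w · Φ(x_i) + b) ≥ 1 − ξ_i,   ξ_i ≥ 0,   i = 1, ..., l.

Larger C tolerates fewer misclassified training points but raises the risk of fitting noise in the training set.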

We need to develop a stronger methodology for determining how many errors the training set can include before the resulting predictions stop being useful. Perhaps we can learn how to detect patterns of misclassification in the training set. We also need to learn more about the stability of the SVM's predictions under changes in the training set. Can we alter the prediction dramatically by adding or removing a few training points? Is that OK?

6.4 Learn more about input peculiarities

Experimentally, it seems that the SVM performs more accurately if the training vectors are scaled and normalized. Why? What factors in the SVM's internal design (or implementation?) contribute to this requirement? Are there other aspects of the data that would increase accuracy if they were normalized or scaled? Why?

6.5 How do we know when it's working?

How do you compare two trained SVMs to decide which is better? How do you know when an SVM is good enough that it produces acceptable predictions?

6.6 How do SVMs behave in the presence of dishonesty or manipulation?

We need to research the effects of directed and undirected attempts at manipulation on the SVM procedure and data. How resistant is the SVM to a subset of the training set that is carefully engineered to influence the hyperplane calculation? In particular, how hard is it to calculate additional training points which push the hyperplane just far enough to misclassify a given prospective transaction?

As a separate question, can we build an SVM which is specifically designed to detect gaming in the underlying data set? Perhaps we can build (or add to the data set) training vectors which explicitly represent gaming attempts (we would call these new vectors attack signatures). The SVM would then learn to recognize these attack signatures in newly arriving data, and might be able to flag new prospective transactions as carrying some attack signature.

6.7 Adding time-variant behavior to SVMs

More generally, it seems difficult to give SVMs a series of training points as input at once. To notice time-based patterns, the SVM might need to learn the significance of an extra timestamp input. How hard is it to make this work? How important is it in producing reliable and robust predictions? The question of modelling time variance in an SVM is an ongoing research problem in the literature.

6.8 English explanation of the SVM's decision

One of the major flaws of the SVM approach to modelling is that the SVM acts as a black box: it is hard to manually repeat or understand how it arrived at its output. This leads to mistrust of the machine's performance by those who will use it. What possibilities are there to output more intuition along with the machine's prediction? For instance, can we modify the machine to justify its prediction, e.g., by enumerating the points that are nearby in feature space, ideally along with a description of why the two points are related?

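One cheap form of the "explanation" asked for in Section 6.8 (and in Section 3.8) is to report the support vectors that contribute most to a prediction, i.e. the historical transactions the prospective one is "most like". A minimal sketch in plain Java, under the assumption that the support vectors and their weights α_i can be read out of the trained machine; the array values and the RBF width are hypothetical.

  public class ExplainPrediction {
      // Gaussian (RBF) kernel between two scaled attribute vectors.
      static double rbf(double[] a, double[] b, double gamma) {
          double d2 = 0;
          for (int j = 0; j < a.length; j++) { double t = a[j] - b[j]; d2 += t * t; }
          return Math.exp(-gamma * d2);
      }

      // Index of the support vector with the largest contribution |alpha_i| * K(sv_i, x).
      static int mostSimilar(double[][] sv, double[] alpha, double[] x, double gamma) {
          int best = 0;
          double bestContrib = -1;
          for (int i = 0; i < sv.length; i++) {
              double c = Math.abs(alpha[i]) * rbf(sv[i], x, gamma);
              if (c > bestContrib) { bestContrib = c; best = i; }
          }
          return best;
      }

      public static void main(String[] args) {
          double[][] sv    = { {0.2, 0.9, 0.1}, {0.8, 0.4, 0.7} };   // support vectors (hypothetical)
          double[]   alpha = { 0.7, -0.3 };                          // their weights
          double[]   x     = { 0.6, 0.6, 0.2 };                      // prospective transaction
          System.out.println("most similar historical transaction: index "
                             + mostSimilar(sv, alpha, x, 0.5));
      }
  }

Whether this kind of nearest-benchmark report is convincing to the people making the purchase decision is exactly the open question of Section 6.8.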

6.9 What does confidence mean to an SVM?

The black-box property of the SVM is made much more dangerous because the SVM has no way to detect that a prospective transaction is wildly different from the transactions used for training. That is, the SVM has no way to output "I don't know" for such transactions. We could develop a separate process for detecting the distance of the test vector from each support vector or training vector, but it is not clear how to determine whether a point is far away because it is on the borderline (near the hyperplane), or far away because it is very clearly on one side of the hyperplane. Since we only use the support vectors for making predictions, it is not clear that a straight distance measurement would be meaningful at all.

Note that confidence in a machine is different from confidence in a data point. An SVM's ability to learn a given training set can be described either as the number of errors it makes in classifying the training set, or as the number of errors it makes in classifying a separate test set (a set designed specifically to see how well the SVM learned to classify in that situation). Traditionally, the confidence in a given SVM has been measured by this approach; after all, one of the assumptions is that the training set comes from the same distribution as any future questions. Yet this presents a fixed value for all possible inputs to the machine. Clearly more thought needs to be given to this.

6.10 Selecting attributes

So which attributes are we going to use, anyway? How can we pick them?

6.11 Hand-picking some features

Perhaps based on clustering or a closeness metric, to do some of the business work of picking relevant feature dimensions for the machine.

6.12 What kernel function(s) should we use?

Say, Gaussian? (No, I don't know why.) Plus we need to pick some good parameters. (The kernel forms built into LIBSVM are listed for reference after Section 6.13.)

Can we pick kernels in this manner: compute the error over the training set for kernels K_1, K_2, ..., then select the kernel that gives the best answer? (This seems to be a violation of the spirit of the SVM. Specifically, suppose that the number of kernels to choose from were infinite. Then one can imagine picking the kernel function that gives a perfect fit for the data. There is no basis for believing that this is generalizing; the selected kernel could be one that rote-learns the data. How is this problem manifest in the assumptions involved in SVMs?) This inconsistency seems to imply that an a priori selection of kernels is necessary. In that case, one might decide to choose the kernel based on the nature of the problem domain.

6.13 Network Formulation

Can we use an SVM on a web of trust? How well can it learn in the face of potential pseudospoofing and other attacks? How effective would it be at evaluating the Advogato trust network? How can we tell whether it worked?

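For reference, and relating to Section 6.12: the kernel functions built into LIBSVM [2], each with the parameters that would need to be picked, are

  linear:          K(x, z) = x · z
  polynomial:      K(x, z) = (γ x · z + r)^d
  Gaussian (RBF):  K(x, z) = exp(−γ ||x − z||²)
  sigmoid:         K(x, z) = tanh(γ x · z + r)

(This is the library's menu of built-ins, not a recommendation of which to use for the SRMS.)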

6.14 Constraining the SVM to be defensible to suppliers

6.15 Business examples/arguments for using SVMs

6.16 Assumptions underlying the SVM

The correct performance of the SVM relies on the modeling assumption that the training set and the test set are generated from the same probability distribution. There are two parts to this: (1) that the training examples are representative of the test set, and (2) that the relationship between the training attributes and the training outcome is representative of this relationship in the test set.

If the client's underlying corporate processes change, the training examples could cease to be representative. Examples include the corporation starting to purchase new types of items, and the supplier changing its business conditions. (We need to relate this to the business risk estimation problem in Section 3.6. We also need to think about the issue where the estimator is used to drive decisions, which changes the framing of the purchases.)

(Consider this: if prospective transactions are labelled bad and they don't go through, the system could degenerate, because the representativeness of the transaction history could diminish, increasingly lacking the training vectors that indicate bad transactions. Some of this could be engineered into the system, where prospective transactions which are rejected should still go into the database. This seems like an important general engineering consideration for all scoring. Is there a way of structuring the Java code so that this correctness requirement is embodied in an Interface or Pattern so that it's not forgotten?)

References

1. B. E. Boser, I. M. Guyon, and V. N. Vapnik. A training algorithm for optimal margin classifiers. In D. Haussler, editor, Proceedings of the 5th Annual ACM Workshop on Computational Learning Theory. ACM Press.
2. Chih-Chung Chang and Chih-Jen Lin. LIBSVM: a library for support vector machines (version 2.3). Department of Computer Science and Information Engineering, National Taiwan University. Available on the web.
3. Nello Cristianini and John Shawe-Taylor. Support Vector Machines and other kernel-based learning algorithms. Cambridge University Press.
4. Bernhard Schölkopf, Christopher J. C. Burges, and Alexander J. Smola, editors. Advances in Kernel Methods. MIT Press.
5. Vladimir N. Vapnik. Statistical Learning Theory. Wiley.
6. Vladimir N. Vapnik. The Nature of Statistical Learning Theory. Statistics for Engineering and Information Science. Springer, New York.