An Interactive Exploration of Machine Learning Techniques By: Michael Mundt of Revolution Capital Management

We last discussed machine learning in our 2015 newsletter. After acknowledging the appeal of such an approach for building trading models, we nevertheless concluded that the weak predictability of most models, coupled with the ease of generating model candidates, quickly leads one into the trap of over-fitting. Freedman's Paradox, discussed in that newsletter, nicely encapsulates the conundrum: it describes the near-guarantee of finding a back-tested model with good performance when the number of model candidates is on par with the number of data points available for building such models, even when the models are known to possess no skill.

Two years on, the lack of performance within the overall CTA space has generated continued interest in this data-intensive approach (presumably in search of a better, more consistent edge than what trend-following models have provided in recent years). Some large CTAs have dedicated entire research units to machine-learning techniques, and while there is understandably much promise in these approaches, there is also much reason to remain skeptical. The low signal-to-noise ratio in predictive trading signals, coupled with the relative dearth of data, makes the identification and extraction of such signals quite difficult. By way of contrast, a recent article in The Economist outlined a machine-learning approach for improving retail supply chains and noted that it learned from a data set of 3 billion transactions. That number of available samples is truly stupendous, many orders of magnitude greater than any data set applicable to short-term futures trading.

In the 2015 newsletter, we also alluded to our own prior efforts at automated model building and selection, and noted that Freedman's Paradox quickly becomes one of the primary obstacles in building a trading system via this method. Nonetheless, much of our argument was presented in the relatively dry language of mathematical statistics. To better illustrate the difficulty with such approaches to system building, we decided to revisit the topic by introducing a simple game. The game can be played interactively and requires only the click of a mouse button. For those readers who have access to the software application MATLAB, we are offering a free copy of the tool so that this can, in fact, be tried at home.

The tool is unimaginatively called dataminer, and the idea is as follows. We assume that we have an existing framework with which to generate a large population of possible trading models. The structure of this framework is not important for our purposes; we simply require that the framework generate a variety of distinct models and evaluate the profitability of each model using the same data set each time. The profit potential of each model is measured by computing the Sharpe ratio (mean return divided by the root-mean-square deviation of the returns) and annualizing the value to provide an effective yearly Sharpe ratio for each candidate model. The dataminer tool ignores the details of the model-generation framework and mimics only the final output of the process by generating Sharpe ratio distributions with specific means (corresponding to either skilled or unskilled models), thus avoiding the necessity of building actual trading models. This allows one to focus on the essential task of deciding, based solely on past performance, which models to keep and which to discard.
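For readers who want a feel for what such a tool does without running it, the core idea fits in a few lines of MATLAB (the language the actual tool is written in). The sketch below is our own minimal illustration, not the dataminer source code; the variable names and the 250-day annualization convention are our assumptions, and the parameter values happen to match the first scenario described below. It draws each candidate's measured annualized Sharpe ratio directly from its approximate sampling distribution rather than simulating returns.

    % Minimal sketch of a dataminer-style simulation (our illustration,
    % not the tool's actual source code).
    nModels    = 1000;    % size of the model search space
    pSkill     = 0.15;    % probability that a candidate model has skill
    sharpeTrue = 0.75;    % true annualized Sharpe of each skilled model
    nTrain     = 25000;   % daily training samples per model
    daysPerYr  = 250;     % trading days per year (our convention)

    nSkilled  = round(pSkill * nModels);
    isSkilled = [true(nSkilled, 1); false(nModels - nSkilled, 1)];

    % For i.i.d. daily returns, the measured annualized Sharpe ratio is
    % approximately normal around the true value, with standard deviation
    % sqrt(daysPerYr / nTrain).
    se        = sqrt(daysPerYr / nTrain);
    sharpeHat = sharpeTrue * isSkilled + se * randn(nModels, 1);

    histogram(sharpeHat, 50);    % analogous to Figure 1 below
    xlabel('Annualized Sharpe ratio (training data)');
    ylabel('Number of models');

With 25,000 training points, the two clusters in the resulting histogram are unmistakable; re-running the same sketch with nTrain = 2500 reproduces the single smeared cluster of the second scenario below.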
Some may argue that our abstraction of the model-building process is too great, and that a simultaneous batch evaluation of a large number of models underestimates the iterative, value-added adaptation processes utilized by machine-learning algorithms. We disagree, and posit that the abstraction perfectly captures any procedure that attempts to build models from data. Sequential processes discard numerous models as they attempt to improve their ability using the training data; batch approaches generate a large population of models at once and make a single-pass decision on model quality. The details may differ, but all efforts to build data-driven models must eventually rely on statistics to separate good models from bad based on past performance.

To run a simulation, the user needs to specify four inputs. The first input is the number of models to test, while the second is the likelihood that a model has skill. For example, if the user specifies a search space of 1000 models and a probability of 0.05 that a model has skill, then the tool will generate 950 models with no skill and 50 models with the desired amount of skill.

The skill level itself is the third input and is characterized by the user-defined Sharpe ratio. One necessary complexity (for reasons that will become clearer shortly) is that the Sharpe ratio can be specified on a per-model basis or on an overall basis. With the former setting, each of the 50 skilled models in our example will have the specified Sharpe ratio, so that a system including only those skilled models would have a Sharpe equal to √50 times the per-model Sharpe ratio (i.e., the usual square-root diversification that arises from uncorrelated models). If instead the overall Sharpe is specified, then the Sharpe of all 50 models together will equal the user-specified value, meaning that each individual model (in the example above) would have a Sharpe equal to 1/√50 of that value.

The goal of the simulation is to see how well, based on the output from the training data, the user can distinguish between the skilled and unskilled models that are generated with the user-defined values. When the simulation is run, the user is presented with a graph similar to that shown in Figure 1, along with a request to click on the graph where the distinction should be drawn between skilled and unskilled models. The graph shows the performance of all 1000 models using the training data; the user is not told which of the results are based on skillful models and which are the outcomes of the unskilled models.

For now we only need to specify one more input: the number of training points available for evaluating each model. Lacking any known, immutable laws governing price dynamics, data is the lifeblood of systematic model development, and yet it is always in short supply. For our first example, however, let's consider an ideal situation where the relative frequency of skilled trading models is high, the Sharpe ratio of each individual model is reasonably good, and an abundance of data exists. This set of parameters showcases an ideal model-development scenario and sets the stage for the more difficult, and realistic, cases to follow. We will set the number of candidate models to 1000 and assume that the probability of finding a skilled model is 0.15, providing a population of 150 skilled models and 850 random models. To help differentiate the two model types, we will evaluate the performance of each model using 25,000 data points (we can consider each data point to be a daily return from that model, implying that we have about 100 years' worth of data for each model). Finally, we will assume that each skilled model has an annualized Sharpe ratio of 0.75. The parameter choices are summarized in Table 1.

PARAMETER                      VALUE
NUMBER OF MODELS               1000
PROBABILITY OF SKILLED MODEL   0.15
SHARPE OPTION                  PER MODEL
SHARPE PER MODEL               0.75
NUMBER OF TRAINING POINTS      25,000

TABLE 1: SUMMARY OF PARAMETERS FOR SCENARIO 1

FIGURE 1: SHARPE RATIOS OF ALL 1000 MODELS BASED ON TRAINING DATA, SCENARIO 1

The user thus needs to make a decision on a performance threshold that will be used to separate skilled from unskilled models. Any model whose historical performance is greater than the cutoff (i.e., to the right of the chosen value) will be used in the system on a walk-forward basis (using previously unseen data), and any model whose historical performance is less than the cutoff will be discarded. Figure 1 shows two clear distributions of Sharpe ratios, one centered around a Sharpe value of zero and the other centered around a Sharpe value of about 0.8.
Most likely, these two distributions correspond to the unskilled and skilled models, respectively, which have true means of 0 and A natural way to distinguish these two distributions is to separate them near Sharpe=0.4. We will hypothesize that any model with a Sharpe greater than this cutoff is a skilled model (whose sample performance may be less than 0.75 but whose long-term performance will be asymptotically equal to 0.75). Any model with a Sharpe less than this is presumably an unskilled model with a true Sharpe equal to 0 (and any deviation from zero is simply due to random variability given the finite sample size).
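The choice of 0.4 can be motivated quantitatively. The back-of-the-envelope calculation below is ours, using the same normal-sampling approximation as the earlier sketch (normcdf requires the Statistics and Machine Learning Toolbox):

    % Sampling noise of an annualized Sharpe estimate over T daily points
    % is roughly sqrt(250/T): about 0.10 for T = 25,000.
    T  = 25000;
    se = sqrt(250 / T);

    % At a cutoff of 0.4, the unskilled (mean 0) and skilled (mean 0.75)
    % clusters each sit roughly four standard errors away from the cutoff,
    % so essentially nothing is misclassified.
    cutoff     = 0.4;
    pFalseKeep = 1 - normcdf(cutoff / se);          % unskilled models kept
    pFalseDrop = normcdf((cutoff - 0.75) / se);     % skilled models discarded

Both probabilities come out to a few in ten thousand at most, consistent with the perfect accuracy reported below. Repeating the calculation with T = 2500 gives a standard error near 0.32; the same arithmetic then predicts roughly 90 of the 850 unskilled models kept and about 20 of the 150 skilled models discarded, close to what the second scenario actually produces.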

Clicking on the graph produces a vertical line where the click is made and re-colors the graph to reveal the unskilled models in blue and the skilled models in green, providing the user with feedback on the skill of the chosen cutoff value. Figure 2 shows the results for this case. It is clear that a cutoff of 0.4 is in fact a good choice. The tool also reports how many skilled models are correctly kept and how many unskilled models are inadvertently retained. In this example, our accuracy is perfect: we keep 150 models, all of them skilled, and discard all 850 unskilled models.

FIGURE 2: AFTER CLICKING ON THE CHOSEN CUTOFF, THE CHOSEN VALUE IS SHOWN BY THE RED DASHED LINE AND THE SKILLED AND UNSKILLED MODELS ARE REVEALED IN GREEN AND BLUE, RESPECTIVELY (SCENARIO 1)

In this example, the best Sharpe that we can achieve in the long run is √150 × 0.75 ≈ 9.2 (for finite sample sizes the measured value will be slightly above or below this). Figure 3, another output from the tool, shows the Sharpe value achieved versus the chosen cutoff value. The red dashed line shows what is termed the naïve Sharpe ratio: the value that would result from keeping all 1000 models and making no effort to distinguish skill from chance. In this case, only 150 models provide a positive return, but all 1000 models add to the variability of the returns, diluting the Sharpe ratio; the naïve Sharpe ratio is 3.56 for this simulation (i.e., 150 × 0.75/√1000). The black dashed line shows the maximum possible value of 9.2, and the solid blue line shows the Sharpe obtained versus the chosen cutoff value.

As is evident, for any cutoff value between 0.2 and 0.6 we will achieve a Sharpe ratio close to the maximum value. This makes intuitive sense. Given the clear separation between the distributions of unskilled and skilled models, any cutoff chosen between the two will generally result in keeping most or all of the skilled models while introducing few (if any) unskilled models. Choosing a cutoff far to the left means keeping most or all models, and the Sharpe approaches the naïve value; choosing a cutoff successively farther to the right eventually results in discarding all models, and the Sharpe hence approaches zero.

FIGURE 3: EXPECTED SHARPE VERSUS CHOSEN CUTOFF, SCENARIO 1

Let's now consider how the results change when we have far less data with which to evaluate the skill of the candidate models. Keeping the other parameters unaltered, we will run a simulation where only 2500 data points are available. The parameters are summarized in Table 2. Figure 4 shows the distribution of Sharpe ratios, and it looks surprisingly different from Figure 1: instead of two discrete clusters, there is one big cluster with a slight hump on the right. In this case, we also have an advantage that doesn't exist when building real-world models, namely that we know what the Sharpe ratio of the skilled models is. Even though we can't see the cluster of skilled models very well, we know that we should choose a cutoff of around 0.40 in order to separate the skilled models (with an average Sharpe of 0.75) from the unskilled models (with an average Sharpe of 0) as best we can.

PARAMETER                      VALUE
NUMBER OF MODELS               1000
PROBABILITY OF SKILLED MODEL   0.15
SHARPE OPTION                  PER MODEL
SHARPE PER MODEL               0.75
NUMBER OF TRAINING POINTS      2,500

TABLE 2: SUMMARY OF PARAMETERS FOR SCENARIO 2
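The expected-Sharpe-versus-cutoff curves that the tool draws (Figure 3 above, Figure 6 below) can also be approximated in closed form. The sketch below is again our own approximation rather than the tool's Monte Carlo output, assuming uncorrelated unit-volatility models and normal Sharpe noise: kept skilled models supply the mean return, while every kept model adds variance.

    % Expected system Sharpe as a function of the chosen cutoff c.
    S  = 0.75;  nS = 150;  nU = 850;   % skilled Sharpe and model counts
    T  = 2500;  se = sqrt(250 / T);    % training points -> Sharpe noise
    c  = -0.5:0.01:1.5;                % candidate cutoff values

    keptS = nS * (1 - normcdf((c - S) / se));   % expected skilled kept
    keptU = nU * (1 - normcdf(c / se));         % expected unskilled kept
    sysSharpe = keptS * S ./ sqrt(keptS + keptU);

    plot(c, sysSharpe);
    xlabel('Chosen cutoff'); ylabel('Expected system Sharpe');

For T = 25,000 this curve is flat near √150 × 0.75 ≈ 9.2 for any cutoff between the clusters and, with the cutoff pushed far to the left, reproduces the naïve Sharpe of 150 × 0.75/√1000 ≈ 3.56. For T = 2500 it peaks near a cutoff of 0.5 at about 6.8, matching Figure 6 below.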

FIGURE 4: SHARPE RATIOS OF ALL 1000 MODELS BASED ON TRAINING DATA, SCENARIO 2

Figure 5 shows the results, and they are quite striking. The skilled population, shown in green, actually extends quite far to the left of the cutoff; the unskilled population, shown in blue, similarly extends to the right of it. Our choice of cutoff has resulted in keeping 217 models, of which 127 (58.5%) are skilled and 90 (41.5%) are unskilled. This means that we also, unfortunately, discarded 23 skilled models, but the only way to have retained those would have been to move the cutoff farther to the left, ensuring the inclusion of even more unskilled models.

FIGURE 5: UNVEILED RESULTS, SCENARIO 2

Figure 6 shows how well we would do, on average, for each chosen cutoff value. The optimal value (i.e., the one that maximizes the expected Sharpe ratio) is around 0.5, and it provides a Sharpe of about 6.8. This is a critical result: our limited information prevents us from achieving the maximum theoretical Sharpe ratio of 9.2. Our reliance on past performance to separate skilled from unskilled models means that our discriminatory power is limited by the quantity of data. It is not just a matter of making a better choice of cutoff; no value can separate the two populations well enough to obtain a Sharpe greater than 6.8.

FIGURE 6: EXPECTED SHARPE VERSUS CHOSEN CUTOFF, SCENARIO 2

Figure 7 shows the evolution of profit for both the ideal and chosen systems. The time scales are sufficiently long that the undulations around the average rate of profit are not visible, but the effect of the necessary model fitting/selection is still clear. The solid blue line shows the performance of the group of skilled models during the training period, while the dashed blue line shows their performance during a subsequent evaluation period. These two lines have exactly the same slope, as expected. The green line shows the performance of the chosen models during the training period. In this case, we chose models that performed well, and the effect is that the Sharpe ratio of the green curve is about 10.7, above both the ideal Sharpe of 9.2 and the achievable Sharpe of 6.8. Because we are choosing the outperformers, and because some of this outperformance is simply random luck, our in-sample results overstate our expected performance going forward.

The red line shows the performance of our chosen models during the evaluation period. The Sharpe ratio of this curve is 6.4, just slightly less than the maximum achievable value of 6.8; had we chosen a less-optimal cutoff, this Sharpe could be substantially lower. In any case, there is a clear kink where the green and red lines intersect, and this kink reflects the unavoidable uncertainty of choosing models based on limited and noisy results.

FIGURE 7: PROFIT VERSUS TIME (FOR BOTH TRAINING AND EVALUATION PERIODS) OF THE SKILLED MODEL SET AND THE CHOSEN MODEL SET, SCENARIO 2

The two prior examples show how the ability to choose skillful models, and to separate them from random but lucky impostors, degrades when the quantity of data is limited. Let's now turn to examples that better reflect real-world constraints; this first requires a discussion of the various input quantities.

The easiest input to estimate is the number of data points available for evaluating the models. Most systems will have holding periods on the order of a day or longer, so intra-day sampling is not relevant; we can justifiably assume that performance sampling is daily, which provides roughly 250 samples per year per market. Next, although many futures markets have existed for decades (e.g., grains), markets have changed significantly over time. One of the biggest changes was the transition to electronic trading, which began in the early to mid 2000s. If we assume that market dynamics prior to this period may no longer be representative of the current state of affairs, and thus restrict ourselves to this more recent epoch, we have no more than fifteen years' worth of data per market. Furthermore, we likely want to retain some data for out-of-sample testing. Let's thus assume we have 12 years of data, or 3000 points, per market available for training.

Finally, with respect to data quantity, it is important to note that including more markets does not change the number of samples available to us if we are looking at overall system performance. In theory, we could lengthen the data set by treating each daily return of each market as a separate data point rather than aggregating performance across markets for each day. However, we would then be trading a larger sample size (which reduces the random spread of performance) for a hunt for lower per-market Sharpe ratios (because it is the daily aggregation of per-market returns that generally improves Sharpe values so substantially). Since our aim is to build a system that trades identically across markets, we will assume that we have 3000 training points and then adjust our expectations for model performance based on a full, multi-market trading system.

To that end, we now turn to a discussion of realistic expectations for model quality and quantity. Our best insight comes from examining the returns of various CTA styles, and their similarities and differences, over the past decade or more. Generally, the best long-term Sharpe ratio has been right around 1. In addition, looking at the different return drivers common among CTAs (this can be done using principal component analysis), the data suggest that a small, finite number of skilled models is being used. This clearly doesn't prove that a larger number doesn't exist, but it certainly suggests that only a limited number of long-lived, generic patterns can be extracted from the available data.
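As an aside, the principal-component step of such an analysis is straightforward to sketch. The example below is entirely synthetic and ours alone: it fabricates 20 hypothetical track records from 5 shared return streams plus noise, and shows how the eigenvalue spectrum reveals the number of independent drivers (pca and zscore require the Statistics and Machine Learning Toolbox).

    % Synthetic illustration: 20 manager track records driven by 5 shared
    % return streams plus idiosyncratic noise (all values hypothetical).
    nDays = 3000;  nMgrs = 20;  nTrue = 5;
    drivers = randn(nDays, nTrue);                  % independent return drivers
    weights = randn(nTrue, nMgrs);                  % manager loadings
    returns = drivers * weights + 2 * randn(nDays, nMgrs);  % add noise

    [~, ~, latent] = pca(zscore(returns));          % component variances
    explained = 100 * cumsum(latent) / sum(latent);
    plot(explained, '-o');                          % elbow near nTrue components
    xlabel('Number of components');
    ylabel('Cumulative variance explained (%)');

An eigenvalue spectrum dominated by its first few components, analogous to the elbow this toy example produces near five, is what the preceding paragraph means by a small number of independent return drivers.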
For this reason, it makes the most sense to assume that our choice of Sharpe value in the tool should apply to the overall system and not to each skilled model. What should this Sharpe value be? Again, our extensive work on model identification suggests that, if the only constraint is to build the highest-Sharpe system, and if we can include models with limited capacity due to high trading frequencies, a Sharpe of 2 is probably an upper bound. If that sounds too optimistic, keep in mind that it is likely larger than what can practically be obtained given the limited size of the data set. In terms of the probability of finding a skilled model, common sense suggests that this value should be far less than 1. Complicating the discussion, we have found that the likelihood of finding a skilled model is not vanishingly low, but more often than not the skilled models themselves are similar; in other words, many of the skilled models are exploiting the same dynamic or a combination of a few dynamics. This complexity is not provided for in our simple tool, so we need to find a way to approximate its effect.

From prior research, we estimate that there are roughly 5 independent return streams that can be extracted from the available data. Also based on experience, we have found that a reasonable search space of 500 models holds, on average, one example of each return stream. Including 1000 models would result in finding (on average) two of each return stream, which would simply mean unearthing duplicates. Because our tool is too simple to accommodate duplicates in a statistically correct manner, we will run the simulation using 500 models and a probability of 0.01 of finding a skilled model, providing one instance of each independent return driver. Finally, given a maximum possible Sharpe of 2 and 5 independent models, the Sharpe ratio of each skilled model will be about 2/√5 ≈ 0.9.

PARAMETER                      VALUE
NUMBER OF MODELS               500
PROBABILITY OF SKILLED MODEL   0.01
SHARPE OPTION                  OVERALL
SHARPE PER MODEL               2.0
NUMBER OF TRAINING POINTS      3,000

TABLE 3: SUMMARY OF PARAMETERS FOR SCENARIO 3

Table 3 recaps the parameter settings for our third experiment, while Figure 8 shows the histogram of Sharpe ratios from the training data. As in our prior example, there is no clear dividing line between the two distributions, only a slight fatness in the right tail of the single visible distribution. Based on that asymmetry, we choose a cutoff close to 0.75.

FIGURE 8: SHARPE RATIOS OF ALL 500 MODELS BASED ON TRAINING DATA, SCENARIO 3

Figure 9 shows the results and indicates that we did pick up a number of the skilled models, but we also included some unskilled models while rejecting some of the skilled ones. Specifically, we kept 8 models, 3 of which (37.5%) are skilled and 5 of which (62.5%) are unskilled.

FIGURE 9: UNVEILED RESULTS, SCENARIO 3

Figure 10 shows the optimal cutoff, which is indeed at about 0.75 (meaning our guess was quite lucky). Given a naïve Sharpe of 0.20 and an ideal Sharpe of 2, the best choice of cutoff gives us an achievable Sharpe of 1.3.

FIGURE 10: EXPECTED SHARPE VERSUS CHOSEN CUTOFF, SCENARIO 3
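Plugging Scenario 3's parameters into the same closed-form sketch introduced earlier reproduces these numbers reasonably well (again our approximation, not output from the tool):

    % Scenario 3 through the closed-form approximation.
    S  = 2 / sqrt(5);                 % per-model Sharpe of a skilled model
    nS = 5;  nU = 495;                % skilled / unskilled model counts
    se = sqrt(250 / 3000);            % Sharpe noise with 3000 points
    c  = 0.75;                        % the chosen cutoff

    keptS = nS * (1 - normcdf((c - S) / se));   % ~3.5 skilled kept
    keptU = nU * (1 - normcdf(c / se));         % ~2.3 unskilled kept
    naive = nS * S / sqrt(nS + nU)              % ~0.20, the naive Sharpe
    best  = keptS * S / sqrt(keptS + keptU)     % ~1.3, achievable Sharpe

The expected 3.5 skilled and 2.3 unskilled survivors compare with the 3 and 5 actually kept in this run; with so few skilled models in the pool, single-run variation around the expectation is large.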

Figure 11 shows the in-sample and out-of-sample profit curves both for our model choices and for the ideal case where we keep only the skilled models. The solid blue and dashed blue lines show the ideal system for the training period and a subsequent evaluation period; because that system never changes, its profit curve consistently reflects a Sharpe ratio of 2. Our chosen models exhibit a Sharpe of about 2.8 for the training period but realize a Sharpe of 1.0 for the evaluation period, which reveals the optimistic bias that can arise when evaluating models on training data alone. Despite having to approximate a method for dealing with the real-world complexity of potentially uncovering correlated, skilled models, the result from our simple tool is in line with what the best (generally multi-strategy) systems have done over long time periods.

FIGURE 11: PROFIT VERSUS TIME (FOR BOTH TRAINING AND EVALUATION PERIODS) OF THE SKILLED MODEL SET AND THE CHOSEN MODEL SET, SCENARIO 3

It is important to remember that one doesn't know the true probabilities when building trading models. The optimal cutoff is thus not known and is unlikely to be chosen exactly. Given this unavoidable lack of information, one might expect the optimally-realizable Sharpe of 1.3 to become a Sharpe closer to 1 (e.g., by choosing a cutoff of 0.6 instead of 0.75). In this particular case, we would thus realize only about 50% efficiency in extracting skillful models (i.e., an ideal Sharpe of 2 but a realized Sharpe of 1).

Let's finally consider the case where we wish to extract models on a per-sector basis. Since we can decompose a diversified market set into four primary sectors (equities, bonds, currencies, and commodities), and since dynamics within each sector are usually highly correlated, we can simply assume we have four equally profitable but independent profit sources. Based on our previous estimate of an overall potential Sharpe of 2, each of the four sectors would in this case have a potential Sharpe of 1. However, our goal in sector-specific model construction is to improve on a pan-market approach, so let's assume that we believe a Sharpe of 1.5 can be achieved per sector, allowing a full-system Sharpe of 3 (again via square-root diversification: 1.5 × √4). The other parameters remain unchanged from the prior example, as we still have the same number of data points available and we still expect the same density of skilled models within the overall population. Table 4 shows a parameter summary. Figure 12 shows the distribution, while Figure 13 unveils the true distributions along with our cutoff choice of about 0.7.

PARAMETER                      VALUE
NUMBER OF MODELS               500
PROBABILITY OF SKILLED MODEL   0.01
SHARPE OPTION                  OVERALL
SHARPE PER MODEL               1.5
NUMBER OF TRAINING POINTS      3,000

TABLE 4: SUMMARY OF PARAMETERS FOR SCENARIO 4

FIGURE 12: SHARPE RATIOS OF ALL 500 MODELS BASED ON TRAINING DATA, SCENARIO 4

In this case, we keep 5 models, but only 2 are skilled while 3 are not. Our in-sample Sharpe is about 1.9, but during the evaluation period it is far lower. Looking at Figure 14, we have in fact chosen well, as the best long-term expected Sharpe is about 0.65.

FIGURE 13: UNVEILED RESULTS, SCENARIO 4

FIGURE 14: EXPECTED SHARPE VERSUS CHOSEN CUTOFF, SCENARIO 4

Since this is a sector-level result, we would expect our full system (which trades four such sectors) to exhibit a Sharpe of twice this value, or around 1.3. Figure 15 shows the profits during the in-sample and evaluation periods; the kink between the green and red lines is now extremely pronounced. Nonetheless, even though we allowed the sectors to have better underlying model skill, our ability to extract it was sufficiently degraded that we could not improve on the overall Sharpe from the prior example, which required only one search instead of four sector-specific efforts.

FIGURE 15: PROFIT VERSUS TIME (FOR BOTH TRAINING AND EVALUATION PERIODS) OF THE SKILLED MODEL SET AND THE CHOSEN MODEL SET, SCENARIO 4

One could argue that the expected overall Sharpe should be higher, since per-sector tuning may allow one to create models that better exploit sector-specific dynamics. If this is not the case, however, the effect can be extremely detrimental. If we run a simulation where the sector-based Sharpe is equal to 1, implying that there is no sector-specific model structure, then the best we can do on a per-sector basis is to extract a Sharpe of 0.20, which equates to 0.4 for the overall system. This is a severe reduction from the value of 1.3 we were able to achieve by aggregating all of the available data before choosing models.

We briefly want to touch on the effect of having a larger model pool to consider. Naïvely, it makes some sense to consider a larger population of models. This, however, is true only under certain conditions. First, it depends on whether there is a finite, and rather small, number of independent and predictable alphas, or whether there is a large number that simply requires more creativity and effort to uncover. If the former is true, then the best one can do by extending the search space is to find duplicates of previously-uncovered predictors. At worst, if the additional searches are of low quality (e.g., exploring data sets with no known connection to futures markets), then diluting the percentage of skilled models in the population actually decreases the ability to extract the skill. If the number of alpha sources is instead large, then the combined performance histories of CTAs imply that many of these alpha sources must be fairly weak.

Otherwise, these alpha sources would manifest themselves in a much greater diversity of performance than what is seen when comparing and contrasting track records. And if they are weak, then separating them from random outcomes is that much more difficult: a quick test shows that, with 3000 training points, predictors with Sharpe ratios below 0.25 are nearly impossible to distinguish from noise (the sampling noise of the Sharpe estimate itself, roughly √(250/3000) ≈ 0.3, is larger than the signal being sought).

We have long advocated, and continue to pursue, idea-driven model building. Rather than blindly adopting a model whose return drivers are essentially unknown, idea-centric models generally make sense, and their performance can be attributed to a specific risk-seeking or risk-avoidance behavior by investors. Moreover, if the idea is sound, then any number of model implementations should produce a similar return stream, providing another test to guard against being fooled by randomness. No approach is foolproof; lacking theory, and likely lacking immutable laws of behavior, we can only peer into the past and attempt to find evidence of patterns that we can then only assume will persist into the unwritten future.

Michael Mundt is one of the founding partners of Revolution Capital Management, a leading short-term systematic investment manager headquartered in Denver, Colorado. He and his co-founding partner, Rob Olson, both earned master's and doctoral degrees in Aerospace Engineering from the University of Colorado and worked in the technology and defense sectors prior to launching Revolution. Michael and Rob continue to live in the Denver area with their families.