Linear, Machine Learning and Probabilistic Approaches for Time Series Analysis

Similar documents
Machine Learning Models for Sales Time Series Forecasting

XGBOOST AS A TIME-SERIES FORECASTING TOOL

Linear model to forecast sales from past data of Rossmann drug Store

Top-down Forecasting Using a CRM Database Gino Rooney Tom Bauer

1/27 MLR & KNN M48, MLR and KNN, More Simple Generalities Handout, KJ Ch5&Sec. 6.2&7.4, JWHT Sec. 3.5&6.1, HTF Sec. 2.3

Scalable Mining of Social Data using Stochastic Gradient Fisher Scoring. Jeon-Hyung Kang and Kristina Lerman USC Information Sciences Institute

Towards a Sustainable Food Supply Chain Powered by Artificial Intelligence

Statistics, Data Analysis, and Decision Modeling

Wargaming: Analyzing WoT gamers behavior. Nishith Pathak Wargaming BI 14/03/2018

Inference and computing with decomposable graphs

Data Mining. CS57300 Purdue University. March 27, 2018

ECONOMIC MODELLING & MACHINE LEARNING

Risk-Lab. Portable Monte-Carlo Simulation of MS Excel models RISK-LAB

Predictive Analytics

Prediction and Interpretation for Machine Learning Regression Methods

From Fraud Analytics Using Descriptive, Predictive, and Social Network Techniques. Full book available for purchase here.

Adaptive Time Series Forecasting of Energy Consumption using Optimized Cluster Analysis

Data Analysis Boot Camp

arxiv: v1 [cs.lg] 13 Oct 2016

Copyr i g ht 2012, SAS Ins titut e Inc. All rights res er ve d. ENTERPRISE MINER: ANALYTICAL MODEL DEVELOPMENT

Predicting Purchase Behavior of E-commerce Customer, One-stage or Two-stage?

Probabilistic load and price forecasting: Team TOLOLO

Generative Models for Networks and Applications to E-Commerce

Bank Card Usage Prediction Exploiting Geolocation Information

Jialu Yan, Tingting Gao, Yilin Wei Advised by Dr. German Creamer, PhD, CFA Dec. 11th, Forecasting Rossmann Store Sales Prediction

Progress Report: Predicting Which Recommended Content Users Click Stanley Jacob, Lingjie Kong

Masters in Business Statistics (MBS) /2015. Department of Mathematics Faculty of Engineering University of Moratuwa Moratuwa. Web:

Applying Regression Techniques For Predictive Analytics Paviya George Chemparathy

Correlation between chlorophyll-a and related environmental factors based on Copula in Chaohu Lake, China

Intro Logistic Regression Gradient Descent + SGD

Methods and Applications of Statistics in Business, Finance, and Management Science

IBM SPSS Statistics. Editions. Get the analytical power you need for better decision making. Why use IBM SPSS Statistics? IBM Analytics Solution Brief

Decision Support and Business Intelligence Systems

Machine learning in neuroscience

Netflix Optimization: A Confluence of Metrics, Algorithms, and Experimentation. CIKM 2013, UEO Workshop Caitlin Smallwood

CS 101C: Bayesian Analysis Convergence of Markov Chain Monte Carlo

Code Compulsory Module Credits Continuous Assignment

DATA ANALYTICS WITH R, EXCEL & TABLEAU

DECISION SCIENCES UNDERGRADUATE GRADUATE FACULTY COURSES. Bachelor's program. Combined programs. Master's programs. Explanation of Course Numbers

CREDIT RISK MODELLING Using SAS

Lab Rotation Report. Re-analysis of Molecular Features in Predicting Survival in Follicular Lymphoma

IMPROVING GROUNDWATER FLOW MODEL PREDICTION USING COMPLEMENTARY DATA-DRIVEN MODELS

Customer Relationship Management in marketing programs: A machine learning approach for decision. Fernanda Alcantara

THREE LEVEL HIERARCHICAL BAYESIAN ESTIMATION IN CONJOINT PROCESS

Stochastic simulation of photovoltaic electricity feed-in considering spatial correlation

Rmetrics FAQ. An Environment for Teaching Financial Engineering and Computational Finance with R Rmetrics Built Built 221.

Predictive Analytics Using Support Vector Machine

Sales Forecast for Rossmann Stores SUBMITTED BY: GROUP A-8

PREDICTION OF CONCRETE MIX COMPRESSIVE STRENGTH USING STATISTICAL LEARNING MODELS

SPM 8.2. Salford Predictive Modeler

Bayesian Monitoring of A Longitudinal Clinical Trial Using R2WinBUGS

Bayesian forecasting in economics

Genomic models in bayz

BIG DATA SKILLS: CHALLENGES FOR THE UNIVERSITY WORLD CREATING A NEW GENERATION OF DATA SCIENTISTS. Massimiliano Marcellino Bocconi University

Load Forecast Uncertainty Kick-Off Discussion. February 13, 2013 LOLEWG Item-3

Georgia Papacharalampous, Hristos Tyralis, and Demetris Koutsoyiannis

Online Technical Appendix: Key Word. Aggregations. Adaptive questions. Automobile primary and secondary. Automobiles. markets

Genomic Selection with Linear Models and Rank Aggregation

POST GRADUATE PROGRAM IN DATA SCIENCE & MACHINE LEARNING (PGPDM)

Ensemble Modeling. Toronto Data Mining Forum November 2017 Helen Ngo

Department of Economics, University of Michigan, Ann Arbor, MI

Active Learning for Conjoint Analysis

Practical Application of Predictive Analytics Michael Porter

Seminar Master Major Financial Economics : Quantitative Methods in Finance

Who Is Likely to Succeed: Predictive Modeling of the Journey from H-1B to Permanent US Work Visa


TDWI Analytics Fundamentals. Course Outline. Module One: Concepts of Analytics

Choosing Optimal Designs for Pair-Wise Metric Conjoint Experiments

Choosing the Right Type of Forecasting Model: Introduction Statistics, Econometrics, and Forecasting Concept of Forecast Accuracy: Compared to What?

Short-term Electricity Price Forecast of Electricity Market Based on E-BLSTM Model

2016 INFORMS International The Analytics Tool Kit: A Case Study with JMP Pro

Audience Reach Projection for the 2010 FIFA World Cup in South Africa

Olin Business School Master of Science in Customer Analytics (MSCA) Curriculum Academic Year. List of Courses by Semester

Building the In-Demand Skills for Analytics and Data Science Course Outline

Learning User Real-Time Intent for Optimal Dynamic Webpage Transformation

Cycle Time Forecasting. Fast #NoEstimate Forecasting

Predicting Reddit Post Popularity Via Initial Commentary by Andrei Terentiev and Alanna Tempest

Using AI to Make Predictions on Stock Market

Quantifying Uncertainty in Oil Reserves Estimate

Spatial Demand Estimation: Moving Towards Real-Time Distribution System Network Modeling

Describing DSTs Analytics techniques

NEW CURRICULUM WEF 2013 M. SC/ PG DIPLOMA IN OPERATIONAL RESEARCH

Course # January 10 March 20, Room GEN-S202 UC SF. Department of. 1/10/2012 N. Schuff 1/5 MI Square University of of California

Exploiting full potential of predictive analytics on small data to drive business outcomes

Propagating Uncertainty in Multi-Stage Bayesian Convolutional Neural Networks with Application to Pulmonary Nodule Detection

FORECASTING THE GROWTH OF IMPORTS IN KENYA USING ECONOMETRIC MODELS

Risk Modeling School. Operational Risk Modeling Course Curriculum & Details

Uncertainty in transport models. IDA workshop 7th May 2014

Non-Operational Stockpile Reliability Prediction Methods Using Logistic Regression Techniques

MATLAB Seminar for South African Reserve Bank

Mathematical Forecasting of the Tourism Activity Importance in Chinese Economy based on Holt s Exponential Smoothing Model Juxing Wang1, a

Cluster-based Forecasting for Laboratory samples

Big Data. Methodological issues in using Big Data for Official Statistics

Predictive Modeling using SAS. Principles and Best Practices CAROLYN OLSEN & DANIEL FUHRMANN

A Data-driven Prognostic Model for

Prospects for Economics in the Machine Learning and Big Data Era

Copyright 2013, SAS Institute Inc. All rights reserved.

FACTOR-AUGMENTED VAR MODEL FOR IMPULSE RESPONSE ANALYSIS

Advanced analytics at your hands

PREDICTION OF PIPE PERFORMANCE WITH MACHINE LEARNING USING R

Transcription:

The 1 th IEEE International Conference on Data Stream Mining & Processing 23-27 August 2016, Lviv, Ukraine Linear, Machine Learning and Probabilistic Approaches for Time Series Analysis B.M.Pavlyshenko Ivan Franko National University of Lviv, e-mail: b.pavlyshenko@gmail.com In this paper we study different approaches for time series modeling. The forecasting approaches using linear models, ARIMA alpgorithm, XGBoost machine learning algorithm are described. Results of different model combinations are shown. For probabilistic modeling the approaches using copulas and Bayesian inference are considered. Keywords time series; ARIMA; forecasting; predictive analytics. I. INTRODUCTION Time series analysis, especially forecasting, is an important problem of modern predictive analytics. The goal of this study is to consider different aproaches for time series modeling. For our analysis, we used stores sales historical data from Kaggle competition Rossmann Store Sales [1]. These data represent the sales time series of Rossmann stores. For time series forecasting such approaches as linear models and ARIMA algorithm are widely used. Machine lerning algorithm make it possible to find patterns in the time series. Sometimes we need to forecast not only more probable values of sales but also their distribution. Especially we need it in the risk analysis for assessing different risks related to sales dynamics. In this case, we need to take into account sales distributions and dependencies between sales time series features (e.g. day of week, month, average sales, etc.) and external factors such as promo, distance to competitors, etc. One can consider sales as a stochastic variable with some marginal distributions. If we have sales distribution, we can calculate value at risk (VaR) which is one of risk assessment features. In probabilistic analysis of sales, we can use copulas which allows us to analyze the dependence between sales and different factors. To find distributions of model parameters Bayesian inference approach can be used. II. LINEAR MODELS AND MACHINE LEARNING APPROACHES For our analysis, we used stores sales historical data. To compare different forecasting approaches we used two last months of the historical data as validation data for accuracy scoring using root mean squared error (RMSE). For the comparison, we used the following methods: ARIMA using R package forecast [2], linear regression with LASSO regularization using R package lars [3], conditional inference trees with linear regression on the leaf using mob() function from R package party [4], gradient boosting XGBoost model using R package xgboost [5]. Package xgboost (short term for extreme Gradient Boosting) is an efficient and scalable implementation of gradient boosting framework [6,7]. The package includes efficient linear model solver and tree learning algorithm. We also used combined approaches such as linear blending ARIMA and gradient boosting model, stacking with the use of linear regression on the first step and gradient boosting on the second step. We used two ways of classifications the first way is based on the time series approach and the second one is based on the independent and identically distributed variables. We consider sales in the natural logarithmic scale. Figure 1 shows typical time series of store sales. Figure 1. Typical time series of store sales. Figure 2 shows the example of time series forecasting by different methods with RMSE error metric. Let us consider the case of time series forecasting using linear blending of ARIMA and XGBoost models. For arbitrary chosen store (Store 285) we received RMSE=0.11 for ARIMA model, RMSE=0.107 for XGBoost model and RMSE=0.093 for linear blending of ARIMA and XGBoost models. Let us consider the case of using stacking with linear regression on the first step and xgboost on the second step. For arbitrary chosen store (Store 95) we received RMSE=0.122 for XGBoost model and RMSE=0.117 for stacking model.

We also studied the case of time series forecasting using XGBoost model with time series approach and xgboost model based on independent and identically distributed variables. For arbitrary chosen store (Store 95) we received RMSE=0.138 for XGBoost model with time series approach and RMSE=0.118 for XGBoost model with i.i.d approach. The obtained results show that for different stores the best accuracy is released by different approaches. For each type of time series, we may develop an optimized approach which can be based on the combination of different predictive models. Figure 2. Time series forecastings by different methods. III. COPULAS APPROACH FOR MODELING A copula is a multivariate probability distribution for which the marginal probability distribution of each variable is uniform. Copulas are used to describe the dependence between random variables. Sklar's Theorem states that any multivariate joint distribution can be written in terms of univariate marginal distribution functions and a copula, which describes the dependence structure between the variables. The copula contains all information on the dependence structure between the variables, whereas the marginal cumulative distribution functions contain all information on the marginal distributions. For the case study, we use the same sales time series, which represent sales in the stores network. We used copula R package [8] for modeling. We consider sales in the natural logarithmic scale. For our analysis, we take such features as sales (variable logsales), previous day sales (variable prevlogsales), number of customers that visited a store (variable Customers). First of all we take one sales time series for one arbitrary store. Marginal distributions and dependencies with correlation coefficient are shown on the figure 3. On the figures 4,5 the pseudo observations of investigated features are shown. These figures represent stochastic dependencies of investigated variable. Our next objective is to find such copulas, which will approximate these dependencies. We chose t-copula for modeling. Using maximum likelihood method, one can find fitting parameters for copula. The probability density function for calculated fitted t-copula is shown on the figure 6. Figure 3. Marginal distributions and correlation coefficient.

distribution function (CDF) to each dimensional variable of fitted copula. To construct multivariate distribution of dependent variables, we chose gamma distribution for marginal distributions of logsales and Customers variables. Having fitted copula and finding parameters for these gamma distributions from historical data, we generate pseudo-random samples with the probability density function (PDF), shown on the figure 7. Figure 4. Pseudo observations for logsales and prevlogsales variables. Figure 7. The PDF of generated pseudo-random samples using fitted t-copula and marginal distributions. Figure 5. Pseudo observations for logsales and Customers variables. If we need to analyze multivariate dependences with more than two variables, it is effective to use vine copulas, which enable us to construct complex multivariate copula using bivariate ones. For studying vine copula, we used CDVine R package [9]. Let us consider such features of sales time series as sales (variable logsales), mean sales per day for store (variable meanlogsales) and promo action (variable Promo). In this case, we analyze sales for stores chain. To analyze the stochastic dependence we used canonical vine copula. First tree for fitted canonical vine copula with Kendall s tau values is shown on the figure 8. Figure 6. The probability density function for fitted t-copula. Having fitted copula and marginal distributions, we can generate pseudo-random samples of investigated variables by applying inverse marginal comulative Figure 8. The tree for fitted canonical vine copula with Kendall s tau values.

We chose a t-copula for the logsales-promo dependency and a normal for the logsalesmeanlogsales dependency. The pseudo observations for constructed canonical vine copula are shown on the figure 9 in the dimension of logsales and meanlogsales. The trace plot demonstrates the stationary process, which means good convergence and sufficient burn-in period in the MCMC algorithm. The similar trace plots were received for other coefficients in the linear regression. The density of distributions of some regression coefficients for chosen arbitrary store are shown on the figure 11. Figure 9. Pseudo-observations for constructed canonical vine copula. As the case study shows, the use of copula make it possible to model stochastic dependencies between different factors of sales time series separately from their marginal distributions. This can be considered as an additional approach in the sales time series analysis. Figure 11. Density of distributions of regression coefficients. The figure 12 shows the examples for box plots for some regression coefficients IV. BAYESIAN INFERENCE For Bayesian inference case study, we take such features as promo, seasonality factors (week day, month day, month of year). As well as in the previous studies, we consider sales in the natural logarithmic scale. As the example we take one sales time series for one arbitrary store. For Bayesian inference, we used Markov Chain Monte Carlo (MCMC) algorithm from MCMCpack R package [10]. For time series modeling, we used the linear regression with Gaussian errors. Trace plots of samples vs the simulation index can be very useful in assessing convergence. The trace plot for promo coefficient is shown on the figure 10. Figure 12. Box plots for regression coefficients. Figure 10. Trace plot for promo coefficient. Sales time series can have outliers and it is important to take into account this fact using heavy tails distributions instead of Gaussian distribution. For the linear regression with variables with different type of distributions we used Bayesian hierarchical model. We conducted the case study using JAGS sampler [11] software with rjags R package. For modeling, we take into account mean sales for the store, sales, and promo. We consider sales and mean sales for the store in the natural logarithmic scale. For mean sales for stores, we used Gaussian distribution, for sales Student distribution, and for promo Bernoulli

distribution. In this model, we consider sales as an independent and identically distributed random variable without separating sales for different stores. Info about the store is represented by mean sales for the store values. Mean sales for the store (variable meanlogsales) vs sales (variable logsales) obtained for considered Bayesian model are shown on the figure 13. Figure 13. Mean sales for the store (variable meanlogsales) vs sales (variable logsales) obtained for fitted Bayesian model. As the case study shows, the use of Bayesian approach allows us to model stochastic dependencies between different factors of sales time series and receive the distributions for model parameters. Such an approach can be useful for assessing different risks related to sales dynamics. V. CONCLUSION In our cases study we showed different approaches for time series modeling. Forecasting with using linear models, ARIMA algorithm, xgboost machine learning algorithm are described. The results of different models combinations are shown. For probabilistic modeling, the approach with using copulas is shown. The Bayesian inference was applied for time series linear regression case. For time series forecasting the different models combinations technics can give better RMSE accuracy comparing to single algorithms. The probabilistic approach for time series modeling is important in the risk assessment problems. The copula approach gives one the ability to model probabilistic dependence between target values and extreme factors which is useful when a target variable has non Gausian probability density function with heavy tails. Bayesian models can be used to find distributions of coefficients in the linear model of time series. Having model parameters distributions one can find the distribution of target values using Monte-Carlo approaches. For each type of time series, one can develop an optimized approach, which can be based on the combination of different predictive models. REFERENCES [1] Rossmann Store Sales, Kaggle.Com, URL: http://www.kaggle.com/c/rossmann-storesales. [2] Hyndman, R., & Khandakar, Y. (2008). Automatic Time Series Forecasting: The forecast Package for R. Journal of Statistical Software, 27(3), pp.1 22, 2008 [3] Efron, Bradley, Trevor Hastie, Iain Johnstone, and Robert Tibshirani. "Least angle regression." The Annals of statistics 32, no. 2,pp.407-499, 2004 [4] Zeileis, Achim, Torsten Hothorn, and Kurt Hornik. "Model-based recursive partitioning." Journal of Computational and Graphical Statistics 17.2, pp.492-514, 2008 [5] Chen, Tianqi, and Carlos Guestrin. "XGBoost: A Scalable Tree Boosting System." arxiv preprint arxiv:1603.02754 (2016). [6] J. Friedman. Greedy function approximation: a gradient boosting machine., Annals of Statistics, 29(5):1189 1232, 2001. [7] J. Friedman. Stochastic gradient boosting., Computational Statistics & Data Analysis, 38(4):367 378, 2002. [8] Yan, Jun. "Enjoy the joy of copulas: with a package copula." Journal of Statistical Software, 21.4, pp.1-21, 2007 [9] Brechmann, Eike Christian, and Ulf Schepsmeier. "Modeling dependence with C- and D-vine copulas: The R-package CDVine." Journal of Statistical Software, 52.3, pp. 1-27, 2013 [10] Quinn, Kevin M., Andrew D. Martin, and Jong Hee Park. "MCMCpack: Markov chain Monte Carlo in R." Journal of Statistical Software 10.II (2011). [11] Martyn Plummer. JAGS Version 3.4.0 user manual. URL:http://sourceforge.net/projects/mcmcjags/files/Manuals/3.x/jags_user_manual.pdf