Day-ahead Financial Loss/Gain Modeling and Prediction for a Generation Company


A. Doostmohammadi, Student Member, IEEE, N. Amjady, Senior Member, IEEE, and H. Zareipour, Senior Member, IEEE

(A. Doostmohammadi and N. Amjady are with the Department of Electrical Engineering, Semnan University, Semnan, Iran; e-mail: adoostmohamadi@students.semnan.ac.ir; amjady@semnan.ac.ir. H. Zareipour is with the Department of Electrical and Computer Engineering, Schulich School of Engineering, University of Calgary, Calgary, Alberta, Canada; e-mail: hzareipour@ucalgary.ca.)

Abstract--In an electricity market, the main goal of a generation company (GenCo) is to maximize its profit while encountering the uncertainty of the electricity price forecast. Different risk measures have been proposed to cope with this source of uncertainty. However, those are usually before-the-fact performance indices and cannot give a measure of the financial loss/gain (FLG) of a GenCo considering the electricity prices actually realized in the market. This paper focuses on this matter. The time series of FLG is first constructed given the real conditions of the electricity market. Then, the FLG time series is quantized using the Silhouette criterion and the k-means clustering approach. Subsequently, based on the historical values of the quantized FLG time series and relevant exogenous variables, its day-ahead values are predicted. The method proposed for day-ahead FLG prediction consists of conditional mutual information (CMI) and sequential forward search (SFS) as the feature selection technique and extreme learning machine (ELM) as the forecasting engine. The effectiveness of the whole proposed approach, including the FLG time series construction, the quantization approach, and the prediction method, is shown for a typical GenCo using real data of the PJM and Ontario electricity markets.

Index Terms--Generation company (GenCo), electricity price forecast, financial loss/gain (FLG), quantization, feature selection, extreme learning machine

LIST OF ABBREVIATIONS
ARIMA - Autoregressive integrated moving average
CFS - Correlation-based feature selection
CMI - Conditional mutual information
CMIM - Conditional mutual information maximization
DISR - Double input symmetrical relevance
ELM - Extreme learning machine
FCBF - Fast correlation-based feature selection
FLG - Financial loss/gain
IG - Information gain
IMP - Improved monthly profit
MAPE - Mean absolute percentage error
MI - Mutual information
MLP - Multi-layer perceptron
MP - Monthly profit
MRMR - Maximum relevancy minimum redundancy
SFS - Sequential forward search
TE - Test error
VE - Validation error

I. INTRODUCTION

A. Background and Motivation

Many studies in recent years have focused on providing efficient bidding strategies for GenCos to obtain maximum profit in electricity markets. For participating in a day-ahead forward market, a GenCo needs short-term price forecasts, and numerous forecasting methods have been presented for this purpose; a review of electricity price forecast methods can be found in [1]. Despite the research performed in the area, price forecasts still have considerable error in real-world electricity markets due to the volatility of electricity price signals. For instance, different short-term price forecast methods have been evaluated in [2], where it is concluded that the price prediction error ranges approximately from 5% to 36% depending on the forecasting methods and electricity markets considered. Price forecast errors can have a significant impact on a GenCo's profit, as they directly affect its self-scheduling results [3]. For this reason, some recent research works have studied this matter. In [4], the expected loss of profit for a GenCo, due to price forecast inaccuracy, is evaluated in a price-based unit commitment problem, and the relation between the loss of profit and the mean absolute percentage error (MAPE) of a specific price forecast is analyzed for four different types of power plants. In [5], exploiting imperfect electricity prices for scheduling the short-term operation of demand-side market participants is analyzed. An index called forecast inaccuracy economic impact is presented based on the costs obtained from perfect prices, i.e., with no forecast errors, and predicted prices, i.e., with forecast errors. This index is calculated for two industrial loads with different characteristics using Ontario electricity market prices, and it is concluded that the well-known error measure of MAPE may not completely evaluate the economic value of various price forecast methods. The research work of [5] is extended to GenCos in [6], where the economic impact of four different price forecast methods on the self-scheduling performance of hydro and thermal-based GenCos is evaluated through two profit-based indices. The findings of [6] are in line with [5] in the sense that the conventional error measures may not be efficient for measuring the economic value of different electricity price forecasts. The economic benefit of electricity price forecasts for demand-side participants is analyzed in [7] using the day-ahead scheduling of load-shifting industrial plants. Rank correlation is also suggested there, due to its capability of capturing the similarity between actual and predicted prices, in order to assess the economic benefits. Additionally, it has been shown that the economic benefits greatly depend on the volatility of electricity prices.

The objective of this research work is to present a new financial loss/gain (FLG) time series, which captures the effect of imperfect price forecasts on the profit of a GenCo. The significance of FLG as an effective tool to guide the bidding strategy of GenCos to maximize their realized profits (the differences between the expected profits based on the price forecast and the financial loss/gain due to errors in the price forecasts) is shown. Additionally, a new forecast strategy is specifically designed for FLG prediction.

Electricity price is an important signal for a GenCo as a profit-based company. Price forecast methods can provide an outlook of the day-ahead market for a GenCo. However, because of price forecasting errors, the expected profit of a GenCo is not realized in the market. To address this issue, we propose a profit-based time series (i.e., the proposed FLG), which is more efficient than the previous price-based measures. While price forecast-based analyses and indices cannot be used directly to adjust the bidding strategy, FLG can directly improve the bidding strategy, since it measures the real financial loss or gain of a GenCo. Most of the energy trading of an electricity market is performed in the day-ahead segment, and thus most of the FLG of a GenCo is related to the day-ahead market. For this reason, this paper focuses on FLG prediction in the day-ahead market. The proposed approach can be adapted to the real-time market, but that will need to be carried out in a separate paper.

B. Contributions and Paper Organization

The main contributions of this paper can be summarized as follows:
1) A novel profit-based time series to measure the financial loss/gain (FLG) of a GenCo is presented and modeled as an hourly time series.
2) The FLG signal is quantized using the Silhouette criterion and the k-means clustering technique to simplify the prediction process and to provide more manageable information for GenCos.
3) Since many candidate inputs affect the FLG signal in a nonlinear and interacting way, a new nonlinear information-theoretic-based feature selection technique, composed of conditional mutual information (CMI) and sequential forward search (SFS), is presented to select a minimum subset of the most informative features for the FLG forecast process.
4) By combining the proposed quantization approach (i.e., Silhouette criterion + k-means clustering technique), the proposed feature selection method (i.e., CMI+SFS), and the extreme learning machine (ELM) classifiers adopted as the forecasting engine, a new forecast strategy is presented for FLG prediction.
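The quantization step of contribution 2 (detailed in Section III.A) can be sketched as follows. This is a minimal illustration, not the paper's implementation: it assumes scikit-learn's KMeans and silhouette_score, the function name and NC_max default are hypothetical, and the cutting points follow the midpoint rule described later in Section III.A.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def quantize_flg(flg, nc_max=10, seed=0):
    """Sweep the number of classes NC from 2 to NC_max, cluster the
    hourly FLG samples with k-means, and keep the NC whose mean
    Silhouette value is highest (Steps 1-4 of Section III.A)."""
    x = np.asarray(flg, dtype=float).reshape(-1, 1)
    best = (-np.inf, None, None)                       # (SC_NC, NC, labels)
    for nc in range(2, nc_max + 1):
        labels = KMeans(n_clusters=nc, n_init=10,
                        random_state=seed).fit_predict(x)
        sc = silhouette_score(x, labels)               # average s(i)
        if sc > best[0]:
            best = (sc, nc, labels)
    sc, nc, labels = best
    # Cutting point between two successive classes: the average of the
    # upper bound of the lower class and the lower bound of the upper
    # class, so the classes cover the whole FLG domain without gaps.
    bounds = sorted((x[labels == c].min(), x[labels == c].max())
                    for c in range(nc))
    cuts = [(lo_hi[1] + next_lo_hi[0]) / 2
            for lo_hi, next_lo_hi in zip(bounds, bounds[1:])]
    return nc, cuts, labels
```

On synthetic data with two well-separated FLG regimes, the sweep settles on two classes with a single cutting point between them.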
With the aid of these predictions, a GenCo can improve its decisions and adopt a better bidding strategy. In fact, both the introduction of the FLG time series and the techniques proposed to forecast it are the main contributions of this paper. Neither the FLG time series (i.e., the first contribution) nor the above-mentioned techniques (i.e., the second, third and fourth contributions) have been presented in the literature before. Although the proposed quantization and feature selection techniques, introduced in contributions 2 and 3 above, have been applied to FLG prediction in this paper, they can also be used separately from the FLG prediction strategy (e.g., in data mining and data representation analysis). Additionally, these techniques can be used for other forecast processes involving highly volatile time series. In other words, these two techniques have their own novelty and applications, and for this reason they are presented as separate contributions here.

The rest of this paper is organized as follows. In the second section, the FLG signal is introduced and modeled as a time series. The proposed forecast strategy to predict the future values of FLG is presented in Section III; the proposed FLG quantization approach and the information-theoretic-based feature selection technique are also detailed in this section. The numerical results obtained for FLG prediction are presented and discussed in Section IV. Section V concludes the paper.

[Fig. 1. Schematic overview of the proposed approach: a price forecast module (fed by historical price forecasts, historical price forecast errors, and price forecasts of future hours) supplies the day-ahead price forecast to the self-scheduling module, which, together with unit data, produces the unit bids (generation offer); the FLG prediction strategy supplies the day-ahead FLG prediction to the unit bid adaption block, which produces the improved unit bids (improved generation offer).]

II. CONSTRUCTION OF THE FLG TIME SERIES

A schematic overview of the proposed approach is illustrated in Fig. 1. As shown in this figure, a self-scheduling module prepares unit bids, or the generation offer, based on price forecasts and unit data (e.g., the marginal cost of units and the capacity of units). This part of the figure, illustrated in black, depicts a conventional bidding strategy. The red part of the figure, including the FLG prediction strategy and unit bid adaptation, indicates what has been added to the conventional bidding strategy in this research work. The unit bid adaption block improves unit bids based on the day-ahead FLG prediction. In this way, the FLG prediction can be used to adapt the bidding strategy of a GenCo to attain higher profit in the financial gain cases and avoid loss in the financial loss cases. In the following, the self-scheduling problem of GenCos is first reviewed briefly and, based on it, the FLG signal is formulated. Then, the FLG signal is modeled as a time series and its important characteristics are analyzed.

A. Self-scheduling and Financial Loss/Gain Modeling

The self-scheduling problem of a GenCo can be briefly modeled as:

max OF = Σ_{t=1}^{T} Profit_t = Σ_{t=1}^{T} f(U_t, P_t, λ_t), subject to P_t, U_t ∈ Ω   (1)

where OF is the total profit of the GenCo over the planning horizon including T time intervals; Profit_t = f(U_t, P_t, λ_t) is the profit of the GenCo in time interval t; U_t, P_t, and λ_t represent the ON/OFF status of the GenCo's units (commitment variables), the generation of the GenCo's units (dispatch variables), and the market price in planning interval t; and Ω indicates the feasible region of the GenCo shaped by the operational constraints of the units, such as capacity limits, ramp up/down limits, and minimum up/down time constraints. Details of these constraints can be found in [8]. Since market prices are not available at the time of solving this problem, price forecasts are used instead. Hence, (1) is reformulated as:

max E[OF] = Σ_{t=1}^{T} E[Profit_t] = Σ_{t=1}^{T} E[f(U_t, P_t, λ_t)] = Σ_{t=1}^{T} f(U_t, P_t, λ_t^f), subject to P_t, U_t ∈ Ω   (2)

where λ_t^f is the forecasted price in time interval t and E[·] is the expected value operator. The last part of (2) is obtained by considering the electricity price as an uncertain variable with the price forecast as its expected value. Assuming that the solution of (2) is denoted by U_t^* and P_t^*, the expected profit gained by the GenCo, denoted by superscript exp, would be:

OF^exp = Σ_{t=1}^{T} Profit_t^exp = Σ_{t=1}^{T} f(U_t^*, P_t^*, λ_t^f)   (3)

However, the GenCo is paid based on the market clearing prices of the next day, i.e., the real after-the-fact prices, represented by λ_t^r, t = 1, ..., T. Thus, the real profit of the GenCo, denoted by superscript r, becomes:

OF^r = Σ_{t=1}^{T} Profit_t^r = Σ_{t=1}^{T} f(U_t^*, P_t^*, λ_t^r)   (4)

In (4), it is assumed that the GenCo is not allowed to modify its submitted bids and reschedule its units during the operating period. Also, it is assumed that the self-schedule of the GenCo is accepted in the market, in line with previous research works in the area of bidding strategy, such as [3], [4], [6] and [8]. Financial Loss/Gain (FLG) is defined as:

FLG_t = Profit_t^exp − Profit_t^r,  t = 1, ..., T   (5)

The positive/negative values of FLG_t indicate the financial loss/gain of the GenCo for time interval t. Note that FLG is different from the Economic Loss Index and the Price Forecast Disadvantage Index presented in [6]. Those indexes try to measure the economic impact of price forecast inaccuracy; thus, they are based on comparing the expected profit of the GenCo with the ideal profit obtained from a fictitious scenario, i.e., when the real prices are hypothetically available and self-scheduling is done based on them. On the other hand, FLG is based on comparing the expected and real profits (and not the ideal profit). In other words, FLG
evaluates the real financial loss or gain of the GenCo, i.e., the profit that a GenCo really loses or achieves with respect to what is expected. FLG is also different from the risk measures used in some previous self-scheduling research works. Risk indices, such as value-at-risk [9], conditional value-at-risk [10], and down-side risk [11], are before-the-fact criteria that measure the financial risk threatening a self-schedule based on the likely low-profit scenarios. On the other hand, FLG is an after-the-fact index, i.e., it gives a measure of the profit that a GenCo actually loses, after the market clearing, with respect to the expected profit. For this reason, a forecast strategy is proposed in this paper to estimate FLG. Moreover, FLG can also measure the financial gains due to favorable fluctuations of electricity prices. Finally, FLG is different from the information-gap decision theory indices. Although these indices can consider both financial loss and gain, similar to FLG, information-gap decision theory takes a threshold for financial loss/gain and returns the associated robust/opportunistic self-schedules [12]. On the other hand, FLG takes a specific self-schedule and determines its financial loss/gain.

The actual prices are not available for FLG calculation. Thus, the proposed FLG-based approach uses price forecasts, similar to bidding strategies. However, the proposed FLG prediction strategy can provide valuable after-the-fact information, i.e., an estimate of the real financial loss/gain, for a GenCo. This key information is not given by the previous before-the-fact criteria. A GenCo must prepare its day-ahead bids before the market is cleared. Therefore, actual day-ahead prices for day D are not available to a GenCo when it prepares the bids for day D [3]. Thus, a GenCo must forecast the electricity prices of day D, say sometime on day D-1, to prepare its bids for day D. Day-ahead FLG prediction has the same conditions as day-ahead price forecasting. In other words, when we prepare generation bids for day D, the actual FLG values up to hour 24 of day D-1 can be considered known, since the calculation of FLG requires actual price values, and actual price values up to hour 24 of day D-1 can be considered known. Thus, to prepare generation bids for day D, the FLG values of day D should be predicted, similar to predicting the price values of day D. In the next subsection, the predictability of FLG is evaluated. Subsequently, a forecast strategy is proposed to predict it in Section III.

B. Volatility Analysis of the FLG Time Series

The volatility and the predictability of a time series are tightly coupled, such that higher volatility leads to lower predictability and vice versa [13]. Two volatility measures are used in this paper. The first one is in the time domain, defined as the annualized standard deviation of the gradient of the normalized signal [14]. The standard deviation represents the statistical dispersion of a data set and the gradient represents the changes of a signal; their combination illustrates the statistical dispersion of the signal changes. Higher values of this statistical dispersion (i.e., higher volatility) mean more diverse variations of the signal, which make it less predictable, since the forecasting engine should learn more complex behaviors of the signal. Additionally, for a signal with diverse variations, the forecasting engine may encounter variation patterns in the prediction phase whose similar patterns have not occurred in the training phase. Thus, the forecasting engine has not learned these variation patterns of the signal, which can lead to high prediction errors in such cases. Normalization in the definition of this volatility metric allows the comparison of different signals and makes the metric independent of the signal amplitude. The second measure is in the frequency domain, and evaluates the high-frequency content of the signal (or, equivalently, the share of sharp ramps and sudden changes). It is defined as the ratio of the energy pertaining to the high-frequency component of the discrete wavelet transform to the total energy of the signal [13]. A greater value of this metric indicates more sharp ramps and more non-smooth behavior of the signal. Sharp ramps and non-smooth behavior of a signal make predicting its future values more complex and cause higher prediction errors (i.e., the signal becomes less predictable). Thus, the two volatility metrics in the time domain and in the frequency domain represent important characteristics of a signal that are relevant for its prediction.

The numerical results obtained for these volatility measures are presented in Table I. To give better insight into the volatility of the FLG signal, the results of these two volatility measures for the load, electricity price and FLG in the same markets (i.e., PJM in the US and Ontario in Canada) and the same period (i.e., year 2011) are compared in Table I. Hourly real-time market prices are used to evaluate price volatility in both the PJM and Ontario electricity markets. Also, the commitment and dispatch variables obtained from the self-scheduling model presented in [8] have been used to calculate FLG. The time-domain and frequency-domain volatility measures are calculated for exactly the same time series that should be forecasted; thus, the effects of different components, such as short-run trends or periodical elements, are not removed from the load, electricity price, or FLG signals. It is seen that the volatility of the electricity price is higher than that of the load based on the two measures and in both markets. This result is in line with the observations reported in the price forecast literature, such as [15]. In turn, Table I shows that the volatility of FLG is higher than that of the electricity prices in terms of both the time- and frequency-domain measures and in both markets. To illustrate the effect of this higher volatility, the time series of normalized load, price and FLG of the PJM electricity market are shown in Fig. 2 for September 2011. The price forecast method employed to generate the FLG time series of Fig. 2 will be introduced in Section IV. Each time series of Fig. 2 is normalized by dividing it by its maximum value. The load time series shows smooth variations and a nearly periodic pattern. However, outliers and sudden changes, as well as weak periodic behavior, are observed in the price time series. The behavior of the FLG time
series is similar to the behavior of the price time series, with more non-smooth variations and irregular fluctuations. For a forecasting engine, learning the variation patterns of the FLG time series in Fig. 2(c) and tracking its changes are much more complex than learning the variation patterns of the load time series in Fig. 2(a) and tracking its changes. Additionally, FLG changes in both the positive and negative domains, unlike the load and price signals, which only change in the positive range (except for some rare negative values of the electricity price time series).

[Table I. Volatility comparison of the load, price and FLG time series of the PJM and Ontario electricity markets in year 2011; rows: volatility measures in the time and frequency domains; columns: load, price and FLG for each market.]

[Fig. 2. Normalized load (a), price (b) and FLG (c) for September 2011 in the PJM electricity market.]

III. THE PROPOSED FLG FORECAST STRATEGY

The architecture of the proposed forecast strategy is shown in Fig. 3. The strategy is composed of three parts. At first, the FLG time series is quantized using the Silhouette criterion and the k-means clustering technique to simplify the subsequent feature selection and forecast processes. Then, a new feature selection method, based on conditional mutual information (CMI) and sequential forward search (SFS), is applied to find a minimum subset of the most informative features for the FLG forecast process. Finally, an extreme learning machine (ELM), which is fed by the input variables selected by the feature selection method, is used as the forecasting engine; it predicts the class label of the quantized FLG signal for the future time intervals. The feature selection and forecasting engine are separately implemented for each hour of the forecasting horizon (e.g., the next day). The three parts of the FLG forecast strategy are described in the next subsections, respectively.

[Fig. 3. The proposed FLG forecast strategy: the normalized FLG time series is quantized (Silhouette criterion + k-means clustering technique); the quantized series and the normalized relevant exogenous variables (price forecast, price forecast error and unit bids) feed per-hour CMI+SFS feature selection (hours 1 to 24), whose selected features feed per-hour ELM classifiers that output the FLG(t+1), ..., FLG(t+24) class label predictions.]

A. Quantizing the FLG Signal

It should be noted that point estimates, i.e., directly forecasting the values of FLG, considering its highly volatile behavior, may lead to unreasonable prediction errors. At the same time, determining the FLG class label, indicating its range, is usually sufficiently informative feedback for a GenCo to modify its bidding strategy, if proper FLG ranges are specified. These two reasons motivate classifying FLG and predicting the FLG class labels instead of its values. The initial idea of this process is taken from [16], which proposes classification instead of direct estimation for predicting highly volatile price spikes. Moreover, quantizing the output variable usually results in faster and more precise training of classification tools [17].

The quantization procedure consists of arranging the continuous feature (here, the output variable, i.e., FLG) into a finite number of intervals. The performance of the quantization strongly depends on the number of generated intervals and on the location of the cutting points which separate the intervals, in such a way that the whole domain of FLG values is occupied. The proposed FLG quantization approach finds the optimum number of classes and optimizes the class arrangement by a combination of the Silhouette criterion [18] and the k-means clustering technique. Suppose that NS historical samples of FLG are available (e.g., one year, or 8760 hourly samples). Each observation of FLG consists of a single value (i.e., the historical FLG in one hour). The proposed quantization approach can be summarized as the following step-by-step algorithm:

Step 1) The proposed approach begins with the lowest number of classes. Thus, the number of classes, denoted by NC, is set to 2.

Step 2) The NS historical samples are classified into NC classes using the k-means clustering technique. This technique is based on the well-known Euclidean distance between the cluster centroids and the samples; details of the k-means clustering method can be found in [19]. However, the classes obtained from the k-means clustering method might not occupy the whole domain of the output variable; for instance, a gap may exist between two successive classes. To remedy this problem, cutting points are defined for the clusters obtained from the k-means technique. The cutting point between two successive classes is the average of the upper bound of the lower class and the lower bound of the upper class, and it is considered the border of the two successive classes. After defining the cutting points, the obtained classes occupy the whole domain of the target variable.

Step 3) The goodness-of-fit of the classification performed in the previous step is evaluated by the Silhouette criterion. For this purpose, the Silhouette criterion for each sample measures the similarity of that sample to the samples in its own cluster and to the samples in the other clusters as follows:

s(i) = (n(i) − m(i)) / max{m(i), n(i)}   (6)

where m(i) is the average dissimilarity of sample i with its own cluster and n(i) is the lowest average dissimilarity of sample i with the other clusters. The average dissimilarity between sample i and a cluster is measured in terms of the average Euclidean distance between sample i and the samples of that cluster. Based on this definition, the dissimilarity between two identical samples is zero. In (6), s(i) ∈ (−1, +1) represents the Silhouette value of the i-th sample. Low m(i) values indicate low dissimilarity between the samples of a cluster, or equivalently high similarity between them, which means that the cluster
is well-organized. High n(i) values indicate high dissimilarity between different clusters, which means that the clusters are well-separated. Low m(i) and high n(i) result in high s(i), and vice versa. If most samples have small positive or negative Silhouette values, the performed classification is poor; e.g., it may have either too many or too few clusters. Thus, the average of the s(i) values over all samples provides a single measure for the goodness-of-fit of the classification:

SC_NC = (1/NS) Σ_{i=1}^{NS} s(i)   (7)

where SC_NC is the Silhouette criterion value for the classification obtained with NC classes.

Step 4) If NC < NC_max, set NC = NC + 1 and go back to Step 2; otherwise, the algorithm is terminated, the NC leading to the highest SC_NC value obtained so far indicates the optimum number of classes, and its associated classification represents the optimum classification solution. The classes of FLG are defined based on this solution. The maximum number of classes, NC_max, is a setting of the proposed quantization approach, selected based on engineering judgment and acceptable computation burden.

B. Feature Selection Method

In a feature selection method, a feature represents a candidate input. In this research work, feature selection is used to filter out the ineffective candidate inputs to enhance the effectiveness of the subsequent forecasting engine in constructing the input/output mapping function. Fig. 3 shows the position of the feature selection parts within the proposed forecast strategy. Every feature selection part, dedicated to one hour of the forecasting horizon, is composed of CMI and SFS. The feature ranking component of CMI determines the amount of information that a feature shares with the output variable (here, the FLG class label) conditioned on the subset of selected features. Before the mathematical derivation of the CMI criterion, some data mining concepts, including mutual information (MI), relevancy, redundancy and complementarity, should first be introduced. MI represents the amount of joint information between two random variables. It is measured based on individual and joint entropies; the MI definition and formulation for two random variables can be found in [20]. Hereafter, the MI of two random variables X and Y is shown by I(X; Y). Suppose X = {X_1, ..., X_l, X_m, ..., X_n} is the n-dimensional vector of candidate inputs, and T is the output variable. For the FLG forecast process, I(X_l; T) denotes how much common information the candidate input feature X_l and the output variable T share. Thus, a higher/lower value of I(X_l; T) indicates that more/less information about T can be obtained by considering X_l, and so I(X_l; T) can give a measure of the relevancy of X_l for the forecast process of T. Additionally, I(X_l; X_m) represents redundant information, or redundancy, between the two candidate inputs X_l and X_m. Conditional MI, or CMI, is defined among three random variables as I(X_l; T | X_m), which measures the reduction of the uncertainty of T due to knowledge of X_l when X_m is given:

I(X_l; T | X_m) = H(T | X_m) − H(T | X_l, X_m)   (8)

where H(·|·) is the conditional entropy, as an uncertainty measure, of its argument random variable [20]. Note that H(T | X_l, X_m) ≤ H(T | X_m), which means that by adding the knowledge of the feature X_l, the uncertainty of T decreases. In feature subset selection, CMI can give an evaluation of the complementary information achieved by adding a variable to the already selected subset of features. The following theorem provides a connection between MI and CMI based on the chain rule:

Theorem 1. The MI between a set of features X = {X_1, ..., X_n} and the output variable T can be expressed in terms of n individual CMIs as below:

I(X; T) = I(X_1, ..., X_n; T) = Σ_{i=1}^{n} I(X_i; T | X_1, ..., X_{i−1})   (9)

The proof of this theorem is given in [20]. In (9), I(X; T) indicates the entire information about T that the set X includes, according to the additivity property of MI [20]. This also implies that the maximum information about T is achieved when all candidate features (i.e., all features of X) are considered. However, X is usually a large set of features, which is not directly applicable to a forecasting engine. In other words, a forecasting engine cannot learn the impact of all features of X on T. Even if the training process of the forecasting engine converges, its computation time will be very high and the quality of the training results cannot be guaranteed. Thus, it is strongly needed to filter out the features of X that are less informative for the forecast process of T, i.e., the features that are insignificant in I(X; T), such as redundant features. Using Theorem 1, I(X; T) can be decomposed as follows:

I(X; T) = I(S; T) + I(S̄; T | S)   (10)

where S = {X_1, ..., X_i} is the set of selected features and S̄ = {X_{i+1}, ..., X_n} is the complement of S, i.e., the set of not-selected features. Note that there is no feature before S = {X_1, ..., X_i} in X; thus, the first CMI on the right-hand side of (10) does not have any condition according to (9) and so becomes MI, i.e., I(S; T | {}) = I(S; T). However, the second term is a CMI conditioned on the previously selected features S, i.e., I(S̄; T | S). Since I(X; T) is constant, maximizing I(S; T) is equivalent to minimizing I(S̄; T | S). To minimize I(S̄; T | S), SFS is suggested here, which begins with an empty set and adds features one by one to the selected subset. In general, consider step k of the proposed SFS, in which the k-th feature X_k should be removed from the set of not-selected features of the previous step, i.e., S̄_{k−1},
and added to the set of selected features of the previous step, i.e., S_{k-1}. Using the chain rule of (10), the k-th step can be formulated as follows:

I(S̄_{k-1}; T | S_{k-1}) = I({X_k, S̄_k}; T | S_{k-1}) = I(X_k; T | S_{k-1}) + I(S̄_k; T | {S_{k-1}, X_k})   (11)

I(S̄_k; T | S_k) = I(S̄_{k-1}; T | S_{k-1}) − I(X_k; T | S_{k-1})   (12)

To derive (11) from (10), X should be replaced by S̄_{k-1} considering the condition of S_{k-1}. Accordingly, for the first term on the right-hand side of (11), i.e., I(X_k; T | S_{k-1}), the condition of S_{k-1} is not changed, while for the second term, the selected feature of the first term, i.e., X_k, should be added to the condition, leading to I(S̄_k; T | {S_{k-1}, X_k}), which is I(S̄_k; T | S_k) in (12). In (12), I(S̄_{k-1}; T | S_{k-1}) is independent of the feature selected in step k, i.e., X_k, since S̄_{k-1} and S_{k-1} are related to step k−1. Thus, changing X_k does not change I(S̄_{k-1}; T | S_{k-1}), and so this term appears as a constant in (12). Therefore, minimizing I(S̄; T | S) in step k, which becomes minimizing I(S̄_k; T | S_k), is equivalent to maximizing I(X_k; T | S_{k-1}). This is an interesting conclusion. Its interpretation is that in each step we should select the feature that achieves the maximum mutual information with the output variable T conditioned on the previously selected features (and thus considering possible redundancies/complementarities with the previously selected features). However, the computation of I(X_k; T | S_{k-1}) is still difficult due to the presence of S_{k-1} in the condition part, leading to high-order interactions. To address this problem, the following approximation, derived for the conditional mutual information based on one-to-one dependencies [21], can be employed to compute I(X_k; T | S_{k-1}):

CMI(X_k) = I(X_k; T | S_{k-1}) ≈ I(X_k; T) − (1/|S_{k-1}|) Σ_{X_i ∈ S_{k-1}} I(X_k; X_i) + (1/|S_{k-1}|) Σ_{X_i ∈ S_{k-1}} I(X_k; X_i | T)   (13)

Through (13), I(X_k; T | S_{k-1}) can easily be computed using only first-order interactions, i.e., at most one variable appears in the condition parts. This equation provides an approximation of I(X_k; T | S_{k-1}), or CMI(X_k), which measures the mutual information between X_k and T conditioned on the previously selected feature subset S_{k-1}. Thus, CMI(X_k) can be used as an information-value criterion to evaluate the effectiveness of the candidate input X_k for predicting the output feature T. Combining CMI and SFS, the proposed feature selection method can be summarized as the following stage-wise procedure:

Stage 1) Initialize k = 1, S_0 = {}, and S̄_0 = X, where X indicates the set of candidate inputs. Set the validation error of S_0, denoted by VE_0, to infinity. VE indicates the misclassification rate, i.e., the percentage of the validation samples whose FLG classes are incorrectly predicted, obtained using the selected features. The average of the 10 validation errors obtained from 10-fold cross-validation [22] is used as the validation error VE_k of the selected subset S_k in this procedure. A brief review of 10-fold cross-validation is presented in the Appendix.

Stage 2) Rank the features of S̄_{k-1} based on the CMI(·) criterion introduced in (13). Suppose that the candidate feature of S̄_{k-1} with the highest rank, i.e., the highest CMI(·) criterion value, is denoted by x_r^{k-1}.

Stage 3) Construct S_k = S_{k-1} + {x_r^{k-1}} and S̄_k = S̄_{k-1} − {x_r^{k-1}}.

Stage 4) Determine the validation error of S_k, i.e., VE_k, through the forecasting engine, which will be introduced in the next section. If VE_k < VE_{k-1}, increment k (k = k+1) and go back to Stage 2; otherwise, restore the previous S_{k-1} as the final subset of selected features and terminate the feature selection process.

It is seen from the above procedure that the subset of selected features at each stage is validated by the forecasting engine. Due to the integration of the performance validation process for the classifier into the feature selection process, the generalization capability of the proposed forecast strategy is substantially improved compared to classifier-independent filtering methods.

C. Forecasting Engine

The
selected features for each hour of the next day feed one independent ELM classifier to predict the associated FLG class label, as shown in Fig. 3. ELM is an efficient learning algorithm with high generalization capability and a short training period for single-hidden-layer feed-forward neural networks. In the ELM learning method, the weights between the input and hidden layers are randomly chosen.
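As a minimal sketch of this training scheme (random input-layer weights and biases, sigmoid hidden-layer outputs, and an analytic least-squares solve for the output weights via the Moore-Penrose pseudo-inverse), the following illustrative implementation may help; the function names and toy data are assumptions, not the authors' code:

```python
import numpy as np

def elm_train(X, Y, n_hidden, seed=0):
    """Draw random input-layer weights, then solve the output-layer
    weights analytically with the Moore-Penrose pseudo-inverse."""
    rng = np.random.default_rng(seed)
    W = rng.uniform(-1.0, 1.0, (X.shape[1], n_hidden))  # random input weights
    b = rng.uniform(-1.0, 1.0, n_hidden)                # random hidden biases
    H = 1.0 / (1.0 + np.exp(-(X @ W + b)))              # hidden-layer output matrix
    beta = np.linalg.pinv(H) @ Y                        # analytic output weights
    return W, b, beta

def elm_predict(X, W, b, beta):
    H = 1.0 / (1.0 + np.exp(-(X @ W + b)))
    return np.argmax(H @ beta, axis=1)                  # predicted class indices

# Toy usage: two separable classes with one-hot class targets.
X = np.array([[0.0, 0.0], [0.1, 0.1], [1.0, 1.0], [0.9, 1.1]])
Y = np.eye(2)[[0, 0, 1, 1]]
W, b, beta = elm_train(X, Y, n_hidden=20)
pred = elm_predict(X, W, b, beta)
```

Because the output weights come from a single pseudo-inverse solve rather than iterative back-propagation, training is fast, which is consistent with the short training period noted above.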

Then, the weights connecting the hidden and output layers are analytically determined to minimize the classification error by calculating the Moore-Penrose generalized inverse of the hidden-layer output matrix. Mathematical details of the ELM learning method can be found in [23]. The optimal number of hidden nodes, as a setting, for each ELM classifier of the proposed forecast strategy is determined by 10-fold cross-validation. Similar to other iterative forecast methods, the proposed FLG forecast strategy uses its own predictions for the hours of the next day whose FLG values are not available. For instance, to predict FLG for hour 8 of the next day, the FLG predictions of the proposed forecast strategy for hours 1-7 are used, provided that FLG(t−1) to FLG(t−7) are selected by the feature selection method for hour 8.

IV. NUMERICAL RESULTS

A thermal-based GenCo with four units, owning 780 MW of capacity, is considered for the numerical experiments of this section. The data of this GenCo, such as the unit specifications and initial conditions, are taken from [24]. To increase the sensitivity of the GenCo's units to price forecast error, the minimum up/down times of the units are limited to 4 hours. The self-scheduling formulation presented in [8] is adopted for the GenCo. This formulation assumes that the hourly Market Clearing Price is not changed by the GenCo's bid. The self-scheduling formulation is a Mixed Integer Linear Programming optimization problem, which is solved using the CPLEX solver within the GAMS platform [25]. This self-scheduling model has been used in many other research works in the area, such as [6], [8]. Additionally, this bidding strategy has only been used as a case study in this paper to numerically illustrate the effectiveness of the proposed approach. However, the application of the proposed FLG forecast approach is not limited to any bidding strategy, price forecast method, or even market structure, and it can be used with other alternatives. The proposed FLG forecast approach only
requires the historical FLG data, historical price forecast data, and historical price forecast error data, as well as day-ahead price forecasts and day-ahead generation bids, as its inputs. These inputs can easily be provided in an electricity market. If the bidding strategy, price forecast method, or electricity market is changed, only the inputs of the proposed FLG forecast approach change, without affecting the applicability of the proposed approach. For instance, if the bidding strategy is changed, only the historical values of FLG and the day-ahead generation bids change, but the proposed FLG forecast approach can again train the input/output mapping function between the new inputs and the output (i.e., the future values of FLG). Afterward, by receiving the inputs pertaining to the next day, the proposed FLG forecast strategy can predict the day-ahead FLG values. For the GenCo, the proposed FLG forecast strategy is tested on the PJM and Ontario electricity markets. Hourly Ontario Energy Prices and prices of the PJM day-ahead spot market are used for testing the proposed approach. The price data of the PJM and Ontario electricity markets are obtained from their websites [26] and [27], respectively. For the construction of the FLG time series, the day-ahead operating schedule must first be specified using the forecasted prices. To obtain day-ahead price forecasts, a combination of the proposed feature selection method and 24 cascaded ELM regressions [22] (including one ELM for each hour) is used. The MAPE of this price forecast method for the PJM and Ontario electricity markets in year 2011 is 19.45% and 27.35%, respectively. The higher price forecast error of the Ontario electricity market is consistent with the higher volatility of the price time series of this market, indicated in Table I. The proposed price and FLG forecast approaches are implemented within the programming environment of the MATLAB 2014a software package [28]. It is worthwhile to note that the proposed FLG forecast strategy can work with
other price prediction methods. However, as price prediction is not the focus of this paper, it is not further discussed here.

A. Quantization results

The optimal number of classes and the class ranges pertaining to FLG are determined using the proposed step-by-step quantization algorithm presented in Section III.A. The results obtained for SC_NC (i.e., the Silhouette criterion given in (7)) with NC varying from 2 to 30 (i.e., NC_max = 30) in the PJM and Ontario electricity markets are shown in Fig. 4(a) and 4(b), respectively. Since the test year for both the PJM and Ontario electricity markets is 2011, the historical data of one year earlier (i.e., 8760 hours) is used in this numerical experiment to obtain the optimum FLG quantization. Fig. 4 shows that the maximum value of SC_NC for both the PJM and Ontario test cases is obtained for NC = 5. Thus, this value of NC is considered for the next numerical experiments. These five classes are named high gain, low gain, moderate, low loss, and high loss here; their ranges and frequency distributions, obtained by the proposed quantization approach, are shown for the PJM and Ontario electricity markets in Tables II and III, respectively. Tables II and III show that the moderate class has the highest frequency, the low gain/loss classes have lower frequency, and the high gain/loss classes have the lowest frequency in both markets. Although the setting of NC_max is selected based on engineering judgment and acceptable computation burden, it is obvious that very high values of NC_max would not be useful. High NC_max values lead to dividing compact and well-shaped clusters into two or more nearby clusters, which are not needed and only complicate the classification task. Then, the cluster centers and cutting points cannot be determined precisely, which degrades the performance of the proposed quantization approach. Here, by further increasing NC_max beyond 30 in the numerical experiments of Fig. 4(a) and 4(b), no better result is obtained. Indeed, a
setting of NC_max = 10 was sufficiently high for both the PJM and Ontario electricity markets.

B. Evaluating the performance of the proposed feature selection

In this work, the historical data of the 50 days prior to the forecast day is used for both feature selection and training of the forecasting engine [15]. As shown in Fig. 3, the candidate inputs of the FLG forecast process include the lagged values of the normalized-quantized FLG as the autoregressive part, as well as lagged and current values of the normalized price forecast, lagged values of the price forecast error, and current values of the unit bids as the exogenous variables. Thus, the set of candidate inputs X becomes as

follows:

X = {FLG(t−1), …, FLG(t−200), PF(t), PF(t−1), …, PF(t−200), PFE(t−1), …, PFE(t−200), UB_1(t), …, UB_G(t)}   (14)

where FLG(t−i), PF(t−i), and PFE(t−i) stand for the FLG, price forecast, and price forecast error of i hours ago, respectively; PF(t) represents the price forecast of the current hour; UB_k(t) is the generation bid of the k-th unit for the current hour; and G indicates the number of generating units of the GenCo. For each lagged set, 200 past values are considered in (14) to cover short-run dependencies, daily periodicities, and weekly periodicities of the associated variable [15]. The UB features only include the capacity offered by the GenCo (i.e., MW), since the price bids (i.e., $/MWh) are considered the same as the price forecast features. Typical results obtained from the proposed feature selection method for the first hour of September 1, 2011 in the Ontario market are presented in Table IV. In this table, the feature selection step (i.e., k), the feature selected in each step (i.e., x_r^{k-1}), the CMI of the selected feature, and the validation error (VE) are shown. For adding a ranked feature to S_k, two factors compete with each other: its information value (which enhances the performance of the forecasting engine) and the increased complexity of the training phase yielded by adding the feature (which decreases the effectiveness of the forecasting engine).

Fig. 4. The results of the proposed quantization method for (a) PJM and (b) Ontario electricity markets in year 2010 with NC_max = 30.

TABLE II
CLASS RANGE AND FREQUENCY DISTRIBUTION FOR FLG CLASSES OF THE PJM ELECTRICITY MARKET

Class label   Class range        Count   Percent
High Gain     (−∞, −10268]
Low Gain      (−10268, −1250]
Moderate      (−1250, 1293]
Low Loss      (1293, 7338]
High Loss     (7338, +∞)
Sum

TABLE III
CLASS RANGE AND FREQUENCY DISTRIBUTION FOR FLG CLASSES OF THE ONTARIO ELECTRICITY MARKET

Class label   Class range        Count   Percent
High Gain     (−∞, −23345]
Low Gain      (−23345, −4070]
Moderate      (−4070, 3662]
Low Loss      (3662, 13609]
High Loss     (13609, +∞)
Sum

In the proposed
feature selection method, CMI ranks the features, but SFS selects the best subset of the ranked features, implementing the best compromise between the two competing factors.

TABLE IV
PROPOSED FEATURE SELECTION RESULTS FOR HOUR 1 OF SEPTEMBER 1, 2011 IN THE ONTARIO MARKET

Step   Selected feature   CMI   VE (%)
1      FLG(t−1) *
2      FLG(t−2) *
3      PF(t) *
4      PFE(t−177) *
5      PF(t−21) *
6      PFE(t−32) *
7      PFE(t−19) *
8      PF(t−19) *
9      PFE(t−175) *
10     UB_4(t) *
11     PFE(t−120) *
12     PFE(t−153) *
13     PFE(t−30) *
14     PF(t−189)
15     PF(t−93)
* features retained in the best subset

For the test case of Table IV, the subset of the first 13 features, marked with an asterisk, is selected by SFS as the best subset, leading to the minimum VE. Lower or higher numbers of features lead to higher VE. By increasing the number of selected features from 13 to 15, VE increases further. In each step, the feature with the highest CMI is selected. The selected features have a decreasing CMI trend, but some slight deviations are observed (e.g., from step 4 to 5), since S_k and S̄_k change from one step to the next. The first two selected features of Table IV, i.e., FLG(t−1) and FLG(t−2), represent the short-run trend characteristic of the FLG time series. Also, one UB feature as well as several features from PF and PFE are among the selected features of Table IV, due to the relevance of FLG to UB, PF, and PFE. However, FLG is a highly volatile time series, as shown in subsection II-B. It does not have regular patterns and periodic behaviors, as illustrated in Fig. 2(c). Thus, the time lags of the selected PF and PFE features cannot be explained based on periodic and regular patterns. For instance, the load time series usually has daily and weekly periodic behaviors. For this reason, time lags of one day (24 hours) ago and one week (168 hours) ago, as well as their neighboring time lags, are seen in the features selected for load forecasting [29], [30]. On the other hand, such periodic behaviors are not seen in the FLG time series, and so the time lags of the selected PF and PFE features cannot be explained based on these periodicities. However, these PF and PFE features really
enhance the FLG prediction accuracy, as shown in Table IV. For this reason, the SFS mechanism based on VE is added to the feature selection part of the FLG forecast strategy to determine the optimum subset of features for FLG prediction. In Table V, the test errors (TE) of the proposed and seven other feature selection methods for the 12 months of year 2011 in the PJM and Ontario (ON) electricity markets are compared. TE measures the prediction error, or misclassification rate, of the proposed FLG forecast strategy for the unseen test samples or forecast samples, such as the samples related to the hours of the next day. However, validation samples are a part of the historical data that are purposely retained unseen by the feature selection method and forecasting engine to simulate the behavior of the test samples. In this way, VE can provide an

estimate of TE. As TE is not available under real forecast conditions, its estimated value, i.e., VE, is used to select the best subset of the ranked features through the SFS mechanism. More details about VE and TE can be found in [15], [31].

TABLE V
MONTHLY TE FOR DIFFERENT FEATURE SELECTION METHODS FOR THE PJM AND ONTARIO (ON) ELECTRICITY MARKETS IN YEAR 2011
Columns: IG, CFS, Relief-F, FCBF, DISR, MRMR, CMIM, and Proposed, each reported for PJM and ON; rows: Jan through Dec and Average.

After predicting the FLG classes for the 24 hours of each forecast day, the sliding window of the historical data, i.e., the 50 days prior to the forecast day, advances by one day. Then, the feature selection process and the training phase of the ELM forecasting engines are performed with the updated historical data, and the FLG classes of the next day are predicted. Thus, the hourly samples of the forecast day, for which TE is evaluated, are always unseen by the proposed FLG forecast strategy. The monthly TE in Table V measures the number of forecast hours with incorrect FLG class prediction among all hours of the month. For the sake of a fair comparison, the other parts of the proposed forecast strategy, including the quantization part and the ELM classifiers, are kept unchanged for the seven other feature selection methods of Table V. Among these seven methods, information gain (IG), correlation-based feature selection (CFS), Relief-F, fast correlation-based feature selection (FCBF), and maximum relevancy minimum redundancy (MRMR) are available in the feature selection repository software package [32], and double input symmetrical relevance (DISR) and conditional mutual information maximization (CMIM) are available in the feature selection toolbox [33]. For the sake of conciseness, only the best feature selection methods of these two software packages are considered for the comparison of Table V, while the other methods of these packages lead to higher misclassification rates for FLG. The IG method only assesses the
relevancy of the candidate features. CFS, Relief-F, FCBF, DISR, and MRMR evaluate both the relevancy and the redundancy of the features. CMIM can consider complementarity in addition to relevancy and redundancy for feature selection, similar to our proposed approach. However, CMIM is based on pair-wise CMI, while our proposed approach selects features considering the subset CMI. In other words, in the proposed method, CMI is evaluated for each feature conditioned on the already selected subset, and not only based on conditional pair-wise dependencies as calculated in CMIM. Moreover, in the proposed approach, the subset selected in each step is validated using the forecasting engine accuracy, which yields a better generalization capability. For these reasons, the proposed feature selection approach outperforms all seven other comparative methods of Table V. The proposed approach not only has the lowest average TE among all methods, indicated in the last row of Table V, but also has a lower misclassification rate than all other methods in all 12 test months for both electricity markets. Only CMIM reaches the same FLG misclassification rate as the proposed approach, in the April and November test months for the Ontario electricity market and the July test month for the PJM electricity market. The CMI+SFS feature selection method effectively evaluates the information value of each candidate input for the FLG forecast process. If PF(t) has low information value for the FLG forecast process (e.g., due to a high price forecast error leading to low relevancy of the price forecast to FLG), the proposed feature selection method filters it out so that it cannot deteriorate the FLG prediction accuracy. Thus, the proposed FLG prediction strategy can effectively deal with price forecast errors. The average computation time of the proposed FLG forecast strategy, including the run time of the quantization component, the execution time of the feature selection parts, and the training time of the ELM classifiers, was less than twelve minutes for the test
cases of this paper. This computation time, measured on the simple hardware set of a personal computer with a 2.6 GHz dual-core processor and 2 GB of RAM, is acceptable within a day-ahead decision-making framework.

C. Application of FLG prediction for improving bidding strategy

With the aid of FLG prediction, a GenCo can increase/decrease its generation offer for hours with financial gain/loss (with a higher increase/decrease for higher gains/losses). In other words, a GenCo can improve its bidding strategy to attain higher profit. To numerically illustrate this matter, we simply adopt the FLG-based improvement approach for the bidding strategy depicted in Table VI. The increase/decrease of the generation offer is applied to the units of the GenCo according to the priority list, which is constructed based on the marginal cost of the units, considering the capacity and ramp up/down limits of the units. The monthly profit of the GenCo with the original and improved bidding strategies, denoted by MP and IMP, respectively, is shown in Table VII. Also, the improvement percentages (%) are given in this table. It is seen that the improvement percentage is always positive for all months in both markets, indicating that the bidding strategy based on the FLG prediction consistently leads to higher profit for the GenCo. The improvement percentage is in the range of 0.6%-2.72% and 1.08%-8.01%