Towards a new methodology for processing scanner data in the Dutch CPI 1

Size: px

Start display at page:

Download "Towards a new methodology for processing scanner data in the Dutch CPI 1"

Egbert Thornton
5 years ago
Views:

1 Towards a new methodology for processing scanner data in the Dutch CPI 1 Antonio G. Chessa, 2 Stefan Boumans 2 and Jan Walschots 2 1 The authors want to thank various colleagues, both at Statistics Netherlands and of other statistical offices, for their continuous support and discussions. The authors also thank Professor Bert Balk for his comments on a previous, different version of the paper. The views expressed in this paper are those of the authors and do not necessarily reflect the policies of Statistics Netherlands. 2 Statistics Netherlands, Team CPI; P.O. Box 24500, 2490 HA The Hague, The Netherlands. Correspondence can be sent to: ag.chessa@cbs.nl 1

2 Abstract This paper presents a new methodology for processing electronic transaction data and for calculating price indices, with the aim of reducing the methodological differences across retailers and consumer goods in the Dutch CPI. Articles (GTINs or EANs) are combined into homogeneous products, each of which is defined by a set of article characteristics. Articles that share the same characteristics form a product, which should capture price increases associated with relaunches (EAN changes). Two methods for selecting article characteristics are described, which have given the same outcomes. The new index method calculates price indices as a ratio of a turnover index and a weighted quantity index. Weights of homogeneous products are calculated from prices and quantities of multiple periods. Product weights are updated each month in order to incorporate new products timely into the index calculations. Results show that their contribution to a price index may be significant. The method does not lead to chain drift and does not require price imputations. The new methodology is intended to replace the current sample-based methods in the CPI for a department store and for mobile phones in January Results show clear improvements over the current methods and the previously used survey-based methods. Keywords: Scanner data, CPI, GTIN/EAN, relaunch, product homogeneity, index theory. 2

3 1. Introduction Scanner data have clear advantages over traditional survey data collection, notably because such data sets offer a better coverage of articles sold, sales data offer complete transaction information (prices and quantities), and the data collection process is automatised. In spite of their potential, scanner data are still used by a small number of statistical agencies in their CPI, but the number is likely to increase during the coming years. 3 By scanner data we mean transaction data that specify turnover and numbers of articles sold by EAN (or GTIN, barcode). At the time of introduction in the Dutch CPI in 2002, scanner data involved two supermarket chains. In January 2010, the data were extended to six supermarket chains, as part of a re-design of the CPI (de Haan, 2006; van der Grient and de Haan, 2010; de Haan and van der Grient, 2011). At present, scanner data of 10 supermarket chains are used and surveys are not carried out anymore for supermarkets since January Beside supermarkets, scanner data from other retailers are used since January Other forms of electronic data containing both price and quantity information are obtained from travel agencies, for fuel prices and for mobile phones. More than 20% of the Dutch CPI is now based on electronic transaction data (in terms of Coicop weights). The shift from traditional price collection to electronic transaction data has introduced new possibilities for developing index methods. Ideally, we would like to develop a method that makes use of both prices and quantities, and that processes the transactions of all EANs instead of taking a sample. 4 With thousands of EANs per retailer the question is how to find efficient solutions. This has turned out to be a complex process over the years, as the current methods in the Dutch CPI differ across retailers and consumer goods. The current method for supermarket scanner data intends to process all EANs, but evidences different index related issues. The methods for other retailers make use of samples of articles. As the search for new electronic data sources will continue, the question has been put forward whether a generic index method could be developed that is applicable to different types of consumer goods and sources (retailers), and that is capable of handling issues that are not resolved in a fully satisfactory way so far in certain methods (amongst which the relaunch problem and, related to this, the definition of homogeneous products). Such a method could then also be gradually applied to data sets that are currently in production, so that the differences in methods used for different retailers can be reduced. Section 2 gives an overview of the index methods that have been developed for different electronic transaction data and retailers over the past years in the Dutch CPI. An outline of a new methodology for processing electronic transaction data is presented in Section 3. The intention of this section is to show how the new methodology fits within the CPI system. The aim of the methodology is twofold: (1) to process all EANs, thus abandoning the traditional approach of selecting a basket of goods, and (2) to have an index method that deals with the dynamics of an assortment over time, in which new goods are timely included, and that efficiently handles relaunches. 3 In Europe, six countries will be using scanner data in The scanner data workshops in Vienna (2014) and Rome (2015) evidenced that several countries are expecting their first data, while other countries made concrete steps towards acquiring their first scanner data. 4 In this paper, the term article and EAN/GTIN are used interchangeably. 3

4 Sections 4 and 5 elaborate on two essential aspects of the new methodology. The relaunch problem implies that EANs are not always appropriate as unique identifiers of homogeneous products. Product homogeneity should then be achieved at a broader level, at which EANs are combined into groups. Homogeneous products or EAN groups could be defined by combining EANs that share the same set of characteristics. These have to be selected in some way. Section 4 proposes and compares two selection methods. Homogeneous products constitute one of two intermediate article group levels between the EAN level and the lowest publication level (L-Coicop). 5 At the product level, turnover and quantities of articles sold are summed and used to calculate unit values per product. These are used to calculate price indices for so-called consumption segments, which combine different homogeneous products (e.g., a segment T-shirts with underlying products that are described by one or more characteristics). The index method that has been developed for this purpose is described in Section 5. A price index for a consumption segment is calculated as the ratio of a turnover index and a weighted quantity index. The essential element of the index method is the definition and calculation of the weights of the homogeneous products, such that new products can be directly included. Price indices at Coicop levels are calculated according to conventional Laspeyres type methods. Some results obtained with the new methodology are presented in Section 6. Section 7 summarises and concludes with short-term plans with the methodology. 2. Historical overview of methods for processing scanner data The search for efficient processing methods and index methods for scanner data has proven to be a challenge throughout the years. Different directions have been investigated for supermarket scanner data, which may be useful to share with statistical agencies that are thinking about using scanner data or are about to use scanner data in their CPI. Over the past years, three methods were developed at Statistics Netherlands for supermarket scanner data, which are referred here to as versions 0, 1 and 2. The choices and decisions made for these methods are summarised in Table 1. The first method was proposed at the end of the 1990s. EANs were considered to be a natural choice for homogeneous products. The availability of weekly turnover data led to suggesting the Fisher index as an ideal index. Because of frequent assortment changes a monthly chained Fisher index was considered. However, the price indices showed a strong downward drift. As a consequence, this method was never implemented. Based on these experiences, a switch was made to a method that used a (large) basket and a Laspeyres index with yearly fixed weights. This method was implemented, but article replacements turned out to be very time consuming as the dynamics of assortments increased. After seven years it was decided to develop a less labour intensive method. 6 The third method (version 2) is the method currently used for supermarket scanner data. The idea of using a basket of goods was put aside, and a return was made to processing all transactions. This decision led to adopting a monthly chained index again, like in version 0. But in the light of the experiences with a strong drifting behaviour, an 5 In the current Coicop classification, L-Coicops are specified at the fifth digit level at most (depending on division). 6 In classical surveys, consumer specialists define representative products within Coicops. Overall, the Dutch CPI contains between 0 and 1500 of such products. In method version 1, a basket of about 10,000 EANs per retailer was used (Table 1), which makes manual replacement of articles much more difficult. 4

5 index method was chosen that assigns equal weights to EANs. A drawback of this choice is that EANs with negligible turnover would receive a relatively high weight. A turnover threshold was therefore introduced in order to exclude such EANs from the index calculations (see de Haan and van der Grient (2011) for more details). Table 1. Methods developed at Statistics Netherlands for supermarket scanner data. Choices and decisions Version 0 Version 1 Version 2 Developed/used in Late 1990s present Sample All data Basket All EANs that satisfy (± 10,000 EANs certain filters per retailer) Homogeneous products EANs EANs EANs Replacements No Yes, manually No (EANs with large turnover share) Index method Monthly chained Laspeyres, with Monthly chained Jevons, Fisher index yearly fixed weights with equal weights for 'accepted' EANs; Laspeyres for (L-)Coicops Implemented? No Yes Yes Monthly chained Jevons indices are calculated for consumption segments. 7 The price indices for consumption segments are aggregated to Coicop levels by applying a Laspeyres index with fixed weights for each consumption segment, which are revised each year. The impact of equal weights is thus confined to consumption segments. An important issue with the current method is that relaunches are not handled in a satisfactory way. Price increases after relaunches are not captured. A so-called dump price filter is introduced in order to prevent price indices from serious downward biases. Strong price decreases for EANs that are about to be replaced, in combination with low quantities sold, are filtered out from the calculations. Other practical issues are that the filter settings need to be tested for each new data set, and possibly also repeated tests need to be done in the future in order to verify their validity. Beside supermarkets, scanner data and other electronic transaction data sets are used for other retailers. Table 2 shows a complete overview of the retailers and types of consumer goods. More than 20 per cent of the entire CPI is based on electronic transaction data (in terms of Coicop weights). The index methods for retailers other than supermarkets make use of a sample of EANs. Like in traditional surveys, a limited number of products is defined, for instance, by picking a small number of brands based on turnover share. EANs are combined into products based on brand and one or two additional characteristics, such as package 7 In this context, consumption segments are the internal scanner data aggregates in Dutch CPI jargon (abbr.: ISBAs). These ISBAs are our own harmonised reclassification of the retailers own classifications of EANs (called ESBAs), which may differ across supermarket chains. 5

6 content. Laspeyres indices with yearly fixed product weights are used for these retailers. The primary focus in these methods was to capture price increases after relaunches. Table 2. Use of scanner data and other types of electronic transaction data compared to survey data, by retailer/consumer goods in the Dutch CPI, as percentages of the sum of the Coicop weights in Retailers Transaction data Survey data Supermarkets* 12.9 Do it yourself stores* Department stores* Drug stores* Travel agencies 1.7 Fuel 3.6 Mobile phones 0.5 Other 78.0 Total * Scanner data, i.e. transaction data specified by EAN/GTIN A drawback of the methods for non-supermarkets is the small size of the samples. In addition, products might not be homogeneous enough. We come back to these issues in Section 6. Given the diversity of methods across retailers and consumer goods, and the different potential problems identified in this section, a study was set up last year with the aim of finding a more generic method that is ideally applicable to all transaction data sets, does not draw samples but processes all transactions, and that resolves the issues mentioned above. 3. Outline of a new processing framework The introduction of different methods for different retailers in the CPI has made the system increasingly complex over time. New choices were made each time a new data set was added to the production system. The current index method for supermarkets makes use of different types of price and turnover filters, which have to be regularly checked and tested for each new chain. The methods for non-supermarkets make use of samples, which need continuous monitoring as their turnover shares might become less significant due to new developments in the assortments. For these reasons, the possibilities of developing a generic method have been studied in order to reduce the methodological differences between retailers and types of consumer goods. The new methodology focuses on two main problems: How could a relaunched article and its predecessor be combined, that is, be considered as equivalent instances of the same set of articles ( homogeneous 6

7 product )? How could homogeneous products be defined, and how could articles be matched in an efficient way, without time consuming manual interventions? What kind of index method could be developed, such that all transactions can be processed and new articles can be directly included? The problem of product homogeneity and the index method developed for the above purposes are treated in sections 4 and 5, respectively. In the present section we sketch how these components are intended to be integrated into the full CPI production system, based on our experiences so far with the tests performed with the new methodology. The processing of electronic data in the Dutch CPI can roughly be subdivided into four stages: 1. Reading and checking data; 2. Linking articles/eans to (L-)Coicops; 3. Calculating prices and price indices at lower aggregate levels ; 4. Calculating price indices for (L-)Coicops and for the CPI as a whole. The first stage consists of reading data files and performing basic checks on the data, such as the correctness and completeness of records and record variables and controlling for quantities sold with value zero (which are isolated before article prices are calculated). The subsequent three steps are worked out in more detail in the chart of Figure 1. The lower aggregate levels mentioned in step 3 consist of three levels, which are explained below. Figure 1. Nested group levels of individual articles in the CPI, and price definitions and price index calculations at these levels in the new methodology. Article groups Index calculation (L-)Coicops Laspeyres type indices Retailer's ESBAs Consumption segments QU-indices Homogeneous products Product prices (unit values) Individual articles/eans Transaction prices Article group levels Consumer goods and services are subdivided in the CPI into Coicops. The most detailed level of publication within Coicop divisions is referred to as L-Coicop in the Dutch CPI. Scanner data contain transaction data at EAN level. For reasons explained previously, a 7

8 further subdivision is made between L-Coicop and EAN level. Individual EANs may have to be combined into groups, which we refer to as homogeneous products. These products and their underlying articles need to be linked to L-Coicops, which has to be done in an efficient way if we aim at processing all EANs. For this purpose, it is important to ask retailers for their own classification of EANs (called ESBAs in our system). Usually, we take the most detailed ESBA level for establishing the EAN-Coicop links. However, the most detailed ESBAs may still cover more than one L-Coicop, so that we need to define an intermediate level between L-Coicops and homogeneous products. This intermediate level, which we call consumption segments, may be derived from more detailed EAN characteristics (more details are given in Section 4). 8 In our first tests with the new methodology, we have chosen consumption segments to be types of article. Examples of segments are men s T-shirts, ladies cardigans or chocolate. Each of these segments contains a set of homogeneous products. For T-shirts, a product may contain EANs that have the same number of items per package, the same sleeve length, fabric and colour. In this way we obtain a nested partition of individual articles/eans at different levels, as is shown in Figure 1. The question is how products can be defined, which article characteristics to select and what methods could be considered for this purpose. This will be treated in Section 4. Calculation of price indices At each level in Figure 1 we either need to define prices or establish the method for calculating price indices. The price of an individual article is its transaction price, that is, turnover divided by number of articles sold (in fact, this is a unit value at EAN level). For homogeneous products we use the same definition. Both turnover and quantities sold are summed over articles that share the same set of selected characteristics. The ratio of these two sums for groups of articles is usually referred to as a unit value. Unit values and quantities sold for homogeneous products are subsequently used to calculate a price index for each consumption segment. A new index method is developed for this purpose, referred to as the QU-method, which is described in Section 5. Price indices for consumption segments are then aggregated to L-Coicops according to traditional Laspeyres type indices, with weights based on turnover of the preceding year Consumption segments and product homogeneity In order to make choices about consumption segments and homogeneous products, statistical agencies should ask retailers for information about article characteristics and article classifications used by retailers for their own purpose (ESBAs). Information about article characteristics may be contained in article descriptions and also in detailed ESBAs. Our experiences with electronic data sets are that this information may be supplied in varying formats by different retailers. For instance, the record variables in drugstore scanner data are all contained in separate columns (Chessa, 2013). Information about 8 Consumption segments play the same linking role as the ISBA classification in our current system. 9 Aggregation to Coicop levels could also be carried out by applying the aggregation according to the QU-method, by summing turnover and weighted quantities in the numerator and the denominator of index formula (1) over consumption segments (see Section 5.1). This is a more consistent aggregation method. Preliminary research has shown that the differences between the two aggregation methods are very small at L-Coicop level (Chessa, 2014). As a consequence, we have decided to stick to the traditional way of aggregating to Coicop levels. 8

9 article characteristics may also be exclusively contained in text strings with EAN descriptions. The first example is clearly the preferred data format, as consumption segments and products can be derived immediately, and EANs can be automatically assigned to both article group levels and linked to Coicop. In the second case, some form of text mining will have to be applied in order to retrieve and place information about article characteristics in separate columns. Text mining falls outside the scope of this paper and will therefore not be treated further. Consumption segments are defined as sets of homogeneous products. We have taken consumption segments to be equivalent with types of article. Article types can be defined at different levels of detail (e.g., socks as a whole, or sports, thermal and walking socks as separate types). In our first tests with department store scanner data, we have defined article types at the most detailed level of the two mentioned in the example (i.e., sports, thermal and walking socks). We could say that we differentiate article types, and hence consumption segments, according to purpose. Defining tighter consumption segments increases the chance of index imputations. We come back to this point in Section 6, where a test case is presented. An advantage of defining tighter consumption segments is that price indices are purer in some sense. The new index method computes weights per unit of product sold (Section 5). In our socks example, product weights for walking socks are merely based on prices for that article type and are not mixed with price information and the price index for other types of socks. In order to avoid potential problems with relaunches, we combine articles into homogeneous products. This could be done by selecting a set of relevant article characteristics and combine those articles into groups (products) that share the same characteristics. A walking sock product could contain articles with two pairs of socks per package, with colour brown and a specific type of fabric. The question is which article characteristics to select and what methods could be used for this purpose. Before proceeding, we introduce the following terminology. By characteristic of an article we refer to an instance, a specific value that an article can take. Such a value belongs to a broader set, which we refer to as attribute. For example, white is a characteristic of a T-shirt that belongs to the article attribute colour. The selection of article attributes is traditionally part of the consumer specialist s domain. Consumer specialists define products by selecting specific characteristics, for instance, a specific brand name, package content and a consumer target group. We could build on this idea when using scanner data and supplement it with a sensitivity analysis. A more technical method could be suggested as an alternative, which belongs to the field of statistical model selection (Claeskens and Hjort, 2008). We briefly describe both approaches, which will subsequently be compared with an example. Sensitivity analysis The selection of article attributes covers three stages in this approach: 1. For a given consumption segment, the consumer specialist selects a number of article attributes that (s)he finds to be relevant. This gives rise to an initial set of products; 2. A price index is calculated for the consumption segment, according to the method of Section 5; 3. A sensitivity analysis is performed: an attribute that was not selected in step 1 is now added and the price index is re-calculated. If the price index changes significantly, 9

10 then the attribute is accepted. This step can be repeated with other attributes. Attributes may also be omitted when their impact on the price index is negligible. A meaning has to be given to what is found to be a significant impact on a price index after adding or omitting attributes in step 3. What is found to be an acceptable tolerance? This may be harder to establish for consumption segments than at Coicop level. Experience has to be gained with this aspect when the methodology is taken into production. Statistical model selection The point of departure in this approach is a stochastic model for the price of an article. Following the terminology adopted for the QU-method in Section 5, the price of an article is decomposed into a product-specific component (the product to which an article/ean belongs), a time-specific component (price index with respect to a base month) and some random (residual) term. Different sets of attributes may give rise to different numbers of products, and thus different numbers of product-specific parameters. The problem is to compare different model versions, with different numbers of free parameters. This can be done by calculating information criteria for each model version, which represent a class of statistical fit measures that consist of two components: Under suitable assumptions, the first main term simplifies to a sum of weighted least squares. The squared differences compare the article transaction prices to the model prices (or a logarithmic transformation of both prices, depending on the type of model); A term that involves the number of free parameters in a model. The second term acts as a penalty term in assessing model fits. Adding parameters to a model may decrease the weighted least squares term, but also increases the risk of overfitting. The aim is to select a model version that balances model fit against model complexity (number of parameters). The model version with the optimum value of the information criterion is finally selected. An advantage of the first method over the second is that it is less technical and, as a consequence, is expected to be easier to understand among CPI users. An advantage of the statistical method is that it can be automatised, although this should also be possible with the first method. The statistical method does not need to specify a tolerance level, like the first method. It simply selects that model version, with its corresponding set of homogeneous products, that optimises the information criterion. However, there are different types of information criteria, which differ by the size of the penalty term. This raises the obvious question which one to select. Based on experiences with scanner data, Chessa (2015) suggests to use the Bayesian Information Criterion. A consideration that is not mentioned above is that article attributes may be treated in different ways. For instance, each package content may be treated as a different value, thus giving rise to a different product, or different contents give rise to the same product with quantity-adjustments applied to each. 10 It is beyond the scope of this paper to treat 10 This is what we decided in our current method for drugstore scanner data. However, we defined content thresholds in order to keep single item articles separate from multipacks, which are treated as different products. 10

11 this aspect in detail, but it is important to be aware of the fact that different treatments can be given to article attributes. It is advisable to compare their impact on a price index. Example The new methodology has been tested on scanner data of a department store. Historical data from the period February 2009 until March 2013 have been used to define homogeneous products for consumption segments in different Coicops and to test the index method of Section 5. We take menswear and ladies wear as examples to illustrate the above two methods for defining products. In the traditional survey for the department store, the consumption segment T-shirts was characterised by the consumer specialist in terms the following attributes, for both ladies and men: Number of items in a package; Sleeve length; Colour. Fabric was not mentioned explicitly, possibly because the consumer specialist had decided to use an article description with one specific fabric in mind (e.g., standard cotton). Suppose we exclude fabric initially. Adding fabric as a fourth attribute in a sensitivity analysis has a very small impact on the price index for men s T-shirts, but a large impact is found for ladies T-shirts. Fabric should therefore be added to the list of attributes for T- shirts (which could also be done for both men and ladies in order to limit the number of different lists of attributes across consumption segments). Figure 2 shows the price indices for men s and ladies wear for the department store before and after adding fabric for T-shirts. There is hardly any difference at the L-Coicop level for menswear, while a small difference can be noted for ladies wear (due to the smaller weight of T-shirts compared to other consumption segments). The sensitivity analysis gives the same results as the statistical method. For the consumption segments other than T-shirts, the statistical method even resulted in the same selection of attributes as was specified by the consumer specialist in the traditional survey. Based on these results, it may be advisable to start experimenting with the simpler sensitivity analysis when the new methodology goes into production. Figure 2. Price indices for menswear and ladies wear for a Dutch department store, before and after adding fabric as an extra attribute for T-shirts (Feb = ). Menswear Ladies' wear Fabric added for T-shirts Survey-based choices Fabric added for T-shirts Survey-based choices Our primary objective to characterise homogeneous products in terms of a set of attributes does not necessarily imply that we exclude EANs as product choices in any case. 11

12 Article characteristics are contained in the EAN descriptions (text strings) for the department store scanner data. A considerable part of the EAN descriptions only consists of one to three terms. Defining homogeneous products with such a small number of attributes may introduce a risk of missing important attributes. In such cases, one should first contact and visit the retailer and discuss the problem. Additional information could also be collected through web scraping. But under certain circumstances it is not needed to collect more information about article characteristics, which is illustrated by the following examples. Figure 3 shows the price indices at two levels of product differentiation for the restaurant part of the department store and for kitchen textiles. Both article assortments are stable over time, so that calculating price indices with EANs as products does not lead to problems caused by relaunches. Product differentiation according to a limited set of attributes (article type, size and taste for the restaurant, and only article type and colour for kitchen textiles) gives comparable results. No difference was even found for the restaurant. In cases with short EAN descriptions and stable assortments, the choice of EANs as homogeneous products can be defended. Moreover, it does not require the extraction of article characteristics from EAN descriptions, which has been rather time consuming for the department store data. Figure 3. Price indices for the restaurant and for kitchen textiles of the department store, for two levels of product differentiation, as calculated with the index method of Section 5 (Feb = ) Restaurant Kitchen textiles Products based on attributes EANs as products Products based on attributes EANs as products 5. An index method for consumption segments 5.1 Price index formula Once consumption segments and the homogeneous products in each segment have been defined, the question is according to what method price indices could be computed. The following aspects were considered in our choice of index method: The index method should be able to incorporate new products in the month of introduction into the assortment; Based on our first experience at Statistics Netherlands with scanner data (Version 0, see Table 1 in Section 2), the method should not suffer from chain drift; A price index should simplify to a unit value index when all products are homogeneous. 12

13 Before giving the formulas behind the index method, we introduce some notation. Let and denote sets of homogeneous products in some consumption segment G in periods 0 and t. The sets of homogeneous products in 0 and t may be different. Let, and, denote the prices and quantities sold for product, respectively, in period t. 11 We denote the price index in period t with respect to, say, a base period 0 by. The following formula is proposed for calculating price indices (Chessa, 2015):,,,,,,. ( ) The numerator is a turnover index, while the denominator is a weighted quantity ( volume ) index. The product specific parameters are the only unknown factors in formula (1). Choices concerning the calculation of the are described in Section 5.2. Price index formula (1) can be written in the following compact form:, ( ) where and denote weighted arithmetic averages of the prices and the over the set of products in period t, that is,, respectively,,,,, (3),,. ( ) Notice that the numerator of (2) is equal to the unit value index, where unit values are defined as the ratio of the sum of turnover and the sum of quantities sold over a set of products in a consumption segment, as given by (3). If the products in a consumption segment are homogeneous, then the of all products have the same value. In this special case, price index (1) simplifies to a unit value index, a property that we imposed on the index method. 12 In the more general case where a set of products is not homogeneous, then the unit value index must be adjusted. Price index formula (1) gives a precise expression for the adjustment term, which is the denominator of (2). This term captures shifts in consumption patterns between different periods. A shift towards products with higher weights ( quality ) results in an upward effect on the volume index and, consequently, in a complementary downward effect on the price index. As the method adjusts for shifts between products with different quality, we call index ( ) a quality adjusted unit value index ( QU-index for short). The fundamental question is how values for the could be obtained. This will be discussed in the next subsection. 11 A different notation is used in this paper from the commonly accepted notation of time as a superscript in prices, quantities and indices. In this paper, preference is given to the notation of both product and time indices as subscripts. This was done in order to reserve the superscript for other purposes (see Chessa (2015), Section 2.3). 12 Time-product dummy and hedonic methods (e.g., see de Haan and Krsinich (2014)) do not satisfy this property. 13

14 5.2 Choices concerning the In order to find a method for calculating the product specific weights on the following questions:, we focused How could new products be timely incorporated into the index calculations? The appear in index formula (1) as factors that depend on product, but not on time. Are the constant in time, or can their values be allowed to vary over time? In the latter case, how long should we take the periods on which the are constant? How could chain drift be avoided? Inclusion of new products Formula (1) can be considered as a family of price indices. Special cases can be derived for specific choices of the, which are worth mentioning and are helpful to fix ideas. If we set the equal to the product prices from the publication period t, then it is easily verified that price index (1) simplifies to a Laspeyres index. If the are set equal to the prices of the products sold in the base period 0, then (1) turns into a Paasche price index. The use of price and quantity information from both periods leads to a Lowe type of index in the QUmethod:, (,,, ), (,,, ), ( ) where is the harmonic mean of the quantities sold in the two periods. The three special cases are not able to take into account new products in the year of publication, unless some form of price imputation is carried out. Monthly chaining would be an alternative, but given the problems experienced at Statistics Netherlands with scanner data in the first years, this option is not considered here. Price imputations are not needed if the are based on price and quantity information from multiple periods. This is an essential property of the method, which can be exploited to obtain transitive price indices. Considering product prices and quantities from some period T, we define for product as follows:,,, ( ) where,,, ( ) denotes the share of period z in the total amount of quantities sold for product over period T. Two remarks are worth making: First, it is clear that a choice for the length of the period T must be made, which will be dealt with later in this subsection; Second, the are defined as a weighted average of deflated prices observed in T. The effect of price change is thus removed in order to yield product specific in the volume index of (1). The price index to be calculated also appears in the, which in turn are needed to calculate the price index. In Section 5.3, a computational method is presented that deals with this characteristic of the method. 14

15 The index method ( QU-method ) is completely described by formulas ( ), ( ) and (7). This system of expressions has a counterpart in the field of PPPs, in a method that is known as the Geary-Khamis (GK) method (Geary (1958), Khamis (1972), Balk (1996, 2001, 2012)). The GK-method has been the subject of some debate, essentially because of the linear form proposed for the (e.g., see Balk (1996), p. 214). Alternative forms for (6) are considered in Chessa (2015). These departures from linearity ( perfect substitution ) were shown to have negligible effects on the price indices analysed, so that it was decided to stick to the simpler expression (6). Apart from its simplicity, expression (6) also has another appealing feature. A straightforward rewriting of (6) gives:,,,. ( ) Expression (8) says that is equal to turnover in constant prices of product over period T, divided by the total number of products sold in the same period. The numerator in (8) coincides with the notion of volume as used in national accounts. In this sense, can be defined as volume per unit of product sold. Returning to our proposal of using prices and quantities from multiple periods for calculating the, we conclude by comparing some QU-indices with Lowe type index (5), which is calculated both as a direct index and as a monthly chained index. Figure 4 compares the three price indices for four types of menswear. 13 The results show large differences, which have different causes. While the direct method and the QU-method give comparable results for socks and underwear, the direct method fails for T-shirts. Figure 4. QU-indices compared with direct and monthly chained indices (MoM) for four types of menswear, based on scanner data of a department store (Feb = ) Socks Underwear QU-index Direct index MoM index T-shirts QU-index Direct index MoM index Pullovers and Cardigans QU-index Direct index MoM index QU-index Direct index MoM index 13 In the results shown, T-shirts represent a consumption segment, while the results are aggregated over various consumption segments for the other three article groups. 15

16 The direct method does not capture the contribution of new products to price change in the year of introduction to the assortment. New types of T-shirts, made of organic cotton, were introduced in 2010 at high initial prices, which already started to decrease in The contribution of the new T-shirts is captured by the QU-index, and also by the monthly chained index, but not by the direct index. The latter only evidences the price behaviour of the existing part of the assortment, which, in contrast to the new articles, shows a price increase in The monthly chained indices do not come close to the QU-indices in none of the four cases. This can be partly explained by seasonal effects (e.g., articles returning into the assortment with price increases, which are missed). The examples in Figure 4 show that it is important to have an index method in which not only existing articles enter the calculations, but also new articles. Leaving out new articles in the year of introduction may have a huge impact on a price index. This implies that should be calculated for new products as soon as these appear in an assortment. 14 Length of the time window T Scanner data of a large department store and of drugstores, and also electronic data of mobile phones, have been used to compare time windows that vary between 1 and 4 years in length. Different window lengths were compared by calculating information criteria (Section 4). A unique choice is not easy to make, as different results have been obtained for different types of goods. One-year windows turned out to give slightly better fits for the department store scanner data. Longer windows tend to show better fits for drugstore scanner data, but the differences among the price indices for different window lengths are negligible in most cases. The same holds for mobile phones. A 1-year window fits well with current practice in the Dutch CPI and is advantageous with regard to system maintenance compared to longer windows, as only items sold within one year have to be followed. The problem of chain drift We intend to use one-year windows T and the corresponding to calculate price indices according to expression (1) per publication year. We will do this by using December of the preceding year as a fixed base month. The choices on the and the length of the time window T give rise to index numbers that are transitive in publication years, and are therefore free of chain drift. It is obvious that we need to switch from one set of s to another in the next publication year, a choice that has been motivated by statistical research on historical data, as mentioned previously. The method described in Krsinich (2014) offers an alternative to a method with a fixed base month. She also suggests one-year windows, but proposes to shift the window along with each publication month (quarter in New-Zealand). Price indices for publication periods are calculated by chaining year-on-year indices at each shift of the window. The use of her rolling window approach may lead to large differences compared to our method. Different questions and issues have emerged from the comparison with the rolling window approach. It is not clear how the price indices for different publication years are related to each other. One would expect the rolling window approach to produce results 14 However, note that new products contribute to a price index from the second month in which they are sold. 16

17 that are comparable with our method, but with the choice of base month averaged out. Apparently, this is not the case (Chessa, 2015). As a final note, it may be interesting to mention that we found a price model with December as a fixed base month to give structurally better fits than a model based on the rolling window approach (i.e., the former choice resulted in smaller weighted least squares values). This finding coincides with current practice in the CPI/HICP, as December is the month in which yearly weight revisions are carried out. 5.3 Computation of price indices in practice In order to incorporate new products in a timely way in the index calculations, we calculate from prices and quantities of the publication year. At this point we encounter a problem, since we decided upon a model with yearly constant. We refer to the corresponding index as the theoretical benchmark index. The complete set of annual prices and quantities becomes available only in the final month of a year. This raises the question how to deal with this problem. We propose a method for calculating a real time version of the benchmark index, such that: The are updated each publication month with product prices and quantities which become available in that month; Price indices are calculated with respect to the base month, by making use of the updated. That is, a direct index is used instead of a monthly chained index. These choices ensure that the benchmark and real time indices are equal at the end of each year, so that real time indices are free of chain drift as well. This is an important property of the computational method. The question is how the two price indices compare in previous months. This will be illustrated in Section 6 with several examples. Price indices cannot be calculated directly, since the depend on the price indices. We propose a simple method, which follows an iterative scheme: 1. Suppose that a price index for publication month t has to be calculated. As a first step, choose initial values for the price indices from base month 0 up to month t; 2. Calculate the for each product sold between the base month and month t by making use of product prices and quantities up to month t:,,, ( ) where,,,. ( ) 3. Substitute the obtained in step 2 into expression (1) and calculate updated price indices up to month t; 4. Repeat steps 2 and 3 until the differences between the price indices obtained in the last two iterations are small, according to some pre-defined distance measure. A number of comments need to be made: 17

18 The initial values for the price indices in step 1 can be chosen arbitrarily, for instance, as the algorithm can be shown to converge to a unique solution (when such a solution exists, of course); Computation times can be reduced by constructing suitable initial price indices. A method with this aim is described in Chessa (2015), which has shown that the initial indices already give very good approximations to the final indices; The in step 2 are calculated by making use of product prices and quantities from the base month up to the publication month t. This means that a shorter period is used at the beginning of each year. As an alternative we could use a moving one-year window and include data from the preceding year. The results obtained with the above choices have been satisfactory, as will be shown in Section 6. Therefore we stick to the method presented above so far, which is simpler to implement; Price indices are calculated for each month between the base month and the publication month. However, the price indices up to month t 1 will not be revised, as this is not common practice in the CPI (apart from exceptional cases). This means that only the price index for the publication month will be retained from the calculations, which hence will not be modified in successive months. 6. First tests with the new methodology The methodology presented in this paper has been studied and applied to scanner data of a department store and drugstore chains, and also to transaction data of mobile phones. It has been implemented for the department store and for mobile phones in a test environment. The methodology has been programmed in T-SQL for the department store. It was tested in Excel for mobile phones, as the data set is small (70 devices make up almost % of total turnover). However, the method will be programmed in T-SQL as well for mobile phones in the near future. In this section, some first findings are reported with the methodology from tests for the department store. The following facts show that we are dealing with a big retailer. The data contain over 133,000 EANs for 2014, which are subdivided into 221 ESBAs according to the retailer s article classification. An ESBA gives rise to one or more consumption segments. The article assortment covers eight Coicop divisions. The department store has a weight of 0.7 per cent in the Dutch CPI (see also Table 2, Section 2). One of the key properties of the data is that information about article characteristics is bundled into text strings. This means that we had to find a way to restructure the data into a format, such that the price index related problems could be efficiently dealt with. A method for identifying article characteristics in text strings had to be developed, after which the characteristics could be placed in separate columns. A basic form of text mining was used to identify article characteristics. Lists of key words were set up for the characteristics, based on the coding used by the retailer. The initial stage of this process consisted of a visual inspection of text strings in order to obtain a first impression of the coding system. The key word lists were gradually expanded by isolating the EANs that still did not match the current key words, which could indicate a different coding for the same characteristic or no coding at all. This data processing step proved to be time consuming, especially at the beginning. A retailer may use different ways of coding the same characteristic over time. For instance, 18

we encountered both Dutch and English terms for the same characteristics, and terms were both spelt out and abbreviated, the latter even in different ways.

For instance, single-item packages never have the content of a package mentioned in EAN descriptions, which, however, is specified for the number of items in multipacks.

Some degree of interpretation of the EAN descriptions can thus not be excluded when applying text mining.

Article attributes were selected and homogeneous products were defined by applying the statistical approach described in Section 4.

19 we encountered both Dutch and English terms for the same characteristics, and terms were both spelt out and abbreviated, the latter even in different ways. Missing information about characteristics can be interpreted in different ways. A characteristic may not be mentioned when it concerns a default value. For instance, single-item packages never have the content of a package mentioned in EAN descriptions, which, however, is specified for the number of items in multipacks. For data sets like the scanner data for the department store, one therefore has to try to imagine the logic behind the retailer s coding rules. Some degree of interpretation of the EAN descriptions can thus not be excluded when applying text mining. The results obtained with text mining were subsequently used to address the problems related to the calculation of price indices. Historical data over a four-year period were used for this purpose. Article attributes were selected and homogeneous products were defined by applying the statistical approach described in Section 4. Choices with regard to the QU-method were made as well, which are motivated in sections 5.2 and 5.3. These choices were taken into the implementation and test phase. Price indices have been calculated according to the scheme in Figure 1 (Section 3), which is worked out in more detail in Figure 5 below. Figure 5. Process steps in the implemented methodology for the department store. Define consumption segments Define homogeneous products Calculate price indices Link article types to retailer's ESBAs 'High' risk of relaunches? Real time QUindices for consumption segments YES NO Link article types to EANs in ESBAs Link article characteristics to EANs Laspeyres indices for (L-)Coicops Consumption segment = type of article Product = Combination of characteristics = Group of EANs Product = EAN The linking of article types and characteristics to EANs, as mentioned in the chart, is achieved by searching the text strings for corresponding key words. The chart is in fact generally applicable to any transaction data set. What makes the chart specifically tailored to the department store data is the possibility to take EANs as homogeneous products. As was stated in Section 4, this choice was made for consumption segments with short EAN descriptions and assortments that are stable over time (no or hardly any relaunch). This choice can be easily made for Coicop divisions 01, 02 and 11, at least, in case of the department store. For other Coicops, notably for clothing articles, products are defined in terms of a limited set of attributes (four at most). A question that comes to mind when looking at the chart in Figure 5 is what to do with EANs with short descriptions and frequent relaunches. We have not come across such cases in the department store data. But if such cases would emerge in other data sets, then some decision must be made. Such cases could be left out when their turnover share 19