openSAP Getting Started with Data Science

Exercise Week 1 Unit 6: Initial Data Analysis & Exploratory Data Analysis

TABLE OF CONTENTS

INTRODUCTION
EXERCISE INSTRUCTIONS
  Acquire Data
  Visualize Room
  Create Geographical Hierarchy
  Data Visualizations
  Descriptive Statistics
FURTHER READING

INTRODUCTION

These exercises introduce some of the methods we can use to undertake Initial Data Analysis using the SAP BusinessObjects Predictive Analytics expert tool. The data to be used is opensap_stores_us.csv, a short list of US-based retail stores. The data set contains the following variables:

STORE: US city location of the store
TURNOVER: annual sales for the previous 12-month period for each store ($000000)
SIZE: size of retail floor space for each store (000s of sq. ft.)
STAFF: number of staff members (in 10s)
MARGIN: total gross margin per store ($00000)

There are 5 columns of data and 150 rows. The columns represent the variables defined above, and each row holds the values of these variables for one individual store.

The exercises show a variety of visualizations you can use to gain a deeper understanding of the data and undertake an Initial Data Analysis.
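As a rough cross-check outside the tool, the layout described above can be verified with a few lines of Python. This is a minimal sketch and not part of the exercise: it assumes the file is comma-separated with a header row naming the five variables, the function name load_stores is hypothetical, and the embedded sample rows are placeholders rather than values from opensap_stores_us.csv.

```python
import csv
import io

# Column names as listed in the introduction (assumed to match the file header).
EXPECTED_COLUMNS = ["STORE", "TURNOVER", "SIZE", "STAFF", "MARGIN"]


def load_stores(handle):
    """Read store rows from an open CSV handle and convert the numeric columns to float."""
    reader = csv.DictReader(handle)
    if reader.fieldnames != EXPECTED_COLUMNS:
        raise ValueError(f"unexpected columns: {reader.fieldnames}")
    rows = []
    for record in reader:
        row = {"STORE": record["STORE"]}
        for name in EXPECTED_COLUMNS[1:]:
            row[name] = float(record[name])
        rows.append(row)
    return rows


# Placeholder rows for illustration only; NOT taken from the real file.
SAMPLE_CSV = """STORE,TURNOVER,SIZE,STAFF,MARGIN
City A,5.0,3.0,1.5,0.5
City B,6.0,2.5,4.0,1.0
City C,7.0,3.5,6.0,2.0
"""

stores = load_stores(io.StringIO(SAMPLE_CSV))
print(len(stores), "rows,", len(EXPECTED_COLUMNS), "columns")
```

To run the check against the real file, pass an open file handle, for example open("opensap_stores_us.csv", newline=""), instead of the in-memory sample. If the file matches the description, the row count should be 150.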

EXERCISE INSTRUCTIONS

Acquire Data

Open SAP BusinessObjects Predictive Analytics. Select Expert Analytics, then open Expert Analytics.

Select Acquire Data. For this exercise, the data set is the text file opensap_stores_us.csv, so select Text as the data source and then press Next.

Navigate to the folder where you have downloaded the data sets that accompany this training and select opensap_stores_us.csv. Press Open. The selected data will be read by SAP Predictive Analytics. Press Create.

Visualize Room

The data set will be created and you will enter the Visualize Room. On the left side you will see the Measures and Dimensions. Data is grouped into measures (for quantitative data) and dimensions (for categorical data). Measures and dimensions can be dragged directly to the Chart Canvas or to shelves in the Chart Builder.

Dimensions can be thought of as the rows in a spreadsheet. They are the things you want to track: customers, pages, country of origin, product category and other items whose attributes are often non-numerical. Commonly used dimensions are people, products, place and time.

Working with dimensions is often described as "slice and dice". Slicing refers to filtering data. Dicing refers to grouping data. A common example involves sales as the measure, with customer and product as dimensions. In each sale a customer buys a product. The data can be sliced by removing all customers except for a group under study, and then diced by grouping by product.

Measures are like the columns in a spreadsheet. They are the quantities you want to measure: visits, page views, hits, bounce rate and other items that can be quantified numerically. A measure is a property on which calculations (e.g., sum, count, average, minimum, maximum) can be made.

This exercise will use a number of different data visualizations to give you a deeper understanding of the data. SAP Predictive Analytics has already created measures for some numeric variables (automatic enrichment). Under Dimensions it has listed the variables as numeric (123), and the STORE name as a potential geographical variable (world icon). We can use this last piece of information to create a geographical hierarchy.
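The slice-and-dice example above (sales as the measure, customer and product as dimensions) can be sketched in plain Python. This is an illustration of the concept only, not of how the tool implements it: the sales records and the helper names slice_rows and dice_rows are hypothetical.

```python
from collections import defaultdict

# Hypothetical sales records: customer and product are dimensions,
# amount is the measure.
sales = [
    {"customer": "Alice", "product": "Shoes", "amount": 50.0},
    {"customer": "Bob", "product": "Shoes", "amount": 70.0},
    {"customer": "Alice", "product": "Hats", "amount": 20.0},
    {"customer": "Carol", "product": "Hats", "amount": 30.0},
    {"customer": "Alice", "product": "Shoes", "amount": 25.0},
]


def slice_rows(rows, dimension, keep):
    """Slice: keep only rows whose dimension value belongs to the group under study."""
    return [row for row in rows if row[dimension] in keep]


def dice_rows(rows, dimension, measure):
    """Dice: group rows by a dimension and sum the measure within each group."""
    totals = defaultdict(float)
    for row in rows:
        totals[row[dimension]] += row[measure]
    return dict(totals)


# Slice to the customers under study, then dice by product.
studied = slice_rows(sales, "customer", {"Alice", "Carol"})
by_product = dice_rows(studied, "product", "amount")
print(by_product)
```

Summing is only one of the calculations a measure supports; the same grouping step could apply a count, average, minimum or maximum instead.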

Create Geographical Hierarchy

Click on the Options button after the STORE variable. Select Create a geographic hierarchy, then By Names. Press Confirm.

Compared against the internal look-up table, 139 geographical areas are solved and 11 are unsolved. The unsolved areas occur because multiple cities share the same name, so manual confirmation is required to resolve the conflict. Manually resolve the 11 unsolved areas, then press Done.
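The reason some names stay unsolved can be illustrated with a small sketch. It assumes a look-up table that maps each city name to its candidate regions; the table contents and the function name reconcile are hypothetical, and the tool's internal look-up table and matching logic are not shown in this exercise.

```python
# Hypothetical look-up table: city name -> candidate regions.
# The real tool's internal table is not documented here.
LOOKUP = {
    "Fresno": ["California"],
    "Oakland": ["California"],
    "Springfield": ["Illinois", "Massachusetts", "Missouri"],
}


def reconcile(names, lookup):
    """Split names into solved (exactly one candidate) and unsolved (none or several)."""
    solved, unsolved = {}, {}
    for name in names:
        candidates = lookup.get(name, [])
        if len(candidates) == 1:
            solved[name] = candidates[0]
        else:
            # Ambiguous or unknown names need manual confirmation.
            unsolved[name] = candidates
    return solved, unsolved


solved, unsolved = reconcile(["Fresno", "Springfield", "Oakland"], LOOKUP)
print(len(solved), "solved,", len(unsolved), "unsolved")
```

A name with several candidates cannot be assigned automatically without extra information such as a state column, which is why the ambiguous entries are left for manual resolution.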

The geographical hierarchy will now appear under the Dimensions section.

Data Visualizations

There are a number of different visualizations you can now produce to start to gain a deeper understanding of the data:

Simple bar chart:

Simple bar chart with multiple measures:

Bar chart filtered for the top 10 stores by turnover:

To achieve the filtered chart, ensure you have TURNOVER as the measure, select the 123 radio button and choose the filter.

Geographical maps can be used:

The map indicates that some of the automatically assigned geographical areas are wrong, as the cities should all be located in the USA. This error is an important finding in the analysis and the data should be corrected as follows: Select the options on the dimension Geography_STORE. Select Edit Reconciliation. Correct the errors:

Press Done.

This analysis indicates that there is a wide distribution of margin and size across all of the stores in the US. To gain more specific information, you could filter for the top 20 or bottom 20 stores, for example.

The Scatter Matrix Chart will give you an initial understanding of possible outliers and groups within the data. Select the Scatter Matrix Chart:

This will create the following visualization:

The scatter matrix chart shows that there are possibly two or more groups of stores.

A bubble chart:

The bubble chart shows 4 variables: TURNOVER, SIZE, STAFF and STORE. It is filtered to show the stores for California only. Several stores stand out. For example, the store in the top right bubble, Santa Clarita, has large STAFF, TURNOVER and SIZE. However, just underneath there is another similar-sized bubble representing Moreno Valley that has similar staff and turnover, but a smaller store size. It would be interesting for the organization to understand why this store can achieve similar turnover with the same number of staff but in a smaller retail area. There are also some other interesting stores to investigate, such as Fresno and Oakland.

Heat Maps and Tree Maps can provide useful comparative data insight:

The data can be viewed so you can pinpoint stores and look at the actual data:

The Parallel Coordinates Chart can also be used to see outliers and potential groups in the data (remove the filter first):

The Parallel Coordinates Chart confirms that there are possibly two or more groups of stores. Looking at the last vertical axis, for MARGIN, you will see that the stores are grouped into possibly two regions. This is also true for the STAFF axis. Interestingly, high MARGIN stores seem to group with high STAFF, low STAFF and relatively low TURNOVER. This insight suggests that a segmentation model might give us very interesting results, but more about these algorithms later in the course.

Radar charts can be used to compare different dimensions and point to unusual values. This chart has been simplified by filtering on the top 15 stores by selecting the option in the 123 radio button:

Descriptive Statistics

Go to the Predict Room. The data is shown in the data component on the left-hand side. Click the green arrow radio button. Press the OK button and the data will be analysed.

This will take you to the Results tab with the Data Grid. Select the Statistical Summary Chart radio button:

This will give the Statistical Summary Chart, which shows the distribution, count of all values, minimum, maximum, range, standard deviation, variance, average and sum for each of the measures.

Note that the count for each variable is 150. This means there are no missing values in any of these variables. The distributions of TURNOVER and SIZE are fairly normal, with averages of 5.84 and 3.05 respectively. However, the distributions of STAFF and MARGIN are bimodal, with two distinct peaks. The STAFF values range from 1 to 6.9, which represents 10 to 69 staff members because STAFF is recorded in 10s.
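The summary figures listed above can be reproduced for any single measure with the Python standard library. The sketch below is illustrative only: the function name summarize is hypothetical, the STAFF values are placeholders rather than data from the file, and the use of the sample (n - 1) form for variance and standard deviation is an assumption, since the exercise does not say which form the tool reports.

```python
import statistics


def summarize(values):
    """Return the summary figures listed for the Statistical Summary Chart."""
    return {
        "count": len(values),
        "min": min(values),
        "max": max(values),
        "range": max(values) - min(values),
        "average": statistics.mean(values),
        "sum": sum(values),
        # Sample (n - 1) forms; whether the tool uses sample or
        # population formulas is an assumption.
        "variance": statistics.variance(values),
        "std_dev": statistics.stdev(values),
    }


# Placeholder STAFF values (recorded in 10s of staff), for illustration only.
staff = [1.0, 1.5, 4.5, 5.0, 6.9]
summary = summarize(staff)

# STAFF is stored in 10s, so multiply by 10 to get head counts.
headcount_min = round(summary["min"] * 10)
headcount_max = round(summary["max"] * 10)
print(summary["count"], headcount_min, headcount_max)
```

Comparing the count against the expected number of rows is the same missing-value check described above: a count below 150 for any measure would indicate gaps in that column.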

The Parallel Coordinates Chart is available (this was described above):

The Scatter Matrix Chart is available (this was described above):

This completes the introductory exercise to Week 1 Unit 6 Initial Data Analysis & Exploratory Data Analysis.

FURTHER READING

There are many more visualizations in the Visualize Room that you can experiment with. You can also compose stories by selecting important presentations in the Compose Room, and then share them in the Share Room. Detailed instructions and information regarding other visualization options can be found in the user guide pa31_expert_user_en.pdf.

Coding Samples

Any software coding or code lines/strings ("Code") provided in this documentation are only examples and are not intended for use in a production system environment. The Code is only intended to better explain and visualize the syntax and phrasing rules for certain SAP coding. SAP does not warrant the correctness or completeness of the Code provided herein and SAP shall not be liable for errors or damages caused by use of the Code, except where such damages were caused by SAP with intent or with gross negligence.

SAP SE or an SAP affiliate company. All rights reserved. No part of this publication may be reproduced or transmitted in any form or for any purpose without the express permission of SAP SE or an SAP affiliate company. SAP and other SAP products and services mentioned herein as well as their respective logos are trademarks or registered trademarks of SAP SE (or an SAP affiliate company) in Germany and other countries. Please see for additional trademark information and notices. Some software products marketed by SAP SE and its distributors contain proprietary software components of other software vendors. National product specifications may vary. These materials are provided by SAP SE or an SAP affiliate company for informational purposes only, without representation or warranty of any kind, and SAP SE or its affiliated companies shall not be liable for errors or omissions with respect to the materials. The only warranties for SAP SE or SAP affiliate company products and services are those that are set forth in the express warranty statements accompanying such products and services, if any. Nothing herein should be construed as constituting an additional warranty. In particular, SAP SE or its affiliated companies have no obligation to pursue any course of business outlined in this document or any related presentation, or to develop or release any functionality mentioned therein. This document, or any related presentation, and SAP SE's or its affiliated companies' strategy and possible future developments, products, and/or platform directions and functionality are all subject to change and may be changed by SAP SE or its affiliated companies at any time for any reason without notice. The information in this document is not a commitment, promise, or legal obligation to deliver any material, code, or functionality. All forward-looking statements are subject to various risks and uncertainties that could cause actual results to differ materially from expectations.
Readers are cautioned not to place undue reliance on these forward-looking statements, which speak only as of their dates, and they should not be relied upon in making purchasing decisions.