Bayesian Visual Analysis of the Indian Labour Market


Kaushal Paneri, TCS Research, New Delhi, India
Karamjit Singh, TCS Research, New Delhi, India
Geetika Sharma, TCS Research, New Delhi, India
Aditeya Pandey, TCS Research, New Delhi, India

1. INTRODUCTION

The IKDD CoDS Data Challenge presents an opportunity to explore the dynamics that might influence the Indian labour market by analysing employment data. The challenge asks participants to learn a prediction model for salaries, understand the key dependencies and present insights through visualizations. We use TCS Research's ifuse platform to derive insights and make predictions from the data provided. ifuse is a web-based visual analytics platform with built-in machine-learning capabilities based on Bayesian graphical models. We use ifuse to learn new models using domain knowledge, statistically validate hypotheses, and analyse data, models and model predictions using a variety of visualization techniques. In particular, we have used the following ifuse features to address the data challenge:

1) Bayesian Network Models: Bayesian model learning in ifuse first learns which attributes are most relevant for predicting a target, which in our case is the salary of each individual. Next, an efficiently executable Bayesian network is learned on this feature subset (via an MST embedded in a graph derived from pair-wise mutual information values). The ifuse platform uses exact inference accelerated by an SQL engine (which internally performs query optimization analogous to many poly-tree based exact inference techniques). Running model inference for each record of the test data generates salary predictions.

2) Visual Model Inference: Even before a Bayesian model is applied to test data, the ifuse platform allows visual execution of model inference on training and validation sets to understand the impact of each feature on the target attribute (salary in this case) and the target's sensitivity to it.
Users can query which ranges or values of the features affect salary, aiding a qualitative understanding of the causes of higher or lower salary values. Note that such visual queries use model inference, so for large data volumes they are more efficient than directly querying the data (though this is not an issue in the case of the CoDS data challenge).

3) Visual Analytics: The ifuse platform provides visual analytics capabilities that give a more comprehensive view and generate meaningful insights. We provide interactive visualizations such as motion charts for multivariate time-series data, which can visually represent up to 4 temporal data attributes using the position, size and colour of circles; bubble map charts, which visualize data about specific locations on a map, in our case a map of India; parallel coordinates, which can precisely depict multi-dimensional data in one view; and linked views with probabilistic querying to explore the data and obtain answers to important questions. These additional visualizations lend further insight into the factors affecting the labour market.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org. CoDS Data Challenge 2016, Pune, India. © 2016 ACM. ISBN /08/06... $15.00 DOI:10.475/123 4

We summarise the key findings from our analysis. First, the designations with the highest head counts are software engineer, software developer and systems engineer.
Second, salary is positively correlated with both 12th percentage and CGPA. Third, through Bayesian network learning, we discovered a strong correlation between high salaries and high English scores, while logical test score and CGPA did not seem to affect salary as much. Further, we found no correlation between English scores and Domain scores. Fourth, we found a decreasing trend in the salaries offered to candidates in recent years.

We motivate the use of Bayesian networks for attempting the challenge in section 2 and describe our ifuse platform in section 3. In section 4 we report insights from an exploratory visual analysis of the dataset, and we describe our prediction results in section 5. Finally, we present recommendations in section 6 and conclude in section 7.

2. BAYESIAN GRAPHICAL MODELS

A Bayesian network is a graphical structure that allows representation of, and reasoning about, an uncertain domain. It represents a joint probability distribution (JPD) and consists of two components. The first component, G, is a directed acyclic graph whose vertices correspond to random variables. The second component, the conditional probability table (CPT), describes the conditional distribution of each variable given its parents. The CPT of a node gives the probability of each value the node can take, for every combination of values of its parents. Considering a BN consisting of N random variables X = (X_1, X_2, ..., X_N), the general form of the joint probability distribution of a Bayesian network can be represented as in Equation 1, which encodes the BN property that each node is conditionally independent of its non-descendants given its parents, where Pa(X_i) is the set of parents of X_i.

$$P(X_1, X_2, \ldots, X_N) = \prod_{i=1}^{N} P(X_i \mid \mathrm{Pa}(X_i)) \qquad (1)$$

A Bayesian network (BN) fills a role very similar to that of other machine learning algorithms such as artificial neural networks (ANN), decision trees or support vector machines (SVM). However, a BN has several advantages over some of these algorithms. First, a BN can handle missing values well, whereas many machine learning algorithms require incomplete records to be eliminated or imputed. Second, a BN can be queried, unlike algorithms such as SVM or ANN; for example, what is the expected salary of an individual given his/her degree CGPA, 10th marks, etc.?

To address the problem of analysing labour markets posed by the Challenge, we use our ifuse platform, which can learn efficient Bayesian networks on the data attributes most relevant to the target variable, salary. Further, ifuse can run model inference on each record of the test data to generate the expected value of salary as a prediction, as well as perform conditional queries on the network to provide profile recommendations.

3. IFUSE: A VISUAL BAYESIAN FUSION PLATFORM

In this section, we describe ifuse, our visual Bayesian data fusion platform, through its three main features. The motivation for building a visual analytics platform using Bayesian networks as the sole machine learning engine came from the success, both research and business-wise, of our earlier platform Fusion Workbench [4], which was built for visual analytics over sensor data using multiple machine learning techniques.

3.1 Keyword-based Search

Our platform provides keyword-based search over datasets. When datasets are added to the platform, tags based on column headers are automatically generated and used for indexing files.
Users may add their own tags as well and enter them as search keywords for retrieval later. Datasets are represented as tiles on the search page, as shown in figure 1, with tag clouds providing a sense of what's in the data. The tiles may be flipped by double-clicking on them to see a complete list of attribute names, figure 2.

3.2 Exploratory Data Visualization

Exploratory data analysis is required to get a better understanding of the data before Bayesian modelling can be done. Further, it can lead to insights not derivable from Bayesian reasoning. ifuse provides many visualizations for exploratory analysis, as described below.

Motion Charts: A multi-dimensional visualization which can visually represent up to 4 temporal data attributes using the position, size and colour of circles, as shown in figure 4. Motion animation of the circles is used to depict changes in the data over time.

Parallel Coordinates: Another multi-dimensional data visualization, which allows a larger number of data attributes to be visualised together. Attributes are represented as multiple parallel vertical or horizontal axes, and a data point is represented as a polyline connecting points on each attribute axis, as shown in figure 3. The order of the axes may be changed by dragging, and attributes can be deleted from or added to the plot.

Bubble Map Charts: Plot bubbles or circles at geographical locations on a map, with data attributes mapped to properties of the bubble such as size and colour. A bubble map is shown in figure 7.

Cartograms: Use maps to visualise data about regions such as countries and states. Colour and rubber-sheet distortions in area proportional to data values allow easy comparison of spatial data.

Apart from these well-known data charts, we also have our own visualization designs for specific purposes, such as querying Bayesian networks, described later. Data tiles display icons for the visualizations associated with them.
When multiple visualizations can be drawn for a single dataset, an icon for each is displayed and clicking it opens the selected visualization. All visualizations open in the Compare View page. A list of thumbnails is displayed on the left side of the page, using which users can re-order visualizations vertically, close them or open them in fullscreen mode.

3.3 Visual Bayesian Fusion

ifuse supports and utilises Bayesian models at multiple levels.

3.3.1 Model Learning

Firstly, users can build Bayesian networks by selecting relevant attributes from different datasets joined inside the system. We provide a visual interface for this, as shown in figure 2. Data attributes for model learning can be selected from flipped data tiles and are added to the attribute cart. The user then chooses the Request Network option, selects the target variable and triggers the network learning module in the backend. It returns the top few networks, and the user can choose which ones to save in the platform.

3.3.2 Model Inferencing

Once a network has been saved, it can be used to perform visual model inferencing using what we call a Linked Query View, figure 11. This is an interactive linked view especially designed to query Bayesian networks. The user selects n attributes from the network to query, and these are visualized in an n × n chart grid with the attributes repeated horizontally and vertically. Charts along the diagonal show the probability distribution of the corresponding attribute as a bar chart. Above the diagonal are scatter plots of the data, with the row and column attributes on the x and y axes of the plot respectively. These provide a view of the data used to build the network and can bring out pair-wise correlations between attributes. In order to query the network, users can select ranges for multiple attributes by clicking on the appropriate bars in the bar charts. This conditions the attribute to lie in the range selected by the user.
On hitting the query button, a conditional query is executed on the network using Bayesian inference. The conditional distributions of the other attributes are computed and the bar charts are updated accordingly. We provide a comparison view with the initial and conditional distributions overlaid in different colours so that changes in the distributions can be perceived easily.

3.3.3 Model-based Prediction

ifuse provides a visual interface for model-based prediction using parallel coordinates. The user selects a network to be used for prediction via imputation and a dataset with the target variable missing. We use a horizontal parallel coordinates plot to differentiate it from the exploratory parallel coordinates visualization and to suggest a network structure, which is usually drawn in top-down order, even though the edges have no directionality in this case. The value of the attribute to be imputed is initially 0 for all data points, as shown in figure 3 (a). Clicking the Impute button fires the imputation module at the backend, and the lines for the imputed values move to their positions along the axis, figure 3 (b).

4. EXPLORATORY VISUAL ANALYSIS OF CODS DATASET

In this section we report the results of an exploratory visual analysis of the CoDS data. We prepared four datasets by selecting attributes from the dataset such as Job city, designation, CGPA, Salary and Quant. One of these attributes was selected as a key to group by, and the rest of the attributes were averaged. Details of each dataset are given below.

DS 1: Key: Job Designation; Attributes: Salary, 10Per, 12Per, CGPA, Domain, English, Quant, Logic. We further cleaned this data by fixing typos and merging similar designations such as technical lead and tech lead. This reduced the number of unique designations to 270.

DS 2: Key: Degree, Year; Avg. Attributes: Salary, 10Per, 12Per, CGPA, Domain, English, Quant, Logic.

DS 3: Key: Job City; Avg. Attributes: Salary, 10Per, 12Per, CGPA, Domain, English, Quant, Logic.

DS 4: Key: Job City; Avg.
Attributes: Salary, Total Jobs, Number of Males, Number of Females, Scores on Computer Programming, CSE, ECE, Mechanical, Telecom and so on.

Insights from DS 1

Figure 4 (a) shows a view of DS 1 with a circle plotted for each designation; the circle radius and the x axis show the count of test takers with a particular designation, and salary is on the y axis. Given that there were 270 unique designations, it is not surprising that the majority of the designations had counts below 25. Figure 4 (b) shows the same data filtered to show only the popular designations. As expected, software engineer, software developer and system engineer have the largest counts. Further, web developer, lecturer and customer care executive are lower on the salary scale, software engineers are in the middle, and data scientists, automation engineers and senior software developers are higher. Figure 4 (c) and (d) map salary to the x axis, designation count to the size of the circle and, on the y axis, the average 12th standard percentage and college CGPA, respectively. As is clear from the charts, each of these is positively correlated with salary.

Answer to Challenge question: Figure 5 (a) maps salary to the x axis, English score to the y axis and Domain score to the size of the circles. We observe that both English and Domain scores are low for IT support and customer care executives, while the English score is high but the Domain score is low for people in business-related roles. Also, we observe that among IT jobs, pure development jobs such as Java developer and software developer have lower English scores than engineering jobs such as software engineer, but there is little variation in their Domain scores. Thus, there is no correlation between English scores and Domain scores. Finally, in figure 5 (b) we put salary on the y axis and find no obvious correlation with Domain score, as both big and small circles are in the same salary range, but a slight positive correlation with English score.
This answers the specific question posed by the challenge about whether candidates with high English scores also have high Domain skills, and their effect on salary.

Insights from DS 2

In figure 6 (a) we plot the average salary offered to candidates (y axis) over the years (x axis) for which data was available. We observe a decreasing trend in recent years.

Insights from DS 3

In figure 6 (b) we plot the average salary for different cities in a bar chart. We observe that cities such as Bangalore, Mumbai, Gurgaon and Hyderabad have higher average salaries, while Faridabad, Bhubaneswar and Calcutta have lower average salaries.

Insights from DS 4

In figure 7 we show visualizations using bubble map charts, with data attributes mapped to the size and colour of the circle for each job city in India. In each plot, colour has been mapped to total jobs, clearly showing Bangalore having the largest number of jobs, with Noida, Hyderabad, Pune and Chennai in tow. The charts in (a) and (b) map size to the number of males and females in the city. Compared with males, there is a sharp decrease in both the number of cities where females are placed and the number of females. Next, we show the average scores on various specializations for each city. As is clear from figure 8 (a) and (b), the computer programming score is high all over India, whereas candidates with high CSE scores are mostly in the extreme northern and southern cities. Similarly, while ECE scores (c) were high for most job cities, high Telecom scores (d) were found in the extreme northern and southern cities.

5. PREDICTION FOR TEST DATA

We now describe in detail our technique for prediction using Bayesian networks.

5.1 Feature Selection

We select the top K features based on the mutual information of each feature with the target variable. Mutual information between continuous-continuous and continuous-discrete variables is calculated using the Non-parametric Entropy Estimation Toolbox (NPEET) [5].
This tool implements the mutual information estimators of [1], which are based on entropy estimates from k-nearest neighbour distances.

5.2 Bayesian Structure Learning
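As a rough illustration of the feature-selection step of section 5.1, the sketch below ranks features by their estimated mutual information with the target and keeps the top K. It is only a sketch under stated assumptions: it swaps in a simple histogram (plug-in) estimator instead of the k-nearest-neighbour estimator of [1] used by NPEET, and the function names, bin count and synthetic columns are hypothetical.

```python
import numpy as np

def mutual_information(x, y, bins=8):
    """Plug-in mutual information estimate (in nats) from a 2-D histogram.

    Assumption: this replaces the k-NN estimator of [1]; it is coarser
    but needs nothing beyond NumPy.
    """
    joint, _, _ = np.histogram2d(x, y, bins=bins)
    pxy = joint / joint.sum()                 # joint distribution over bins
    px = pxy.sum(axis=1, keepdims=True)       # marginal of x, shape (bins, 1)
    py = pxy.sum(axis=0, keepdims=True)       # marginal of y, shape (1, bins)
    mask = pxy > 0                            # skip empty cells (0 * log 0 = 0)
    return float(np.sum(pxy[mask] * np.log(pxy[mask] / (px @ py)[mask])))

def top_k_features(features, target, k):
    """Return the k feature names with the highest MI with the target."""
    scores = {name: mutual_information(col, target) for name, col in features.items()}
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:k], scores

# Synthetic stand-in data (hypothetical): salary depends on english, not on noise.
rng = np.random.default_rng(0)
n = 2000
english = rng.normal(size=n)
noise = rng.normal(size=n)
salary = 2.0 * english + 0.3 * rng.normal(size=n)

selected, scores = top_k_features({"english": english, "noise": noise}, salary, k=1)
```

On this synthetic data the informative column is ranked first; with real attributes the choice of estimator and of K would need to be validated.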

Once we have identified a subset of features based on mutual information, we learn an efficiently executable Bayesian network on these top K features, including the target variable. We call it a Minimum Spanning Tree Network (MSTN). We learn the structure of the MSTN with the following approach:

1. Take the subset of K features, which may include both continuous and discrete variables.
2. Learn the minimum spanning tree (MST) on the feature subset using pairwise mutual information as a threshold.
3. Initialize each edge to a random direction.
4. Flip each edge direction to enumerate the 2^(K-1) directed graphs (a spanning tree on K nodes has K-1 edges) and calculate the cross entropy of each graph.
5. Choose the graph with the least cross entropy.

Once we have learned the structure, we learn the CPT of each node in the network.

5.3 Predicting salary

We use the MSTN, learned on the relevant feature subset, to predict the salary of each test record using the rest of the features in the network as evidence. Figure 10 shows the MSTN learned on the feature subset. We use exact inference accelerated by an SQL engine, which internally performs query optimization analogous to many poly-tree based exact inference techniques. Apart from the MSTN, we also use a Naive Bayes network on the same feature subset to predict salary.

5.4 Results and Analysis

Table 1 shows the root mean square error (RMSE) of predictions made on the training data by 5-fold cross validation. It compares the RMSE of various approaches: Naive Bayes, MSTN, Regression Tree and Random Forest. It shows that prediction using the MSTN in the ifuse platform is better than other standard approaches such as Regression Tree (split at <10 instances) [2], Random Forest (random 5 attributes per split, 10 trees, prune at <5 instances) [3] and Naive Bayes. Table 2 shows the RMSE on the leaderboard using ifuse (MSTN network) and the best RMSE (top ranker) on the leaderboard. A visualization of the predicted salary on test data created by splitting the training data in a 70-30 (%) ratio is shown in figure 3.
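The cross-validated RMSE reported in Table 1 can be computed along the following lines. This is a minimal sketch rather than the evaluation code used with ifuse: the least-squares model is only a stand-in for the predictors compared in Table 1, averaging the per-fold RMSE values is an assumption (pooling all held-out errors is an equally common convention), and all function names are hypothetical.

```python
import numpy as np

def rmse(actual, predicted):
    """Root mean square error between two equal-length sequences."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return float(np.sqrt(np.mean((actual - predicted) ** 2)))

def k_fold_rmse(X, y, fit, predict, k=5, seed=0):
    """Mean RMSE over k folds; each fold is held out once for testing."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), k)
    errors = []
    for i in range(k):
        test = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        model = fit(X[train], y[train])
        errors.append(rmse(y[test], predict(model, X[test])))
    return float(np.mean(errors))

# Stand-in model (assumption): ordinary least squares with an intercept.
def fit_linear(X, y):
    design = np.column_stack([X, np.ones(len(X))])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef

def predict_linear(coef, X):
    return np.column_stack([X, np.ones(len(X))]) @ coef

# Synthetic example: an exactly linear target should give an RMSE near zero.
X_demo = np.linspace(0.0, 10.0, 50).reshape(-1, 1)
y_demo = 2.0 * X_demo[:, 0] + 1.0
cv_error = k_fold_rmse(X_demo, y_demo, fit_linear, predict_linear, k=5)
```

Any model exposing a fit/predict pair can be passed in the same way, which keeps the comparison between approaches on identical folds.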
The error in prediction can be visualised in figure 3 (b) on the last two axes, actual salary and imputed salary. We show the predicted salary for the test data provided by the Challenge in figure 9. We observe that salary is highly skewed, e.g. only 1% is greater than 10L, and a single model cannot handle the skew in the target variable. This is one possible reason for the high RMSE. We recommend using an ensemble of Bayesian networks trained on different salary segments. However, we did not try this technique because it is not yet implemented in the ifuse platform, and we wanted to use only the platform rather than try all possible techniques outside its capabilities.

Table 1: RMSE of five-fold cross validation on training data. Columns: Algorithm, RMSE. Rows: Random Forest; Regression Tree; ifuse-MSTN; ifuse-Naive Bayes.

Table 2: RMSE on leaderboard. Columns: ifuse, Best RMSE.

6. BAYESIAN ADVISOR

We have used the model inferencing feature of ifuse to answer the Recommendation question posed by the challenge. As shown in figure 11, a network with the target attribute salary and English score, logical test score, 12th percentage and college CGPA has been created. The original probability distributions are shown on the diagonal using bar charts with yellow bars. Salary is plotted on a log10 scale. We consider the case when a candidate is interested in getting a very high salary and wants to know the ideal profile for it. This may be done in ifuse's Linked Query View by selecting the last two bars of the salary distribution, as shown in figure 11 (a), and hitting the query button in the menu. This triggers a conditional query at the backend, which is resolved using inferencing; the conditional probability distributions of the remaining attributes are computed and displayed with red bars in the linked query view. A comparison mode is available which helps in understanding the exact changes in the distributions.
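The kind of conditional query described above can be sketched on a toy discrete network. The example below is illustrative only: the three-variable chain (12th percentage → English score → salary), the two-state bins and every probability are made-up assumptions, and inference is done by brute-force enumeration of the joint distribution rather than by the SQL-accelerated exact inference that ifuse uses.

```python
import numpy as np

# Hypothetical chain network with two bins per variable (0 = low, 1 = high).
TWELFTH, ENGLISH, SALARY = 0, 1, 2

p_twelfth = np.array([0.6, 0.4])                   # P(T)
p_english_given_twelfth = np.array([[0.7, 0.3],    # P(E | T = low)
                                    [0.4, 0.6]])   # P(E | T = high)
p_salary_given_english = np.array([[0.8, 0.2],     # P(S | E = low)
                                   [0.3, 0.7]])    # P(S | E = high)

# Product of each node's CPT given its parents: joint[t, e, s].
joint = (p_twelfth[:, None, None]
         * p_english_given_twelfth[:, :, None]
         * p_salary_given_english[None, :, :])

def conditional(joint, evidence, query_axis):
    """P(query | evidence), where evidence maps an axis to its allowed bins."""
    restricted = joint
    for axis, bins in evidence.items():
        # Keep only the selected bars of the conditioned attribute.
        restricted = np.take(restricted, bins, axis=axis)
    other_axes = tuple(a for a in range(joint.ndim) if a != query_axis)
    marginal = restricted.sum(axis=other_axes)
    return marginal / marginal.sum()

prior_english = conditional(joint, {}, ENGLISH)                 # original bars
posterior_english = conditional(joint, {SALARY: [1]}, ENGLISH)  # given high salary
```

Comparing `prior_english` with `posterior_english` mirrors the comparison mode: in this toy network, conditioning on a high salary raises the probability of the high English bin.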
We have used this mode in figure 11 and observe that there is a significant rise in the probability of the second-last bin for English score, indicating that a candidate must have a high English score to get a high salary. Additionally, we find that although the distributions of logical test score and CGPA do not change much, the probability of 12th percentage increases significantly for the higher-range bins. Thus, for a very high salary, English score and 12th percentage must be high.

Next, we consider the case when a candidate is willing to lower the salary expectation to the mid-to-high range, figure 11 (b). In this case the distribution of English score changes only slightly for the higher bins, while for 12th percentage, the probabilities of the mid-to-high bins increase significantly.

Finally, we consider the case when a candidate is interested in a mid-to-high salary but has a low English score. Such a query may be performed by selecting the appropriate bars on the distributions of both salary and English score, as shown in figure 11 (c). This causes the distribution of logical ability to shift to the middle range, while the distribution of 12th percentage shifts significantly to the higher bins. Thus, for a high salary with a low English score, one must have a good logical test score and a very good 12th standard percentage. In this manner a candidate may impose conditions on any number of the variables in the network and get answers on how his/her profile should change in order to meet the salary goal.

7. CONCLUSIONS

To conclude, we have met the objectives set out by the CoDS Data Challenge using our ifuse platform, built for visual Bayesian data fusion. We have demonstrated how the platform may be used to perform exploratory visual analysis on the raw data, gather useful insights and obtain a deeper understanding of the data. Further, we have shown how Bayesian network models are utilised in our platform for salary prediction and for providing profile recommendations.

8. REFERENCES

[1] A. Kraskov, H. Stögbauer, and P. Grassberger. Estimating mutual information. Physical Review E, 69(6):066138.
[2] R. J. Lewis. An introduction to classification and regression tree (CART) analysis. In Annual Meeting of the Society for Academic Emergency Medicine in San Francisco, California, pages 1-14.
[3] A. Liaw and M. Wiener. Classification and regression by randomForest. R News, 2(3):18-22.
[4] G. Sharma, G. Shroff, A. Pandey, B. Singh, G. Sehgal, K. Paneri, and P. Agarwal. Multi-sensor visual analytics supported by machine-learning models. In ICDM Workshop on Data Analytics meets Visual Analytics.
[5] G. Ver Steeg. Non-parametric entropy estimation toolbox (NPEET).

APPENDIX - IMAGES

Figure 1: Search page showing CoDS Data Challenge datasets

Figure 10: Minimum Spanning Tree Network Learned on Feature Subset

Figure 2: Data Tile Flipped View and Attribute Selection for Network Creation

Figure 3: Parallel Coordinates Plot for Salary Prediction using Imputation on test data created from 30% training data. (a) Before Imputation. (b) After Imputation; the last two axes visualise the error between actual and predicted salary.

Figure 4: Insights from Data set 1 using motion charts. (a) Complete Data. (b) Filtered on popular designations. (c) Positive Correlation of 12th Percentage and Salary. (d) Positive Correlation of College CGPA and Salary.

Figure 5: Answer to Challenge Question about correlation between Domain score, English score and Salary. (a) No Correlation between English and Domain score. (b) Positive correlation between Salary and English score but not with Domain score.

Figure 6: Insights from Data sets 2 and 3. (a) DS 2: Yearly average salary. (b) DS 3: City-wise Salary.

Figure 7: Comparison between number of male and female candidates and placement cities using Data set 4 and Bubble Map charts. (a) Size: Number of Males, Colour: Total Jobs. (b) Size: Number of Females, Colour: Total Jobs.

Figure 8: Insights from Data set 4 using Bubble Map charts. (a) Size: Avg. Computer Prog. Score, Colour: Total Jobs. (b) Size: Avg. CSE Score, Colour: Total Jobs. (c) Size: Avg. ECE Score, Colour: Total Jobs. (d) Size: Avg. Telecom Score, Colour: Total Jobs.

Figure 9: Salary Imputation for Test dataset provided by Challenge

Figure 11: Recommendation using Linked Query View. (a) Query for high salary. (b) Query for mid to high salary. (c) Query for high salary and low English score.