A Smart Tool to analyze the Salary trends of H1-B Workers


Akshay Poosarla, Ramya Vellore Ramesh
Under the guidance of Prof. Meiliu Lu

Abstract

Limiting H1-B visas is bad news for skilled workers in the U.S.; workers from India and many other countries are going to be hit hard. Many employers want skilled workers to stay in the U.S. and work for lower wages. We train our model on an H1-B petitions data set and classify the wages of H1-B employees. The wages of H1-B visa workers depend on multiple factors such as geographical conditions, occupational classification, job title, soc_code and many more. Our aim is to build a model that classifies the salaries of entry-level IT job positions of H1-B applicants as low, average or high by considering various factors. We compare the classification performance of Naïve Bayes, Support Vector Machines and Decision Trees. We also train a model using Multilinear Regression to predict the salaries as 1 (low), 2 (average) or 3 (high) based on certain thresholds. Our analysis shows that SVM performed better than Naïve Bayes and Decision Trees.

Index Terms: H1-B, soc_code, Occupational Classification

I. INTRODUCTION

The US H1-B visa is a non-immigrant visa that allows US companies to employ graduate-level workers in specialty occupations that require theoretical or technical expertise in specialized fields such as IT, finance, accounting, architecture, engineering, mathematics, science and medicine. For foreign workers to work in the U.S.A., an H1-B petition has to be filed. A Labor Condition Application has to be filed with the Department of Labor (DOL) as part of the H1-B process to certify that the employer will pay the sponsored H1-B employee the higher of the actual wage at the workplace or the prevailing wage in the industry. Because it is difficult to determine the actual wage at a company, the employer will usually look at the prevailing wage to determine the required salary for an H1-B employee.
According to the DOL regulations, the actual wage for a particular job at a company is the wage rate the employer pays to other employees with similar experience and qualifications who are performing the same job as the H1-B worker. The employer needs to determine whether it has other employees with the same qualifications performing the same job as the H1-B worker. If so, the wage paid to those workers is the "actual wage." If no other employees are doing the same job as the H1-B worker, then the salary offered to the H1-B worker is the actual wage. Even after calculating the actual wage, employers need to compare it to the prevailing wage. If the prevailing wage is higher than the actual wage, employers need to pay the H1-B worker the prevailing wage. The "prevailing wage" is either the applicable wage under a collective bargaining agreement or, if there is no union, the average wage paid to workers in a particular occupation in a specific geographic location.

We analyze the factors on which the wages of H1-B employees depend, such as job position, employer and work location. We predict the wages of H1-B employees as high, average or low by building models with machine learning algorithms.

The paper is structured as follows. Section 2 presents the literature survey. Section 3 introduces the data collection. Section 4 describes the data preprocessing techniques. Section 5 gives interesting data insights. Section 6 presents the machine learning algorithms applied for the model. Section 7 discusses the limitations we encountered in R. Section 8 presents the comparison of the 4 models.
Section 9 concludes the paper.

II. LITERATURE SURVEY

Text analysis to predict H1-B wages for the year 2012 is described in [1]. Decision trees and a sunburst view were used to determine the correlation between job_title, employer_name and many other job attributes in order to predict the wages of H1-B employees by analyzing the text.

In a Kaggle competition on predicting job salaries from a provided dataset, one of the participants used the absolute error and the mean squared error to predict the absolute value of the salaries. The error was determined by how far each prediction missed the actual value. Four different models were built to predict the salaries, each considering different variables [2].

III. DATA COLLECTION

We collected the H1-B petition dataset from the enigma.io website through REST API calls. The dataset contains 647,852


observations with 41 variables. There are more than 12,000 different employers and 10,000 unique job positions, ranging from accounts manager to web developer. Fig. 1 shows all 41 columns.

Fig 1. Columns in the Data Set before Preprocessing

IV. DATA PREPROCESSING

We removed NA and blank values from the dataset. Since there are more than 10,000 different job positions, we restricted the scope to predicting and analyzing the salary trends of entry-level job positions in the IT sector. We considered 17 entry-level job positions for the salary prediction.

A. Handling of Categorical Values

To train the model with some of the machine learning algorithms, some of the categorical values need to be converted to numerical values. To convert the categorical values to binary indicators we used Python pandas.

B. Handling of Missing Values

The data is not very clean, and in many places a value in a column is missing. The first method we tried was to replace missing values in a numerical column with the mean of that column, and to fill missing values of categorical variables with the mode of the column [4]. When we later trained the model with this approach, its predictions were not accurate, so we removed the rows with missing values instead.

C. Data Consistency

To train any model, the data should be consistent across the data set, but the data from the H1-B petition set is raw and inconsistent. We made the data consistent manually. For example, in the employer_state column the state New York is given as NY, New York and NewYork; we manually changed these to the single form New York. Similar operations were performed for all other states wherever the names showed discrepancies. In the job_title column there are job titles that are the same but are named differently by different employers, such as SYSTEM ENGINEER and SYSTEMS ENGINEER. We changed SYSTEMS ENGINEER to SYSTEM ENGINEER, and changed COMPUTER INFORMATION SYSTEM MANAGER, COMPUTER SYSTEMS MANAGER, COMPUTER AND INFORMATION SYSTEM MANAGER and COMPUTER INFORMATION SYSTEMS MANAGER to COMPUTER AND INFORMATION SYSTEMS MANAGER. As there is no built-in method to identify this kind of inconsistency, we made the data consistent manually.

D. Removal of Outliers

The highest prevailing wage is [...], and most of the prevailing wages were in the range of 16,000 to 400,000, so we removed the outliers before building the model. Outliers are removed by the normalization method [5], for which we computed the quartiles of the data along with the median (50th percentile). The interquartile range (IQR) is the difference between Q3 and Q1, where Q1 is the 25th percentile and Q3 is the 75th percentile. Values above Q3 + 1.5 IQR or below Q1 - 1.5 IQR are considered outliers.

E. Feature Selection

Fig 2. Feature Selection [6]

Feature selection is one of the important steps in data preprocessing. Given m independent variables, it selects n <= m independent variables that play a major role in predicting the output class. To select the important columns out of the 41 columns for predicting the prevailing wage, we used the Boruta [7] feature selection package. This method is based on Random Forest and runs for a maximum of 100 iterations, deciding in each iteration whether a column is important or not. Interestingly, as our dataset has 41 columns, the Boruta package ran for all 100 iterations and classified 7 columns as important, 5 as average, and the remaining columns as not important in predicting the prevailing wage of the employee. We obtained 7 important


features such as job_title, employer_name, employer_state, agent_attorney_name, agent_attorney_state and soc_code, which were considered for predicting the prevailing wage. All the other variables were rejected by Boruta.

Fig 3. Importance of columns after Boruta Package

V. DATA INSIGHTS

A company is termed an H1-B dependent company if at least fifteen percent of its total employees are foreign workers; otherwise it is not H1-B dependent. Fig. 4 is a graphical comparison of the number of petitions filed by companies against whether the company is H1-B dependent or not. N represents a company that is not H1-B dependent, while Y represents a company that is H1-B dependent. Even though H1-B dependent companies file the higher number of applications, they make up just 10 percent of the total companies. This shows how strongly the applications filed by H1-B dependent companies dominate.

Fig 4. Distribution between H1-B dependent and non H1-B dependent companies

Fig. 5 shows the top 20 companies by number of applications filed. Infosys, an India-based IT firm, tops the list, followed by Capgemini and Tata Consultancy Services Limited. The Big 4 companies for computer science are also in the list: Microsoft takes the top position among these four with 5029 applications, followed by Google with 4785 applications and Amazon with 2547 applications. Interestingly, Facebook is not in the list.

Fig 5. Top 20 Companies with H1-B petitions

Before the H1-B petition is filed, the labor condition application should be filed with the Department of Labor. All applications with the Department of Labor are classified into four different types:

CERTIFIED: The application is certified by the Department of Labor.
DENIED: The application is denied by the Department of Labor.
CERTIFIED_WITHDRAWN: The application is withdrawn by the employer after it is certified by the Department of Labor.
WITHDRAWN: The application is withdrawn by the employer before the Department of Labor takes a decision on it.
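
The summaries in this section (share of applications per status, employers ranked by number of petitions) are plain frequency counts. A minimal pandas sketch of how such counts could be produced is given here; it is not the authors' code. The column name case_status is an assumption (the paper does not name the status column), employer_name is taken from the paper, and the toy rows are invented purely for illustration.

```python
import pandas as pd


def status_shares(df, status_col="case_status"):
    """Return the percentage of applications per status, largest share first."""
    return (df[status_col].value_counts(normalize=True) * 100).round(1)


def top_employers(df, n=20, employer_col="employer_name"):
    """Return the n employers with the most filed applications."""
    return df[employer_col].value_counts().head(n)


# Toy petitions table (invented rows; the real dataset has 41 columns).
petitions = pd.DataFrame(
    {
        "employer_name": ["A", "A", "A", "B", "B", "C"],
        "case_status": [
            "CERTIFIED", "CERTIFIED", "DENIED",
            "CERTIFIED", "WITHDRAWN", "CERTIFIED",
        ],
    }
)

print(status_shares(petitions))
print(top_employers(petitions, n=2))
```

Both helpers rely on value_counts, which already sorts by frequency, so the top-20 list is simply the first 20 entries of the result.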


From Fig. 6 we can say that about 85 percent of the total applications filed are certified by the Department of Labor; the remaining applications are split among the other statuses in the proportions shown in the graph.

Fig 6. Distribution between Applications

Fig 7. Salary Distribution of H1-B Employees

From Fig. 7 we can say that most of the salaries of H1-B workers are in the range of 60,000 to 90,000, which is class 2. Class 1 in the graph represents the number of employees with a salary less than 60,000, and class 3 represents the employees with a salary greater than 90,000.

VI. APPLIED MODELS

A. Naïve Bayes Classifier

We removed the outliers and divided the prevailing wage into nine classes, ranging from 16,000 to 130,000. The width of each class was calculated using normalization, i.e. we used the mean and standard deviation to calculate the width of each class, and we wanted to know into which of the nine classes each prevailing wage fell. We built the Naïve Bayes classifier model, but the accuracy was around 40%, which is very low. We then divided the prevailing wage into three classes and trained a Naïve Bayes classifier on the data, but the accuracy was still only around 50%.

Finally we applied the Naïve Bayes classifier with one against many classes [8], where we divided the prevailing wage into three classes: 1 (low) for wages below 60,000, 2 (average) for wages between 60,000 and 90,000, and 3 (high) for wages above 90,000. We trained the model with one against many, i.e. wages below 60,000 against wages between 60,000 and 90,000, and wages between 60,000 and 90,000 against wages above 90,000. With this one-against-many Naïve Bayes classifier we obtained an accuracy of around 83 percent.

B. Multilinear Regression

To train the model using multilinear regression [9], all categorical values need to be changed to factors with labels assigned to them. Employer name is a categorical value in our dataset with more than 12,000 unique values, so we were not able to assign labels for each of the 12,000 employer names: R could not allocate a vector of 9.3 GB when we tried to train on the dataset using multilinear regression. We therefore considered converting the categorical values to binary using Python pandas [10] before training with multilinear regression. But when we tried to convert the employer names into binary, the CSV format of the dataset got corrupted, because the conversion was creating a 12,000 x 12,000 square matrix. Given a set of n distinct values as input, the pandas functionality produces a sparse square matrix of size n x n, and there are around 12,000 different employer names. Our remaining option was random sampling. We took a random sample of the data and trained the model using multilinear regression; the R-squared value was found to be 0.86. The higher the R-squared value, the better the model.

C. Support Vector Machines

A support vector machine is a classification algorithm that separates two classes. As our target in this model has three classes, we trained the model using one against many classes. Random sampling of the data is done with the caret package [11] in R, and each sample taken is divided into a training set and a test set in the ratio 80 to 20. The one difficulty we encountered after the random sampling was the error "class not found": as we have around 12,000 different employer names, the test set contained employers that are not present in the training set. So even after random sampling we iterated over the test set and removed the values that are present only in the test set. This method helped us overcome the problem of new levels present in the test set alone. We trained the model using the one-against-many support vector machines approach [12] and obtained an accuracy of 95.84%.

D. Decision Trees

We trained the model using the decision tree [13] machine learning algorithm and obtained an accuracy of 94.94%. When we tried to plot the decision tree, predictor variables with more than 52 levels were not printed. This is a limitation in R. Even


when we tried to plot the decision tree and the tree was visualized, we could not decode the rules corresponding to it. The decision tree printed in R is shown in Fig. 8.

Fig 8. Decision Tree Plot

E. Text Based Analysis

BigML [14] is a machine learning website where users can create a login and upload a data set of their interest. Once the dataset is uploaded, it provides different ways to create a data set by selecting the required columns from the original dataset, and it gives the option to use the complete data or a random sample. Once the data set is created, we can analyze different columns based on the visualizations generated (see Fig. 9).

Fig 9. Column Visualization in BigML

We can train our data set with different models, and results are predicted once the model is trained. We used a text-based analysis model to analyze and predict the salaries of H1-B employees. The trained model can be viewed, and rules can be decoded from it. For example, we can look up the wage corresponding to a software engineer at Facebook, who earns an average of 104k. The tree generated by the model is very UI friendly: you can zoom in on a particular node and get the rule corresponding to that node.

Fig 10. Data path Visualization in BigML

VII. LIMITATIONS OF R

For a large dataset, converting categorical values into numeric values was a big problem, because labels have to be assigned to each of the factors. Assigning labels to a categorical variable that has 12,000 levels is a tedious process.

We cannot train on the dataset using random forest in R if the dataset contains categorical variables with more than 32 levels; it cannot handle categorical predictors with more than 32 categories.

When we plotted the decision tree, predictor variables with more than 52 levels were not printed, and we could not interpret the rules of the decision tree. Visualization is a limitation in R.

It is very difficult to connect the model to a front end, get input from the user and pass it to the trained model, whereas this is simple in Python.
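
The difficulties described in this paper with high-cardinality categorical columns (around 12,000 employer names, a dense indicator matrix that does not fit in memory, and test rows containing employers absent from the training data) can be sidestepped with sparse one-hot encoding. The sketch below is a hedged illustration, not the authors' implementation: it swaps in scikit-learn's OneHotEncoder and LinearSVC for the R packages used in the paper, assumes the wage column is called prevailing_wage, treats wages of exactly 60,000 and 90,000 as belonging to the higher class, and uses invented toy rows.

```python
import numpy as np
import pandas as pd
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.svm import LinearSVC

FEATURES = ["job_title", "employer_name", "employer_state"]  # named in the paper
WAGE = "prevailing_wage"  # assumed column name


def remove_iqr_outliers(df, column=WAGE):
    """Keep rows whose value lies within [Q1 - 1.5*IQR, Q3 + 1.5*IQR]."""
    q1, q3 = df[column].quantile([0.25, 0.75])
    iqr = q3 - q1
    return df[df[column].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)]


def wage_class(wages):
    """Map wages to 1 (low), 2 (average) or 3 (high) with cut-offs 60,000 and 90,000."""
    return np.digitize(wages, [60_000, 90_000]) + 1


def build_classifier():
    """Sparse one-hot encoding followed by a linear SVM.

    OneHotEncoder returns a sparse matrix, so 12,000 employer names do not
    become a dense 12,000-column table; handle_unknown="ignore" encodes
    categories unseen during training as all zeros instead of raising.
    LinearSVC trains one-vs-rest classifiers for the three wage classes.
    """
    return make_pipeline(OneHotEncoder(handle_unknown="ignore"), LinearSVC())


# Toy data, invented for illustration only.
toy = pd.DataFrame(
    {
        "job_title": ["WEB DEVELOPER", "WEB DEVELOPER", "WEB DEVELOPER",
                      "SYSTEM ENGINEER", "SYSTEM ENGINEER", "SYSTEM ENGINEER",
                      "SOFTWARE ENGINEER", "SOFTWARE ENGINEER", "SOFTWARE ENGINEER"],
        "employer_name": ["A", "B", "A", "B", "C", "C", "D", "D", "D"],
        "employer_state": ["NY", "NY", "TX", "TX", "CA", "NY", "CA", "CA", "CA"],
        WAGE: [50_000, 55_000, 58_000, 70_000, 72_000, 75_000,
               95_000, 100_000, 900_000],
    }
)

clean = remove_iqr_outliers(toy)  # drops the 900,000 row
model = build_classifier().fit(clean[FEATURES], wage_class(clean[WAGE]))

# An employer and state never seen in training still get a prediction.
unseen = pd.DataFrame(
    {"job_title": ["WEB DEVELOPER"], "employer_name": ["NEW CO"], "employer_state": ["WA"]}
)
prediction = model.predict(unseen)
```

With this design no test rows need to be discarded for containing new levels, although a row whose categories are all unseen carries no information and is classified from the intercepts alone.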

VIII. RESULTS

Model                                        Result
Naïve Bayes Classifier (one against many)    Accuracy: 83%
Support Vector Machines (random sampling)    Accuracy: 95.84%
Decision Trees                               Accuracy: 94.94%
Multilinear Regression (random sampling)     R-squared error: 0.56

IX. CONCLUSION

As per our decision tree text analysis method, California is the state with the highest average wage. The most important factor in predicting the wage of an employee is the job title, followed by the location and then the employer name. When the target variable has multiple classes, Naïve Bayes with one against many classes gave better results than the plain Naïve Bayes method. Decision trees give very good results, but with more factors and levels it is difficult to decode the rules. Text analysis worked well for this data set, as we can infer more results and rules from the data.

X. REFERENCES

[1]
[2] 2.pdf
[4] P. Khongchai and P. Songmuang, "Improving students' motivation to study using salary prediction system," th International Joint Conference on Computer Science and Software Engineering (JCSSE), Khon Kaen, 2016, pp.
[5] Z. J. Kovacic, "Early Prediction of Student Success: Mining Students Enrolment Data," Proceedings of Informing Science & IT Education Conference (InSITE),
[6] G. Forman, "An Extensive Empirical Study of Feature Selection Metrics for Text Classification," Journal of Machine Learning Research, vol. 3, pp. ,
[7]
[8] S. Rana and A. Singh, "Comparative analysis of sentiment orientation using SVM and Naive Bayes techniques," nd International Conference on Next Generation Computing Technologies (NGCT), Dehradun, 2016, pp.
[9] otes/401-multreg.pdf
[10]
[11] t.pdf
[12] I. Dilrukshi and K. De Zoysa, "Twitter news classification: Theoretical and practical comparison of SVM against Naive Bayes algorithms," 2013 International Conference on Advances in ICT for Emerging Regions (ICTer), Colombo, 2013, pp.
[13] J. R. Quinlan, "Induction of Decision Trees," Boston: Kluwer Academic Publishers, vol. 1, pp. ,
[14]
