A Smart Tool to analyze the Salary trends of H1-B Workers
Akshay Poosarla, Ramya Vellore Ramesh
Under the guidance of Prof. Meiliu Lu

Abstract
Limiting H-1B visas is bad news for skilled workers in the U.S. Workers from India and many other foreign countries are going to be hit hard. Many employers want skilled workers to stay in the U.S. and work for lower wages. We train our model on an H-1B petitions data set and classify the wages of H-1B employees. The wages of H-1B visa workers depend on multiple factors such as geographical conditions, occupational classification, job title, soc_code and many more. Our aim is to build a model that classifies the salaries of entry-level IT-sector job positions of H-1B applicants as low, average or high by considering various factors. We compare the classification performance of Naïve Bayes, Support Vector Machines and Decision Trees. We also train a model using multilinear regression to predict the salaries as 1 (low), 2 (average) or 3 (high) based on certain thresholds. Our analysis shows that SVM performed better than Naïve Bayes and Decision Trees.

Index Terms: H-1B, soc_code, Occupational Classification

I. INTRODUCTION
The US H-1B visa is a non-immigrant visa that allows US companies to employ graduate-level workers in specialty occupations that require theoretical or technical expertise in specialized fields such as IT, finance, accounting, architecture, engineering, mathematics, science and medicine. For foreign workers to work in the U.S., an H-1B petition has to be filed. A Labor Condition Application has to be filed with the Department of Labor (DOL) as part of the H-1B process to certify that the employer will pay the sponsored H-1B employee the higher of the actual wage at the workplace or the prevailing wage in the industry. Because it is difficult to determine the actual wage at a company, the employer will usually look at the prevailing wage to determine the required salary for an H-1B employee.
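The wage rule stated above reduces to taking the larger of two numbers. A minimal sketch, assuming both wages are already known as annual figures; the function name `required_wage` and the sample amounts are hypothetical and serve only as illustration:

```python
def required_wage(actual_wage: float, prevailing_wage: float) -> float:
    """Return the wage the employer must pay: the higher of the two inputs."""
    if actual_wage < 0 or prevailing_wage < 0:
        raise ValueError("wages must be non-negative")
    return max(actual_wage, prevailing_wage)


# Hypothetical figures, not taken from the petition data.
offer = required_wage(actual_wage=72000.0, prevailing_wage=78500.0)
print(offer)
```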
According to the DOL regulations, the actual wage for a particular job at a company is the wage rate the employer pays to other employees with similar experience and qualifications who are performing the same job as the H-1B worker. The employer needs to determine whether it has other employees with the same qualifications performing the same job as the H-1B worker. If so, the wage paid to those workers is the "actual wage." If no other employees are doing the same job as the H-1B worker, then the salary offered to the H-1B worker is the actual wage. Even after calculating the actual wage, employers need to compare it to the prevailing wage. If the prevailing wage is higher than the actual wage, employers need to pay the H-1B worker the prevailing wage. The "prevailing wage" is either the applicable wage under a collective bargaining agreement or, if there is no union, the average wage paid to workers in a particular occupation in a specific geographic location.

We predict and analyze the factors, such as job position, employer and work location, on which the wages of H-1B employees depend. We predict the wages of H-1B employees as high, average or low by building models with machine learning algorithms.

The paper is structured as follows: Section 2 presents the literature survey, Section 3 introduces the data collection, Section 4 discusses the data preprocessing techniques, Section 5 gives interesting data insights, Section 6 presents the machine learning algorithms applied for the model, Section 7 discusses the limitations of R, Section 8 presents the results comparing the four models, and Section 9 concludes.

II. LITERATURE SURVEY
Text analysis to predict H-1B wages for the year 2012 is described in [1]. Decision trees and a sunburst view were used to determine the correlation between job_title, employer_name and many other job attributes, predicting the wages of H-1B employees by analyzing the text. In a Kaggle competition on predicting job salaries, one participant used the absolute error and the mean squared error to evaluate predictions of the absolute salary value; the error was determined by how far each prediction missed the actual value. Four different models were built, each considering different variables [2].

III. DATA COLLECTION
We collected the H-1B petition dataset from the enigma.io website through REST API calls. The dataset contains 647,852
observations with 41 variables. There are more than 12,000 different employers and 10,000 unique job positions, ranging from accounts manager to web developer. The image below shows all 41 columns.

Fig 1. Columns in the Data Set before Preprocessing

IV. DATA PREPROCESSING
We removed NA and blank values from the dataset. Since there are more than 10,000 different job positions, we restricted the scope to predicting and analyzing the salary trends of entry-level job positions in the IT sector. We considered 17 entry-level job positions for the salary prediction.

A. Handling of Categorical Values
To train the model with some machine learning algorithms, some of the categorical values need to be converted to numerical values. To convert the categorical values to binary we used Python pandas.

B. Handling of Missing Values
As the data is not very clean, there are many places where a value in a column is missing. The first method we tried was to replace missing values in numerical columns with the column mean and to fill missing values in categorical columns with the column mode [4]. When we later trained the model with this approach, its predictions were not accurate, so we removed the rows with missing values.

C. Data Consistency
To train any model, the data should be consistent across the data set, but the data from the H-1B petition set is raw and not consistent. We made the data consistent manually. For example, in the employer_state column the state New York is given as NY, New York and NewYork; we manually changed these to the single form New York. Similar operations were performed for all the other states wherever the names showed discrepancies. In the job_title column there are job titles that are the same but are named differently by different employers, such as SYSTEM ENGINEER and SYSTEMS ENGINEER. We changed SYSTEMS ENGINEER to SYSTEM ENGINEER, and changed COMPUTER INFORMATION SYSTEM MANAGER, COMPUTER SYSTEMS MANAGER, COMPUTER AND INFORMATION SYSTEM MANAGER and COMPUTER INFORMATION SYSTEMS MANAGER to COMPUTER AND INFORMATION SYSTEMS MANAGER. As there is no built-in method to identify these kinds of data inconsistencies, we tried to make the data consistent manually.

D. Removal of Outliers
The highest prevailing wage is [value missing], and most prevailing wages were in the range of 16,000 to 400,000, so we removed the outliers before building the model. Outliers were removed with the normalization method [5]: we computed the quartiles of the data along with the median (50th percentile). The interquartile range (IQR) is the difference between Q3 and Q1, where Q1 is the 25th percentile and Q3 is the 75th percentile. Values above Q3 + 1.5 IQR or below Q1 - 1.5 IQR were considered outliers.

E. Feature Selection

Fig 2. Feature Selection [6]

Feature selection is an important step in data preprocessing. Given m independent variables, it selects n <= m independent variables that play a major role in predicting the output class. To select the important columns out of the 41 for predicting the prevailing wage, we used the Boruta [7] feature selection package. This method is based on Random Forest with a maximum of 100 iterations, and in each iteration it decides whether a column is important or not. Interestingly, as our dataset has 41 columns, the Boruta package ran for all 100 iterations and classified 7 columns as important and 5 as average, with the remaining columns not important in predicting the prevailing wage of the employee. We obtained 7 important features such as job_title, employer_name, employer_state, agent_attorney_name, agent_attorney_state and soc_code, which were considered for predicting the prevailing wage. All the other variables were rejected by Boruta.

Fig 3. Importance of columns after Boruta Package

V. DATA INSIGHTS
A company is termed H-1B dependent if at least fifteen percent of its total employees are foreign workers, and not H-1B dependent otherwise. Fig 4 is a graphical comparison of the number of petitions filed against whether the company is H-1B dependent: N represents a company that is not H-1B dependent, while Y represents a company that is. Even though applications by H-1B dependent companies are higher in number, H-1B dependent companies are just 10 percent of all companies. This shows how strongly H-1B dependent companies dominate the number of applications filed.

Fig 4. Distribution between H-1B dependent and Non H-1B dependent companies

Figure 5 shows the top 20 companies that filed the most applications. Infosys, an India-based IT firm, tops the list, followed by Capgemini and Tata Consultancy Services Limited. The Big 4 companies for computer science are also in the list; among these four, Microsoft takes the top position with 5029 applications, followed by Google with 4785 applications and Amazon with 2547 applications. Interestingly, Facebook is not in the list.

Fig 5. Top 20 Companies with H-1B petitions

Before an application is filed for H-1B, the Labor Condition Application should be filed with the Department of Labor. All applications with the Department of Labor are classified into four types:
CERTIFIED: The application is certified by the Department of Labor.
DENIED: The application is denied by the Department of Labor.
CERTIFIED_WITHDRAWN: The application is withdrawn by the employer after it is certified by the Department of Labor.
WITHDRAWN: The application is withdrawn by the employer before the Department of Labor takes a decision on it.
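As a rough illustration of how the four statuses above might be tallied into shares of all applications, the following sketch counts status values and divides by the total. The sample list and the `status_shares` helper are assumptions made for illustration; they are not the petition data or the authors' code.

```python
from collections import Counter
from typing import Dict, Iterable


def status_shares(statuses: Iterable[str]) -> Dict[str, float]:
    """Return the fraction of applications that fall under each status."""
    counts = Counter(statuses)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {status: count / total for status, count in counts.items()}


# Hypothetical status values, made up for illustration (not the real data).
sample_statuses = [
    "CERTIFIED", "CERTIFIED", "DENIED", "CERTIFIED",
    "WITHDRAWN", "CERTIFIED_WITHDRAWN", "CERTIFIED", "CERTIFIED",
]

shares = status_shares(sample_statuses)
# Print the statuses from most to least frequent.
for status, share in sorted(shares.items(), key=lambda item: -item[1]):
    print(f"{status}: {share:.1%}")
```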
From Figure 6 we can say that about 85 percent of the total applications filed are certified by the Department of Labor; the remaining applications are split among the other statuses in the ratio shown in the graph.

Fig 6. Distribution between Applications

From Figure 7 we can say that most of the salaries of H-1B workers are in the range of 60,000 to 90,000, which is class 2. Class 1 in the graph represents the number of employees with a salary less than 60,000, and class 3 represents the employees with a salary greater than 90,000.

Fig 7. Salary Distribution of H-1B Employees

VI. APPLIED MODELS

A. Naïve Bayes Classifier
We removed the outliers and divided the prevailing wage into nine classes, starting from 16,000 and going up to 130,000. The width of each class was calculated using normalization, i.e. we used the mean and standard deviation to calculate the width of each class, and we wanted to know into which of the nine classes each prevailing wage fell. We built the Naïve Bayes classifier model, but the accuracy was only around 40%, which is very low. So we divided the prevailing wage into three classes and trained a Naïve Bayes classifier on the data, but the accuracy was still only around 50%.

Finally, we applied a Naïve Bayes classifier for one against many classes [8], where we divided the prevailing wage into three classes: 1 (low) for wages below 60,000, 2 (average) for wages between 60,000 and 90,000, and 3 (high) for wages above 90,000. We trained the model with one against many, i.e. wages below 60,000 against wages between 60,000 and 90,000, and wages between 60,000 and 90,000 against wages above 90,000. By building this one-against-many Naïve Bayes classifier we obtained an accuracy of around 83 percent.

B. Multilinear Regression
To train the model using multilinear regression [9], all the categorical values need to be changed to factors and assigned labels. Since the employer name is a categorical value in our dataset and there are more than 12,000 unique employer names, we were not able to assign labels for each of them; R could not allocate a vector of 9.3 GB when we tried to train on the dataset using multilinear regression. So we thought of converting the categorical values to binary using Python pandas [10] before training with multilinear regression. But when we tried to convert the employer names into binary, the CSV file of the dataset got corrupted, as the conversion was creating a 12,000 x 12,000 square matrix. Basically, when a set of n distinct values is passed as input, pandas returns a sparse square matrix of size n x n. As there are around 12,000 different employer names, we finally had the option of doing random sampling. We did a random sampling of the data and trained the model using multilinear regression, and the R-squared value was found to be 0.86. The higher the R-squared value, the better the model.

C. Support Vector Machines
A support vector machine is a classification algorithm that separates two classes. As our target variable has three classes, we trained our model using one against many classes. Random sampling of the data was done with the caret package [11] in R, and each sample was divided into a training set and a testing set in the ratio 80:20. One difficulty we encountered after random sampling was a "class not found" error: as we have around 12,000 different employer names, the test set contained employers that were not present in the training set. So after random sampling we iterated over the test set and removed the values that were present only in the test set. This helped us overcome the problem of new levels being present in the test set alone. We trained the model using the one-against-many support vector machine approach [12] and obtained an accuracy of 95.84%.

D. Decision Trees
We trained the model using the decision tree [13] machine learning algorithm and obtained an accuracy of 94.94%. When we tried to plot the decision tree, predictor variables with more than 52 levels were not printed. This is a limitation in R. Even when the tree was visualized, we could not decode the rules corresponding to it. The decision tree printed in R is shown in Fig 8.

Fig 8. Decision Tree Plot

E. Text Based Analysis
BigML [14] is a machine learning website where a user can create a login and upload a data set of interest. Once the dataset is uploaded, it provides different ways to create a data set by selecting the required columns from the original dataset, and gives the option to use the complete data or to do random sampling. Once the data set is created, we can analyze different columns based on the generated visualizations (see Fig 9).

Fig 9. Column Visualization in BigML

We can train our data set with different models, and results are predicted once the model is trained. We used a text based analysis model to analyze and predict the salaries of H-1B employees. The trained model can be viewed and rules can be decoded from it. For example, we can look up the wage corresponding to a software engineer at Facebook, who earns an average of 104k. The tree generated by the model is very UI friendly: you can zoom in on a particular node and get the rule corresponding to that node.

Fig 10. Data path Visualization in BigML

VII. LIMITATIONS OF R
For a large dataset, converting categorical values into numeric ones was a big problem, because we have to assign labels for each of the factors. Assigning labels to a categorical variable that has 12,000 levels is a tedious process.
We cannot train on the dataset using random forest in R if it contains categorical variables with more than 32 levels; it cannot handle categorical predictors with more than 32 categories.
When the decision tree was plotted, predictor variables with more than 52 levels were not printed, and we could not interpret the rules of the decision tree.
Visualization is a limitation in R.
It is very difficult to connect the model to a front end, take input from the user and pass it to the trained model, whereas this is simple in Python.
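Two problems reported in this paper, the very large indicator matrix for roughly 12,000 employer names and the test rows whose employers never appeared in training, can both be sidestepped by storing only the column index of each row's single 1. The sketch below is an illustration under assumptions: the employer names are made up, the helper names (`fit_levels`, `encode`, `to_dense`) are hypothetical, and it is plain Python rather than the pandas or R code the authors used.

```python
from typing import Dict, Iterable, List, Optional


def fit_levels(values: Iterable[str]) -> Dict[str, int]:
    """Map each distinct level to a column index, in first-seen order."""
    levels: Dict[str, int] = {}
    for value in values:
        levels.setdefault(value, len(levels))
    return levels


def encode(values: Iterable[str], levels: Dict[str, int]) -> List[Optional[int]]:
    """Sparse one-hot: keep only the index of the single 1 in each row.

    Levels that were not seen when fitting are encoded as None.
    """
    return [levels.get(value) for value in values]


def to_dense(index: Optional[int], width: int) -> List[int]:
    """Expand one sparse row into a full indicator vector (inspection only)."""
    row = [0] * width
    if index is not None:
        row[index] = 1
    return row


# Hypothetical employer names, not taken from the petition data.
train_employers = ["ACME CORP", "GLOBEX", "ACME CORP", "INITECH"]
test_employers = ["GLOBEX", "UMBRELLA", "INITECH"]

levels = fit_levels(train_employers)
train_codes = encode(train_employers, levels)
test_codes = encode(test_employers, levels)

# Drop test rows whose employer never appeared in training, mirroring the
# filtering step the text describes for the SVM experiment.
kept_test = [
    (name, code)
    for name, code in zip(test_employers, test_codes)
    if code is not None
]
print(kept_test)
```

Storing one integer per row keeps memory proportional to the number of rows rather than rows times levels; `to_dense` is only meant for looking at a single row.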
VIII. RESULTS

Model                                        Accuracy
Naïve Bayes Classifier (one against many)    83%
Support Vector Machines (random sampling)    95.84%
Decision Trees                               94.94%
Multilinear Regression (random sampling)     R-squared error: 0.56

IX. CONCLUSION
As per our decision tree text analysis method, California is the state with the highest average wage, and the most important factor in predicting the wage of an employee is job title, followed by location and then employer name. When the target variable has multiple classes, Naïve Bayes with one against many classes gave better results in our experiments than the plain Naïve Bayes method. The decision tree gives very good results, but with more factors and levels it is difficult to decode the rules. Text analysis worked well for this data set, as we could infer more results and rules from the data.

X. REFERENCES
[1] [...]
[2] [...] 2.pdf
[4] P. Khongchai and P. Songmuang, "Improving students' motivation to study using salary prediction system," [...]th International Joint Conference on Computer Science and Software Engineering (JCSSE), Khon Kaen, 2016, pp. [...]
[5] Z. J. Kovacic, "Early Prediction of Student Success: Mining Students Enrolment Data," Proceedings of Informing Science & IT Education Conference (InSITE), [...]
[6] G. Forman, "An Extensive Empirical Study of Feature Selection Metrics for Text Classification," Journal of Machine Learning Research, vol. 3, pp. [...]
[7] [...]
[8] S. Rana and A. Singh, "Comparative analysis of sentiment orientation using SVM and Naive Bayes techniques," [...]nd International Conference on Next Generation Computing Technologies (NGCT), Dehradun, 2016, pp. [...]
[9] [...] otes/401-multreg.pdf
[10] [...]
[11] [...] t.pdf
[12] I. Dilrukshi and K. De Zoysa, "Twitter news classification: Theoretical and practical comparison of SVM against Naive Bayes algorithms," 2013 International Conference on Advances in ICT for Emerging Regions (ICTer), Colombo, 2013, pp. [...]
[13] J. R. Quinlan, "Induction of Decision Trees," Boston: Kluwer Academic Publishers, vol. 1, pp. [...]
[14] [...]
More informationWatts App: An Energy Analytics and Demand-Response Advisor Tool
Watts App: An Energy Analytics and Demand-Response Advisor Tool Santiago Gonzalez, Case Western Reserve University, Electrical Engineering, SUNFEST Fellow Dr. Rahul Mangharam, Electrical and Systems Engineering
More informationASSIGNMENT SUBMISSION FORM
ASSIGNMENT SUBMISSION FORM Treat this as the first page of your assignment Course Name: Assignment Title: Business Analytics using Data Mining Crowdanalytix - Predicting Churn/Non-Churn Status of a Consumer
More informationData Analysis Boot Camp
Data Analysis Boot Camp DATA200; 3 Days, Instructor-led Course Description Today's organizations face both a promise and a dilemma. The growth in availability and quantity of data, as well as the tools
More informationReal Data Analysis at PNC
Real Data Analysis at PNC Zhifeng Wang Department of Statistics Florida State University PNC Bank PNC is a Pittsburgh-based financial services corporation. I worked in Marketing Department = Decision,
More informationBrian Macdonald Big Data & Analytics Specialist - Oracle
Brian Macdonald Big Data & Analytics Specialist - Oracle Improving Predictive Model Development Time with R and Oracle Big Data Discovery brian.macdonald@oracle.com Copyright 2015, Oracle and/or its affiliates.
More informationFinal Project Report CS224W Fall 2015 Afshin Babveyh Sadegh Ebrahimi
Final Project Report CS224W Fall 2015 Afshin Babveyh Sadegh Ebrahimi Introduction Bitcoin is a form of crypto currency introduced by Satoshi Nakamoto in 2009. Even though it only received interest from
More informationPredictive Modelling for Customer Targeting A Banking Example
Predictive Modelling for Customer Targeting A Banking Example Pedro Ecija Serrano 11 September 2017 Customer Targeting What is it? Why should I care? How do I do it? 11 September 2017 2 What Is Customer
More informationHow hot will it get? Modeling scientific discourse about literature
How hot will it get? Modeling scientific discourse about literature Project Aims Natalie Telis, CS229 ntelis@stanford.edu Many metrics exist to provide heuristics for quality of scientific literature,
More informationConclusions and Future Work
Chapter 9 Conclusions and Future Work Having done the exhaustive study of recommender systems belonging to various domains, stock market prediction systems, social resource recommender, tag recommender
More informationA Survey on Recommendation Techniques in E-Commerce
A Survey on Recommendation Techniques in E-Commerce Namitha Ann Regi Post-Graduate Student Department of Computer Science and Engineering Karunya University, India P. Rebecca Sandra Assistant Professor
More informationSunnie Chung. Cleveland State University
Sunnie Chung Cleveland State University Data Scientist Big Data Processing Data Mining 2 INTERSECT of Computer Scientists and Statisticians with Knowledge of Data Mining AND Big data Processing Skills:
More informationAP Statistics Part 1 Review Test 2
Count Name AP Statistics Part 1 Review Test 2 1. You have a set of data that you suspect came from a normal distribution. In order to assess normality, you construct a normal probability plot. Which of
More informationKnowledgeENTERPRISE FAST TRACK YOUR ACCESS TO BIG DATA WITH ANGOSS ADVANCED ANALYTICS ON SPARK. Advanced Analytics on Spark BROCHURE
FAST TRACK YOUR ACCESS TO BIG DATA WITH ANGOSS ADVANCED ANALYTICS ON SPARK Are you drowning in Big Data? Do you lack access to your data? Are you having a hard time managing Big Data processing requirements?
More informationIBM SPSS & Apache Spark
IBM SPSS & Apache Spark Making Big Data analytics easier and more accessible ramiro.rego@es.ibm.com @foreswearer 1 2016 IBM Corporation Modeler y Spark. Integration Infrastructure overview Spark, Hadoop
More informationMISSING DATA CLASSIFICATION OF CHRONIC KIDNEY DISEASE
MISSING DATA CLASSIFICATION OF CHRONIC KIDNEY DISEASE Wala Abedalkhader and Noora Abdulrahman Department of Engineering Systems and Management, Masdar Institute of Science and Technology, Abu Dhabi, United
More informationSAS Machine Learning and other Analytics: Trends and Roadmap. Sascha Schubert Sberbank 8 Sep 2017
SAS Machine Learning and other Analytics: Trends and Roadmap Sascha Schubert Sberbank 8 Sep 2017 How Big Analytics will Change Organizations Optimization and Innovation Optimizing existing processes Customer
More informationPREDICTION OF SOCIAL NETWORK SITES USING WEKA TOOL
PREDICTION OF SOCIAL NETWORK SITES USING WEKA TOOL G.Thirumani Aatthi 1, R.Aishwarya 2, R.Mallika 3, A.Angel 4 1 Assistant Professor, 2,3,4 M.Sc(CS&IT), Department of Computer Science & Information Technology,
More informationMovie Success Prediction PROJECT REPORT. Rakesh Parappa U CS660
Movie Success Prediction PROJECT REPORT Rakesh Parappa U01382090 CS660 Abstract The report entails analyzing different variables like movie budget, actor s Facebook likes, director s Facebook likes and
More informationPractices of Business Intelligence
Tamkang University Practices of Business Intelligence II Tamkang University (Descriptive Analytics II: Business Intelligence and Data Warehousing) 1071BI05 MI4 (M2084) (2888) Wed, 7, 8 (14:10-16:00) (B217)
More informationDATA SCIENCE: HYPE AND REALITY PATRICK HALL
DATA SCIENCE: HYPE AND REALITY PATRICK HALL About me SAS Enterprise Miner, 2012 Cloudera Data Scientist, 2014 Do you use Kolmogorov Smirnov often? Statistician No, I mix my martinis with gin. Data Scientist
More informationModule - 01 Lecture - 03 Descriptive Statistics: Graphical Approaches
Introduction of Data Analytics Prof. Nandan Sudarsanam and Prof. B. Ravindran Department of Management Studies and Department of Computer Science and Engineering Indian Institution of Technology, Madras
More informationEvaluation of Machine Learning Algorithms for Satellite Operations Support
Evaluation of Machine Learning Algorithms for Satellite Operations Support Julian Spencer-Jones, Spacecraft Engineer Telenor Satellite AS Greg Adamski, Member of Technical Staff L3 Technologies Telemetry
More informationAPPRENTICESHIP. Apprentice Employer Training Program Sponsor Warren County Career Center Your Local Educational Agency. Page 1
APPRENTICESHIP Apprentice Employer Training Program Sponsor Warren County Career Center Your Local Educational Agency Page 1 Table of Contents Statement of Purpose....2 What is an Apprenticeship?...3 What
More informationSentiment analysis using Singular Value Decomposition
International Journal of Current Engineering and Technology E-ISSN 2277 4106, P-ISSN 2347 5161 2016 INPRESSCO, All Rights Reserved Available at http://inpressco.com/category/ijcet Research Article Veena
More informationAssignment 1 (Sol.) Introduction to Data Analytics Prof. Nandan Sudarsanam & Prof. B. Ravindran
Assignment 1 (Sol.) Introduction to Data Analytics Prof. Nandan Sudarsanam & Prof. B. Ravindran 1. In inferential statistics, the aim is to: (a) learn the properties of the sample by calculating statistics
More informationUsing Text Mining and Machine Learning to Predict the Impact of Quarterly Financial Results on Next Day Stock Performance.
Using Text Mining and Machine Learning to Predict the Impact of Quarterly Financial Results on Next Day Stock Performance Itamar Snir The Leonard N. Stern School of Business Glucksman Institute for Research
More informationHow to build and deploy machine learning projects
How to build and deploy machine learning projects Litan Ilany, Advanced Analytics litan.ilany@intel.com Agenda Introduction Machine Learning: Exploration vs Solution CRISP-DM Flow considerations Other
More informationPredictive Analytics Using Support Vector Machine
International Journal for Modern Trends in Science and Technology Volume: 03, Special Issue No: 02, March 2017 ISSN: 2455-3778 http://www.ijmtst.com Predictive Analytics Using Support Vector Machine Ch.Sai
More informationPERM FREQUENTLY ASKED QUESTIONS
Background: The first step in most employment based immigration processes is the filing of an Application for Alien Employment Certification or Labor Certification. The process by which this application
More informationDATA SETS. What were the data collection process. When, where, and how? PRE-PROCESSING. Natural language data pre-processing steps
01 DATA SETS What were the data collection process. When, where, and how? 02 PRE-PROCESSING Natural language data pre-processing steps 03 RULE-BASED CLASSIFICATIONS Without using machine learning, can
More informationData mining: Identify the hidden anomalous through modified data characteristics checking algorithm and disease modeling By Genomics
Data mining: Identify the hidden anomalous through modified data characteristics checking algorithm and disease modeling By Genomics PavanKumar kolla* kolla.haripriyanka+ *School of Computing Sciences,
More informationA Study of Financial Distress Prediction based on Discernibility Matrix and ANN Xin-Zhong BAO 1,a,*, Xiu-Zhuan MENG 1, Hong-Yu FU 1
International Conference on Management Science and Management Innovation (MSMI 2014) A Study of Financial Distress Prediction based on Discernibility Matrix and ANN Xin-Zhong BAO 1,a,*, Xiu-Zhuan MENG
More informationAnalytical Capability Security Compute Ease Data Scale Price Users Traditional Statistics vs. Machine Learning In-Memory vs. Shared Infrastructure CRAN vs. Parallelization Desktop vs. Remote Explicit vs.
More informationFraud Detection for MCC Manipulation
2016 International Conference on Informatics, Management Engineering and Industrial Application (IMEIA 2016) ISBN: 978-1-60595-345-8 Fraud Detection for MCC Manipulation Hong-feng CHAI 1, Xin LIU 2, Yan-jun
More informationMACHINE LEARNING BASED ELECTRICITY DEMAND FORECASTING
MACHINE LEARNING BASED ELECTRICITY DEMAND FORECASTING Zeynep Çamurdan, Murat Can Ganiz Department of Computer Engineering, Marmara University Istanbul/Turkey {zeynep.camurdan, murat.ganiz}@marmara.edu.tr
More informationReal Estate Appraisal
Real Estate Appraisal CS229 Machine Learning Final Project Writeup David Chanin, Ian Christopher, and Roy Fejgin December 10, 2010 Abstract This is our final project for Machine Learning (CS229) during
More informationDatameer for Data Preparation: Empowering Your Business Analysts
Datameer for Data Preparation: Empowering Your Business Analysts As businesses strive to be data-driven organizations, self-service data preparation becomes a critical cog in the analytic process. Self-service
More informationData Warehousing Class Project Report
Portland State University PDXScholar Engineering and Technology Management Student Projects Engineering and Technology Management Winter 2018 Data Warehousing Class Project Report Gaya Haciane Portland
More information25 th Meeting of the Wiesbaden Group on Business Registers - International Roundtable on Business Survey Frames. Tokyo, 8 11 November 2016.
25 th Meeting of the Wiesbaden Group on Business Registers - International Roundtable on Business Survey Frames Tokyo, 8 11 November 2016 Michael E. Kornbau U.S. Census Bureau Session No. 5 Technology
More informationCan Cascades be Predicted?
Can Cascades be Predicted? Rediet Abebe and Thibaut Horel September 22, 2014 1 Introduction In this presentation, we discuss the paper Can Cascades be Predicted? by Cheng, Adamic, Dow, Kleinberg, and Leskovec,
More informationMAACCE 2014 Annual Conference May 8, Jones Nhinson Williams, BLS Programs Administrator Office of Workforce Information and Performance
MAACCE 2014 Annual Conference May 8, 2014 Jones Nhinson Williams, BLS Programs Administrator Office of Workforce Information and Performance Goal: Assist participants learn and understand how to identify
More informationGetting Started with OptQuest
Getting Started with OptQuest What OptQuest does Futura Apartments model example Portfolio Allocation model example Defining decision variables in Crystal Ball Running OptQuest Specifying decision variable
More informationData mining and Renewable energy. Cindi Thompson
Data mining and Renewable energy Cindi Thompson June 2012 Analytics, Big Data, and Data Science 1 What is Analytics? makes extensive use of data, statistical and quantitative analysis, explanatory and
More informationData Analytics Training Program using
Data Analytics Training Program using In exclusive association with 1200+ Trainings 20,000+ Participants 10,000+ Brands 45+ Countries [Since 2009] Training partner for Who Is This Course For? Programers
More informationCHAPTER ONE: OVEVIEW OF MANAGERIAL ACCOUNTING
CHAPTER ONE: OVEVIEW OF MANAGERIAL ACCOUNTING The Basic Objectives of Accounting Basic objective of accounting is to provide stakeholders with useful information about a business enterprise in order to
More informationKENT STATE UNIVERSITY GUIDELINES FOR RECRUITING AND HIRING INTERNATIONAL PROFESSIONALS & FACULTY MEMBERS
KENT STATE UNIVERSITY GUIDELINES FOR RECRUITING AND HIRING INTERNATIONAL PROFESSIONALS & FACULTY MEMBERS Suggested practices to preserve the applicant s ability to apply for permanent residency ( green
More information