Logistic Regression Analysis

What is a Logistic Regression Analysis?

Logistic Regression (LR) is a type of statistical analysis that can be performed on employer data. LR is used to examine the effects that external factors have on a selection decision.

How Does Logistic Regression Help?

LR allows an employer to challenge an allegation of Adverse Impact (AI) in a selection process. By making it possible to demonstrate a link between hiring decisions and relevant, job-related factors, LR can provide employers with a valuable rebuttal to a charge of AI.

How is Logistic Regression Different?

In a standard 2x2 selection rate test (such as Fisher's Exact Test), only two variables are considered:
- The selection variable (e.g. Hired)
- The at-issue group variable (e.g. Gender)

This provides a good first look, but fails to take into account any explanatory factors. In contrast, an LR analysis can examine multiple variables at the same time.

Aside: When is a Logistic Regression analysis appropriate?

LR is a very powerful statistical tool, and as such care needs to be taken to make sure it is used appropriately. LR should only be used when a standard selection rate test has already given a statistically significant indication of Adverse Impact. The purpose of LR for an employer is to use valid job-related factors to explain the difference in selection rates.

The strength of LR is in its use of multiple explanatory variables, but this can be abused by constructing an erroneous model (using incorrect variables) which indicates a theoretical shortfall in a selection process, even if there is no practical shortfall.

As such, LR should not be the primary threshold indicator used; otherwise employers would have to account for all potential problems even in the absence of practical problems.

What Do I Need for a Logistic Regression Analysis?

An LR analysis will require more data than a standard 2x2 selection test. An LR analysis requires data for each field that the employer wishes to examine:
- Hired
- Gender
- Prior Experience
- Etc.

Note: The fields used need to be directly related to the selection decision at hand.

How do I Build my Data for a Logistic Regression Analysis?

Step 1: Verify the data
- Make sure all necessary fields are present
- Make sure original values are understood

Step 2: Check data integrity
- Generate a frequency table for the fields that will be used.
- Check for missing values
  o If there are any, determine how they will be handled
- Make sure dichotomous (two-valued) variables are actually dichotomous.

Step 3: Recode fields as necessary
- In order to run the LR, all fields must be coded into numeric values.
- Many fields will be dummy coded, most frequently assigning values of 1 and 0 in place of text values.
- When dummy coding, the focal or at-issue group should be coded with 0, while the reference group should be coded 1.
  o E.g. Gender values of Male and Female would generally be coded to 1 and 0 respectively.
- Some fields will be better handled with multiple dummy values
  o E.g. Education may be coded High School = 1, AA Degree = 2, BA/BS Degree = 3, etc.
- Ranked dummy codes are appropriate when there is a clear order to the original values.
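The integrity checks and recoding in Steps 2 and 3 can be sketched in a few lines of Python. This is a minimal illustration, not the procedure of any particular statistics package: the applicant records, the `Yes`/`No` values for Hired, and the helper names are all hypothetical, and dropping incomplete records is only one of several ways missing values might be handled.

```python
from collections import Counter

# Hypothetical applicant records; field names and values are illustrative only.
applicants = [
    {"Hired": "Yes", "Gender": "Male",   "Education": "BA/BS Degree"},
    {"Hired": "No",  "Gender": "Female", "Education": "AA Degree"},
    {"Hired": "No",  "Gender": "Male",   "Education": "High School"},
    {"Hired": "Yes", "Gender": "Female", "Education": ""},  # missing value
]

# Coding schemes following the text: reference group = 1, focal group = 0,
# and ranked codes for the ordered Education field.
CODES = {
    "Hired": {"Yes": 1, "No": 0},
    "Gender": {"Male": 1, "Female": 0},
    "Education": {"High School": 1, "AA Degree": 2, "BA/BS Degree": 3},
}

def frequency_table(records, field):
    """Step 2: count how often each raw value occurs (missing shown as '')."""
    return Counter(r.get(field) or "" for r in records)

def is_dichotomous(records, field):
    """Step 2: True when the field has exactly two distinct non-missing values."""
    return len({r[field] for r in records if r.get(field)}) == 2

def recode(records):
    """Step 3: map text values to numbers, skipping incomplete records."""
    coded = []
    for r in records:
        try:
            coded.append({f: CODES[f][r[f]] for f in CODES})
        except KeyError:
            continue  # assumption: records with missing or unknown values are dropped
    return coded

for field in CODES:
    print(field, dict(frequency_table(applicants, field)))
print(recode(applicants))
```

Reviewing the frequency tables before recoding is what surfaces unexpected values (a third Gender value, a misspelled degree) that would otherwise be silently dropped by the recoding step.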

Rarely, more complex dummy coding may be justified. This generally involves applying mathematical transformations to the numeric values already present, and is generally best handled by an expert after constructing the remainder of the LR analysis.

How do I construct my Logistic Regression model?

Constructing an LR model for a selection decision is a highly iterative task. Out of all the factors that may affect the selection decision, the best LR model will include only those which are statistically or practically important.

Variables which are practically important are those factors which are explicitly known to be considered in the selection decision.
  o E.g., hiring managers may state that they prefer candidates with a minimum level of experience (for example, 2 years), but they do not require it for the position.

Variables which are statistically important are found by performing a univariate analysis of each potential predictor against the selection decision.
  o E.g. a company may have records for applicants' prior work experience, education level, and score on an employment screening test. Each of these would need to be dummy coded and analyzed against the decision variable.

Unlike a standard statistical test, potential predictor variables need not be statistically significant (p < .05) to be included in a model. In constructing the model, p < .25 is a reasonable guideline for preliminary inclusion.

After the practical and statistical variables are determined, the model is created iteratively through selecting some or all of the variables, including them in the analysis, and evaluating the fit of the resulting model.
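The univariate screening step can be illustrated with a short Python sketch. It assumes each candidate predictor has already been dummy coded to a single 0/1 field, in which case the Wald test from a one-predictor logistic regression reduces to the log odds ratio of the 2x2 table divided by its standard error. The predictor names and counts below are hypothetical, and the helper `univariate_p_value` is an illustration rather than part of any statistics package.

```python
import math

def univariate_p_value(hired_with, total_with, hired_without, total_without):
    """Wald p-value for one 0/1 predictor against the 0/1 selection decision."""
    a, b = hired_with, total_with - hired_with            # applicants with the predictor
    c, d = hired_without, total_without - hired_without   # applicants without it
    log_odds_ratio = math.log((a * d) / (b * c))
    std_error = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    wald = (log_odds_ratio / std_error) ** 2
    # Upper tail of a chi-square distribution with 1 degree of freedom.
    return math.erfc(math.sqrt(wald / 2))

# Hypothetical counts: (hired with, total with, hired without, total without).
candidates = {
    "has_experience": (20, 40, 10, 60),
    "has_degree": (16, 50, 14, 50),
    "passed_test": (17, 45, 13, 55),
}

SCREEN = 0.25  # preliminary inclusion guideline from the text

retained = []
for name, counts in candidates.items():
    p = univariate_p_value(*counts)
    keep = p < SCREEN
    print(f"{name}: p = {p:.3f} -> {'retain' if keep else 'drop'}")
    if keep:
        retained.append(name)
```

In this made-up data, `passed_test` would fail a conventional p < .05 cutoff but still passes the looser p < .25 screen, which is the situation the guideline is meant to accommodate. The formula breaks down if any cell of the 2x2 table is zero.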

Example:

Company XYZ receives 100 applications for a position for which it has 30 openings.
- Half the applicants are male and half are female.
- Ultimately, 25 males are hired and 5 females are hired.

This clearly violates the 80% rule of thumb, and is statistically significant as well. XYZ is surprised by this, and investigates what factors contributed to the hiring.

Factors:

XYZ examines its hiring process, and identifies three factors (Factor1, Factor2, and Factor3) that are important considerations in the process. XYZ examines its applicant data, and observes some key statistics:

Factor1:

While 51 out of the 100 (51%) applicants were identified as having Factor1, that factor was not evenly distributed between men and women: 42 out of the 50 men (84%) had Factor1, while only 9 of the 50 women (18%) had the factor. This means men were roughly four and a half times more likely than women to have Factor1.
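The statement that XYZ's hiring outcome violates the 80% rule and is statistically significant can be checked directly from the counts given (25 of 50 men hired, 5 of 50 women hired). The sketch below uses only the Python standard library; the two-sided Fisher's Exact Test is written out by hand here as an illustration, summing the probabilities of all tables that are no more likely than the observed one.

```python
from math import comb

men_applied, men_hired = 50, 25
women_applied, women_hired = 50, 5

# 80% rule: the focal group's selection rate divided by the reference group's.
men_rate = men_hired / men_applied
women_rate = women_hired / women_applied
impact_ratio = women_rate / men_rate

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's Exact Test for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    total = row1 + row2

    def prob(x):
        # Hypergeometric probability of x selections in row 1, margins fixed.
        return comb(row1, x) * comb(row2, col1 - x) / comb(total, col1)

    observed = prob(a)
    low, high = max(0, col1 - row2), min(row1, col1)
    # Small tolerance so tables tied with the observed one are included.
    return sum(prob(x) for x in range(low, high + 1)
               if prob(x) <= observed * (1 + 1e-9))

p_value = fisher_exact_two_sided(
    men_hired, men_applied - men_hired,
    women_hired, women_applied - women_hired,
)

print(f"selection rates: men {men_rate:.0%}, women {women_rate:.0%}")
print(f"impact ratio: {impact_ratio:.2f} (below 0.80 fails the 80% rule)")
print(f"Fisher's Exact Test p-value: {p_value:.6f}")
```

The impact ratio is 10% / 50% = 0.20, well under the 0.80 threshold, and the p-value falls far below .05, which is what makes the follow-up LR analysis appropriate under the guidance above.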

[Figure: Factor 1 by Gender. Bar chart of the number of applicants (female, male) with and without the factor.]

Factor2:

Statistics for Factor2 were similar to Factor1:
- 55 out of 100 (55%) applicants had Factor2
  o 43 out of 50 men (86%) had Factor2, while
  o 12 out of 50 women (24%) had Factor2.
- Men were found to be roughly three and a half times more likely than women to have Factor2.

[Figure: Factor 2 by Gender. Bar chart of the number of applicants (female, male) with and without the factor.]

Factor3:

And again for Factor3:
- 45 out of 100 (45%) applicants had Factor3
  o 40 out of 50 men (80%) had Factor3, while
  o 5 out of 50 women (10%) had Factor3.
- Men were found to be eight times more likely to have Factor3.

[Figure: Factor 3 by Gender. Bar chart of the number of applicants (female, male) with and without the factor.]

What Next?:

XYZ wonders if these statistical differences in its applicant group could explain why so many more men than women were hired. Because XYZ is concerned with the effects of external factors on a selection decision where AI has already been established, a Logistic Regression analysis is appropriate.
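As a quick check on the applicant statistics quoted for Factor1 through Factor3, the rates and male-to-female ratios can be recomputed from the counts in the example. The variable names are illustrative; the counts are the ones stated above.

```python
applicants_per_group = 50  # 50 men and 50 women applied

# Number of applicants having each factor, as (men, women).
with_factor = {
    "Factor1": (42, 9),
    "Factor2": (43, 12),
    "Factor3": (40, 5),
}

ratios = {}
for name, (men, women) in with_factor.items():
    men_rate = men / applicants_per_group
    women_rate = women / applicants_per_group
    ratios[name] = men_rate / women_rate
    print(f"{name}: {men + women} of 100 applicants; "
          f"men {men_rate:.0%}, women {women_rate:.0%}, "
          f"ratio {ratios[name]:.2f}")
```

The ratios come out at about 4.67, 3.58, and 8, in line with the "roughly four and a half", "roughly three and a half", and "eight times" figures in the text.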

Analysis Set-Up:

Before an analysis can be performed, the applicant data must first be dummy coded, giving fields such as:
- A Hired field, where a value of 1 indicates that the applicant was hired and a value of 0 indicates that they were not.
- A Gender field, where 1 indicates Male and 0 indicates Female.
- A field for each factor (Factor1 through Factor3), using 1 to indicate the applicant has the factor and 0 to show the applicant lacks the factor.

Analysis (Baseline):

In order to examine the effects of the three factors, XYZ needs to establish a baseline for comparison. Using a statistical tool like SPSS or SAS, XYZ performs a Binomial Logistic Regression using the decision variable (e.g. Hired) as the dependent variable, and the at-issue variable (in this case Gender) as the covariate. This results in a table like the following:

Variables in the Equation (numeric values not preserved in this transcription)
                       B    S.E.    Wald    df    Sig.    Exp(B)
Step 1(a)   Gender
            Constant
a Variable(s) entered on step 1: Gender.

The significance value for gender is less than .05, so Gender is statistically significantly linked to being hired.

Analysis (Factor1):

Once the baseline is established, XYZ should run at least three additional LR analyses, one for each of the factors. These analyses should be set up the same as the baseline, except that the factor in question should be added to the list of covariates. This results in a table like the following (for Factor1):

Variables in the Equation (numeric values not preserved in this transcription)
                       B    S.E.    Wald    df    Sig.    Exp(B)
Step 1(a)   Gender
            Factor1
            Constant
a Variable(s) entered on step 1: Gender, Factor1.

The significance value for gender is no longer less than .05. This means that the differences in Factor1 between men and women explain away the differences in hiring rates.
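The baseline and single-factor analyses can be sketched without SPSS or SAS. The code below is a minimal pure-Python logistic regression fitted by Newton-Raphson, written for illustration only; `fit_logistic` and `invert` are hypothetical helpers, not a package API. The baseline uses the hiring counts from the example. The transcript does not say how hires split within each Gender by Factor1 cell, so the second data set is invented purely for illustration (chosen only to match the stated totals), and its output should not be read as XYZ's result.

```python
import math

def invert(matrix):
    """Invert a small square matrix by Gauss-Jordan elimination."""
    n = len(matrix)
    aug = [list(row) + [1.0 if i == j else 0.0 for j in range(n)]
           for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [v / scale for v in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [v - factor * p for v, p in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]

def fit_logistic(cells, names, iterations=30):
    """Fit Hired ~ covariates + constant on grouped data.

    cells: list of (covariate_values, n_applicants, n_hired).
    Returns {name: (B, S.E., Wald, Sig., Exp(B))}.
    """
    k = len(names) + 1
    beta = [0.0] * k
    for _ in range(iterations):
        grad = [0.0] * k
        info = [[0.0] * k for _ in range(k)]
        for covs, n, hired in cells:
            x = list(covs) + [1.0]  # last column is the constant
            p = 1.0 / (1.0 + math.exp(-sum(b * v for b, v in zip(beta, x))))
            for i in range(k):
                grad[i] += (hired - n * p) * x[i]
                for j in range(k):
                    info[i][j] += n * p * (1.0 - p) * x[i] * x[j]
        cov = invert(info)  # estimated covariance of the coefficients
        beta = [b + sum(cov[i][j] * grad[j] for j in range(k))
                for i, b in enumerate(beta)]
    table = {}
    for i, name in enumerate(list(names) + ["Constant"]):
        se = math.sqrt(cov[i][i])
        wald = (beta[i] / se) ** 2
        sig = math.erfc(math.sqrt(wald / 2.0))  # chi-square tail, 1 df
        table[name] = (beta[i], se, wald, sig, math.exp(beta[i]))
    return table

# Baseline, from the example: Gender 1 = Male (25 of 50 hired), 0 = Female (5 of 50).
baseline = fit_logistic([((1,), 50, 25), ((0,), 50, 5)], ["Gender"])

# HYPOTHETICAL split of hires by (Gender, Factor1); not taken from the source.
adjusted = fit_logistic(
    [((1, 1), 42, 24), ((1, 0), 8, 1), ((0, 1), 9, 4), ((0, 0), 41, 1)],
    ["Gender", "Factor1"],
)

for label, table in (("Baseline", baseline), ("With Factor1", adjusted)):
    print(label)
    for name, (b, se, wald, sig, exp_b) in table.items():
        print(f"  {name:<9} B={b:.3f} S.E.={se:.3f} Wald={wald:.2f} "
              f"df=1 Sig.={sig:.4f} Exp(B)={exp_b:.2f}")
```

For the baseline, the fit reproduces what the counts imply directly: the odds of being hired are 25/25 for men and 5/45 for women, so Exp(B) for Gender is 9. The comparison of interest is how the Gender row's B and Sig. change once a factor is added as a covariate. Plain Newton-Raphson as written has no safeguards for perfectly separated data, which a production tool would handle.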

Analysis (Factors 2 and 3):

The result tables for the analyses for Factor2 and Factor3 are as follows:

Factor2: Variables in the Equation (numeric values not preserved in this transcription)
                       B    S.E.    Wald    df    Sig.    Exp(B)
Step 1(a)   Gender
            Factor2
            Constant
a Variable(s) entered on step 1: Gender, Factor2.

The significance value for gender is still less than .05. This means that Factor2 alone is insufficient to explain away the differences in hiring rates.

Factor3: Variables in the Equation (numeric values not preserved in this transcription)
                       B    S.E.    Wald    df    Sig.    Exp(B)
Step 1(a)   Gender
            Factor3
            Constant
a Variable(s) entered on step 1: Gender, Factor3.

The significance value for gender in the Factor3 analysis is, as in the Factor1 analysis, greater than .05, so Factor3, unlike Factor2, is sufficient to explain the differences in hiring rates.

Analysis Results Summary: