A Comparison of Propensity Score Matching. A Simulation Study

A Comparison of Propensity Score Matching Methods in R with the MatchIt Package: A Simulation Study

A thesis submitted to the Graduate School of the University of Cincinnati in partial fulfillment of the requirements for the degree of Master of Arts in the Department of Quantitative and Mixed Methods Research Methodology of the College of Education, Criminal Justice, and Human Services

by Jiaqi Zhang
B.A. Chengdu College of University of Electronic Science and Technology of China
March 2013

Committee Chair: Christopher M. Swoboda, Ph.D.

Abstract

Propensity score matching (PSM) methods are becoming increasingly popular in non-experimental and observational studies as a way to reduce selection bias by balancing measured covariates. These methods have developed into a relatively systematic branch of matching methodology. MatchIt is a package for the statistical programming software R that supports several matching methods, including nearest neighbor, caliper, stratification, and full matching, in order to find cases balanced on the propensity score between the treatment and control groups and to support causal inference. Choosing among those options can be confusing for researchers. In the present study, these methods are explained and a simulation study is conducted using example data to illustrate the differences among them. The generated individuals are assigned to treatment based on a function of observed covariates and randomness, simulating selection bias, and the data are analyzed to examine whether any of five popular propensity score matching methods performs more effectively in balancing covariates and reducing selection bias at a given sample size. This study shows that each propensity score matching method, Nearest Neighbor (1:1), Nearest Neighbor (2:1), Caliper, Stratification, and Full matching, performs well in matching, and that all provide strong evidence for making causal inferences. This is particularly true for Caliper and Full matching. R code, detailed results, and suggestions for future study are also provided.

Keywords: propensity score, covariates, bias, simulation, causal inference


Acknowledgement

I would like to express my deepest appreciation to my committee chair, Professor Christopher M. Swoboda, for his guidance, monitoring, and constant encouragement throughout the whole process of this thesis. By doing article reviews with him continually for over a year, I began to focus on propensity score estimation. His erudite quantitative knowledge, interesting teaching, and encouragement all conveyed a spirit of adventure with regard to research. Without his guidance, encouragement, and persistent help, this thesis would not have been possible.

I would also like to thank my committee member, Professor Marcus L. Johnson, whose courses and knowledge have allowed me to build my skills in writing literature reviews. His knowledge of IRB and APA format has also helped me across all of my studies.

In addition, I thank my former advisor, Professor Wei Pan of Duke University, who introduced me to SPSS and to the statistical knowledge I needed to finish this thesis.

Table of Contents

Abstract
Acknowledgement
List of Tables
List of Figures
1. Introduction
  1.1 Rubin Causal Model
  1.2 Propensity Score Matching Methods
    1.2.1 Estimating Propensity Score
    1.2.2 Matching individuals using the propensity score
      1.2.2.1 Nearest Neighbor Matching
      1.2.2.2 Caliper Matching
      1.2.2.3 Stratification Matching
      1.2.2.4 Full Matching
2. Research Overview and Research Questions
3. Methods
  3.1 Variables
  3.2 Data Generation
  3.3 Propensity Score Matching
  3.4 Analysis of Covariance for Data Before Matching
  3.5 Selecting the Appropriate Method
4. Results
  4.1 Understand the Generated Data
  4.2 Reducing Distance Using Propensity Scores
  4.3 Results for Different Matching methods
  4.4 Results for independent t-test
  4.5 Result for ANCOVA on data before matching
5. Conclusion
6. Discussion
7. Reference
Appendix A1: Results for 1000 Sample Size
Appendix A2: Results for 2500 Sample Size
Appendix A3: Results for 5000 Sample Size
Appendix B: R Codes

List of Tables

Table 1: Standardized Summary of Generated Data with 1000 Sample Size.
Table 2: Average Proportions for SES and ISS.
Table 3: Standardized Mean Difference on Propensity Score with 1000 Sample Size Data after applying matching methods across all simulation runs.
Table 4: Matching Results for Different Matching Methods under 1000 Sample Size.
Table 5: Average Standardized Mean Difference for covariates and propensity score of different matching methods under different sample size.
Table 6: Results for Independent t test for different matching methods with 1000 sample size.
Table 7: Overall ANCOVA Results with 1000 sample size.
Table 8: Combined Overall Comparison for Propensity Score Matching Methods under Different Sample Size.
Table 9: Average Standardized Mean Difference for covariates and propensity score of different matching methods with 1000 sample size.
Table 10: Standardized Summary of Generated Data with 2500 Sample Size.
Table 11: Matching Results for Different Matching Methods under 2500 Sample Size.
Table 12: Results for Independent t test for different matching methods with 2500 sample size.
Table 13: Overall ANCOVA Results with 2500 sample size.
Table 14: Average Standardized Mean Difference for covariates and propensity score of different matching methods with 2500 sample size.
Table 15: Standardized Summary of Generated Data with 5000 Sample Size.
Table 16: Matching Results for Different Matching Methods under 5000 Sample Size.
Table 17: Results for Independent t test for different matching methods with 5000 sample size.
Table 18: Overall ANCOVA Results with 5000 sample size.
Table 19: Average Standardized Mean Difference for covariates and propensity score of different matching methods with 5000 sample size.

List of Figures

Figure 1: Different Propensity Matching Methods Considered In This Paper.
Figure 2: Matching Results for Different Matching Methods with 1000 Sample Size.
Figure 3: Jitter Plot of Distribution of Propensity Score for different methods with 1000 Sample Size.
Figure 4: Average Histograms of Propensity Score for Matched and Unmatched Individual in Both Treatment and Control Groups for All Runs of Simulation with 1000 sample size.
Figure 5: Matching Results for Different Matching Methods with 2500 Sample Size.
Figure 6: Jitter Plot of Distribution of Propensity Score for different methods with 2500 Sample Size.
Figure 7: Average Histograms of Propensity Score for Matched and Unmatched Individual in Both Treatment and Control Groups for All Runs of Simulation with 2500 sample size.
Figure 8: Matching Results for Different Matching Methods with 5000 Sample Size.
Figure 9: Jitter Plot of Distribution of Propensity Score for different methods with 5000 Sample Size.
Figure 10: Average Histograms of Propensity Score for Matched and Unmatched Individual in Both Treatment and Control Groups for All Runs of Simulation with 5000 sample size.

1. Introduction

In educational research, designs with random samples and random assignment provide the strongest evidence and confidence for researchers to make a causal inference about a treatment's effectiveness. However, in most cases, experimental designs are challenging to achieve. Researchers need to account for selection bias, which is defined as "any characteristic of a sample that is believed to make it different from the study population in some important way" (York, 1998, p. 239). Selection bias arises from the lack of randomization in assigning individuals to treatment or control groups, because those groups are often neither equivalent nor comparable. As a solution to the problem of making causal inferences about treatment effectiveness in the presence of selection bias, propensity score matching methods are employed to balance treatment and control groups in many fields, including statistics (Rubin, 2006), medicine (Christakis & Iwashyna, 2003), economics (Abadie & Imbens, 2006), political science (Herron & Wand, 2007), sociology (Morgan & Harding, 2006), and even law (Rubin, 2001).

Propensity Scores (PS) (Rubin & Rosenbaum, 1984), originally introduced to reduce selection bias and estimate treatment effects in observational studies, have become an important tool for achieving causal inference with quasi-experimental data. Propensity Score Matching (PSM) is a statistical matching technique that attempts to estimate the effect of a treatment while accounting for the covariates that predict whether an individual received the treatment. However, because there are many different propensity score matching methods, researchers can find it challenging to decide which one to use in their data analysis.

This paper compares five major propensity score matching methods, Nearest Neighbor Matching (1:1), Nearest Neighbor Matching (2:1), Caliper Matching, Stratification Matching, and Full Matching, under different sample sizes by generating simulated data and then applying PSM with the MatchIt package in the statistical software R. Differences in the effectiveness of these methods on the simulated example data can guide researchers in their choice of matching mechanism. Further, understanding the impact of varying sample sizes in this context can illuminate the choice of matching mechanism in different settings. In the rest of this paper, I introduce the different methods, present the simulation study used to compare them, summarize the results, and present conclusions and suggestions for future work.

1.1 Rubin Causal Model

The Rubin Causal Model is based on potential outcomes and the assignment mechanism: an individual has different potential outcomes depending on the condition assigned. Rubin (1974) defines a causal effect as:

    Intuitively, the causal effect of one treatment E, over another C, for a particular individual and an interval of time from t_1 to t_2 is the difference between what would have happened at time t_2, if the individual had been exposed to E initiated at t_1 and what would have happened at t_2, if the individual had been exposed to C initiated at t_1. (p. 689)

According to Rubin's definition,

    \tau_i = Y_{i1} - Y_{i0}

where $\tau_i$ denotes the treatment effect for individual i, $Y_{i1}$ denotes the potential outcome for individual i in the treatment group, and $Y_{i0}$ denotes the potential outcome for individual i in the control group. The observed outcome for individual i is obtained by:

    Y_i = T_i Y_{i1} + (1 - T_i) Y_{i0}

where

    T_i = \begin{cases} 1, & i \in \{\text{Treatment}\} \\ 0, & i \in \{\text{Control}\} \end{cases}

Causal inference can also be thought of as a missing data problem, because for each individual i only one potential outcome, either $Y_{i1}$ or $Y_{i0}$, can be observed at one time. For example, obtaining a Master's degree rather than not may have a true causal effect on an individual's income. To evaluate that causal effect, researchers would want to observe the income of the same individual in both the treatment condition (with a Master's degree) and the control condition (without a Master's degree). However, an individual can only be assigned to either the treatment group or the control group. Therefore, if an individual pursues a Master's degree, researchers cannot observe the potential outcome of that same individual not getting a Master's degree. Instead, the unit of analysis is changed to the group level, so that causal inferences are made through group mean differences.

In the ideal case, if randomization is applied, observed and unobserved potential outcomes are balanced across treatment and control groups with high probability as the sample size increases, because random assignment makes treatment assignment independent of the potential outcomes $Y_{i1}$ and $Y_{i0}$. In notation, $\{Y_{i0}, Y_{i1}\} \perp T_i$. That is, for j = 0, 1:

    E(Y_{ij} \mid T_i = 1) = E(Y_{ij} \mid T_i = 0) = E(Y_i \mid T_i = j)

Thus, the average treatment effect (ATE) is estimated as:

    \tau = E(Y_{i1} \mid T_i = 1) - E(Y_{i0} \mid T_i = 0) = E(Y_i \mid T_i = 1) - E(Y_i \mid T_i = 0)

Unfortunately, in most situations randomization is not available, and as a result covariates are almost never balanced across treatment and control groups (Jasjeet, 2011). Because of this imbalance between treatment and control groups, the average treatment effect for the treated (ATT) is of interest:

    \tau \mid (T = 1) = E(Y_{i1} \mid T_i = 1) - E(Y_{i0} \mid T_i = 1)

When randomization is achieved, we can estimate the average treatment effect (ATE) directly; otherwise, we need to consider the average treatment effect for the treated (ATT). By examining the observable covariates X, researchers can describe how individuals are selected into treatment and control groups. Such selection produces selection bias when randomization is unavailable, but the bias can be reduced under the assumption of conditional independence, $E(Y_{ij} \mid T_i, X_i) = E(Y_{ij} \mid X_i)$ (Heckman, Ichimura, Smith, & Todd, 1998). Following Rubin (1977), for j = 0, 1:

    E(Y_{ij} \mid X_i, T_i = 1) = E(Y_{ij} \mid X_i, T_i = 0) = E(Y_i \mid X_i, T_i = j)

where $X_i$ is the vector of observed covariates for individual i in the treatment or control group. Thus, the average treatment effect for the treated (ATT) is estimated as

    \tau \mid (T = 1) = E\{ E(Y_i \mid X_i, T_i = 1) - E(Y_i \mid X_i, T_i = 0) \mid T_i = 1 \}

One route to achieve this comparison is through propensity score matching.
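To make the distinction between the ATE, the ATT, and a naive group comparison concrete, the following minimal Python sketch works through a tiny table of potential outcomes. The four units and their values are hypothetical and chosen only for illustration; in real data only one potential outcome per individual is ever observed, so the "true" quantities below are available only because the example is constructed by hand.

```python
from statistics import mean

# Hypothetical units: (y0, y1, t) = outcome under control, outcome under
# treatment, and treatment indicator. Treated units start from a higher
# baseline, which mimics selection bias.
units = [
    (10, 12, 1),
    (8, 11, 1),
    (6, 7, 0),
    (4, 5, 0),
]

def observed(y0, y1, t):
    """Observed outcome Y_i = T_i * Y_i1 + (1 - T_i) * Y_i0."""
    return t * y1 + (1 - t) * y0

def true_ate(units):
    """Average of the individual effects Y_i1 - Y_i0 over everyone."""
    return mean(y1 - y0 for y0, y1, _ in units)

def true_att(units):
    """Average of the individual effects over treated individuals only."""
    return mean(y1 - y0 for y0, y1, t in units if t == 1)

def naive_difference(units):
    """Difference in observed group means, ignoring how units were selected."""
    treated = [observed(*u) for u in units if u[2] == 1]
    control = [observed(*u) for u in units if u[2] == 0]
    return mean(treated) - mean(control)

print(true_ate(units))          # 1.75
print(true_att(units))          # 2.5
print(naive_difference(units))  # 6.5
```

In this constructed example the naive difference (6.5) is far larger than either the ATE (1.75) or the ATT (2.5), because the treated units would have scored higher even without treatment.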

1.2 Propensity Score Matching Methods

The idea behind propensity score matching is to match individuals in unbalanced treatment and control groups based on their probability of receiving treatment, as estimated by the propensity score. Applying propensity score matching involves three main steps: (i) estimating the propensity score; (ii) matching individuals using the propensity score to create balance in observed covariates across treatment and control groups; and (iii) evaluating the quality of the balance for the matched data.

1.2.1 Estimating Propensity Score

The Propensity Score (PS) is defined as the conditional probability of receiving a given treatment given a vector of observed covariates for individual i with treatment $T_i$ (Rubin & Rosenbaum, 1984). The propensity score is the estimated probability $ps(i) = \Pr(T_i = 1 \mid X_i)$ of assignment to the treatment given a set of observable covariates $X_i$. Usually, researchers estimate the propensity score using logistic regression:

    \ln\left( \frac{ps(i)}{1 - ps(i)} \right) = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_n X_n

    ps(i) = \frac{\exp(\beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_n X_n)}{1 + \exp(\beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_n X_n)}

where $\beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_n X_n$ is a linear combination of the observed covariates. To apply this method, two assumptions are required; together they are referred to as strong ignorability in the Rubin Causal Model:

1) $(Y_{i0}, Y_{i1}) \perp T_i \mid X_i$, meaning the potential outcomes are independent of treatment after controlling for the observable covariates $X_i$.

2) $0 \le ps(i) \le 1$, meaning the propensity score is a probability that is no less than 0 and no greater than 1.

When using propensity score matching, the general tendency is to keep the control group as large as possible, which increases the likelihood of finding well-matched individuals for the treatment group. The rule of thumb is to choose a control group no more than nine times as large as the treatment group (Bagley, White, & Golomb, 2001).

Another essential issue is variable selection in the propensity score model. If selection is captured by the propensity score model, selection bias should be eliminated after the matching strategy is employed (Victor, 2011). Omitting important variables would leave selection bias in place and produce biased values for the ATT (Marco & Sabine, 2005). There are two basic guides for variable selection (Caliendo & Kopeinig, 2005):

- Include all variables related to the outcome variable. Including only known confounders will give biased and unbalanced results.
- Variables related to treatment but not related to the outcome variable should not be included.

1.2.2 Matching individuals using the propensity score

After estimating the propensity score for each individual, researchers select a propensity score matching method to perform the actual matching. Figure 1 shows the five main propensity score matching methods considered here: Nearest Neighbor (1:1) Matching, Nearest Neighbor (2:1) Matching, Caliper Matching, Stratification Matching, and Full Matching (Rubin, 1973; Althauser & Rubin, 1970; Rosenbaum & Rubin, 1984; Rosenbaum, 1991a).

Most matching methods incorporate a caliper to improve the quality of matching: by setting an acceptable maximum distance in propensity score, they prevent individuals with distant propensity scores from being matched.

1.2.2.1 Nearest Neighbor Matching

Nearest Neighbor Matching is the most straightforward matching method. In this method, one individual j from the control group is chosen as the match for a treated individual i by minimizing the difference between their estimated propensity scores, regardless of how large that difference is:

    C(P_i) = \min_j |P_i - P_j|

where $i \in \{\text{Treatment Group}\}$, $j \in \{\text{Control Group}\}$, $C(P_i)$ is the selected matching set of control subjects j matched to treated individual i, $P_i$ is the propensity score for treated individual i, and $P_j$ is the propensity score for control individual j. In other words, nearest neighbor matching aims to find the control individual j closest to treated individual i.

Nearest neighbor matching involves a trade-off between precision and bias. Precision is the accuracy across matching methods, while bias comes from matching errors. Two choices drive this trade-off:

a) Replacement: whether matching is done with or without replacement. With replacement, a control individual can be used more than once as a match; that is, each individual i in the treatment group is matched with its nearest individual j in the control group. As a result, the bias is reduced and the average quality of matching is increased, but fewer distinct individuals are selected, which decreases precision, because more than one treated individual i may be matched with the same control individual j. Without replacement, a control individual can be used only once, meaning each individual i in the treatment group is matched with its nearest unmatched individual j in the control group, which may not be the closest one overall.

b) Ratio: how many control individuals are matched to each treated individual. This directly affects how many individuals are included in the study. For example, suppose a study has 100 individuals in total, 30 in the treatment group and 70 in the control group. With nearest neighbor matching at a ratio of 1:1, at most 60 individuals are taken into consideration (30 treated plus 30 matched controls). With a ratio of 2:1, the maximum number of individuals involved is 90 (30 treated plus 60 matched controls). As a result, a lower ratio reduces bias and increases the average quality of matching but lowers precision, while a higher ratio raises precision but decreases the average quality of matching and increases bias.

1.2.2.2 Caliper Matching

Caliper matching (Cochran & Rubin, 1973) is a matching method developed from Nearest Neighbor Matching to handle cases in which the matched individual j is still statistically far away from the treated individual i. By setting a caliper, which is a tolerance level $\delta$ on the maximum propensity score distance $|P_i - P_j|$, researchers can avoid this problem. Formally:

    C(P_i) = \min_j |P_i - P_j| < \delta

where $i \in \{\text{Treatment Group}\}$, $j \in \{\text{Control Group}\}$, $C(P_i)$ is the selected matching set of control subjects j matched to treated individual i, $P_i$ is the propensity score for treated individual i, $P_j$ is the propensity score for control individual j, and $\delta$ is the pre-specified tolerance (caliper).

Performing caliper matching means that the individual j selected from the control group is the closest in propensity score to individual i in the treatment group, but only within the caliper. It is therefore important to select an appropriate caliper $\delta$, since an inappropriate caliper, especially a small $\delta$, may result in few individuals being matched. Cochran and Rubin (1973) offered guidance for determining an appropriate $\delta$; however, later researchers noted that a possible drawback of caliper matching is that it is difficult to know what choice of tolerance level is reasonable (Smith & Todd, 2005). Considering the Mahalanobis distance rather than the propensity score, Rubin and Rosenbaum (1985) suggested that a caliper of 0.2 standard deviations removes 98% of the bias if the variance in the treatment group is twice as large as that in the control group, and they generally suggest a caliper of 0.25 standard deviations of the propensity score. In this study, I followed their suggestion and used a caliper with tolerance $\delta = 0.25\,SD$.
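The nearest neighbor and caliper rules described above can be sketched in a few lines of Python. This is an illustrative greedy 1:1 matcher without replacement, not the MatchIt implementation: the function names, the processing order (treated units are handled in the order given), and the propensity scores are all assumptions made for this example.

```python
from statistics import stdev

def caliper_width(scores, multiplier=0.25):
    """Caliper as a multiple of the sample SD of the propensity scores."""
    return multiplier * stdev(scores)

def nearest_neighbor_match(treated, control, caliper=None):
    """Greedy 1:1 nearest neighbor matching without replacement.

    treated, control: dicts mapping an id to a propensity score.
    caliper: optional tolerance; a pair is kept only if |P_i - P_j| < caliper.
    Returns a dict mapping each matched treated id to its control id.
    """
    available = dict(control)  # controls not yet used as a match
    matches = {}
    for t_id, t_ps in treated.items():
        if not available:
            break
        # Closest remaining control on the propensity score.
        c_id = min(available, key=lambda c: abs(available[c] - t_ps))
        if caliper is not None and abs(available[c_id] - t_ps) >= caliper:
            continue  # nearest control is outside the caliper: leave unmatched
        matches[t_id] = c_id
        del available[c_id]  # without replacement
    return matches

# Hypothetical propensity scores.
treated = {"t1": 0.80, "t2": 0.55, "t3": 0.35}
control = {"c1": 0.78, "c2": 0.50, "c3": 0.52, "c4": 0.05, "c5": 0.10}

print(nearest_neighbor_match(treated, control))
# {'t1': 'c1', 't2': 'c3', 't3': 'c2'}
print(nearest_neighbor_match(treated, control, caliper=0.1))
# {'t1': 'c1', 't2': 'c3'}
```

Without a caliper every treated unit receives a match, including t3, whose nearest remaining control is 0.15 away. With the illustrative caliper of 0.1, t3 is left unmatched. A data-driven caliper in the spirit of the 0.25 SD rule could be obtained from `caliper_width(list(treated.values()) + list(control.values()))`.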

1.2.2.3 Stratification Matching

Whereas the matching methods described above treat all individuals as one pool, the stratification matching method (or interval matching) categorizes both treatment and control groups based on their individual propensity scores, and then groups and matches individuals across k strata. For example, one stratum may contain the treated individuals i and control individuals j with the lowest propensity scores. Thus, in stratification matching, both the number of individuals and the individuals themselves differ from one stratum to another. Even though the number may vary, the average propensity score within each stratum should not differ systematically between the treatment and control groups. To confirm this, the balance of stratification matching should be examined with a standard t-test.

One question that must be answered when conducting stratification matching is how many strata to use. In most situations, researchers use quintiles to stratify individuals into 5 strata (Susanne, 2012). Five subclasses are often enough to remove 95% of the bias associated with a single covariate (Cochran & Chambers, 1965); because almost all bias is associated with the propensity score, under normality five strata remove most of the bias associated with all covariates. In an individual analysis, checking the balance of the covariates within the different strata can help researchers justify the number of strata:

    Most of the algorithms can be described in the following way: First, check if within a stratum the propensity score is balanced. If not, strata are too large and need to be split. If, conditional on the propensity score being balanced, the covariates are unbalanced, the specification of the propensity score is not adequate and has to be respecified. (Marco & Sabine, 2005)
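As a rough illustration of quintile stratification, the following Python sketch assigns each individual to one of five strata using quantile cut points of the propensity score and then reports the treated-minus-control difference in mean propensity score within each stratum. The function names, scores, and treatment flags are hypothetical, and a real balance check would use a t-test rather than a raw difference.

```python
from statistics import mean, quantiles

def assign_strata(scores, k=5):
    """Map each propensity score to a stratum index 0..k-1 via quantile cuts."""
    cuts = quantiles(scores, n=k)  # k - 1 cut points
    return [sum(score > cut for cut in cuts) for score in scores]

def stratum_balance(scores, treated, strata):
    """Treated-minus-control mean propensity score per stratum.

    Returns None for a stratum that lacks treated or control individuals,
    since no within-stratum comparison is possible there.
    """
    result = {}
    for s in sorted(set(strata)):
        t = [p for p, flag, g in zip(scores, treated, strata) if g == s and flag == 1]
        c = [p for p, flag, g in zip(scores, treated, strata) if g == s and flag == 0]
        result[s] = mean(t) - mean(c) if t and c else None
    return result

# Hypothetical propensity scores and treatment flags (1 = treated).
scores = [0.05, 0.15, 0.25, 0.35, 0.45, 0.55, 0.65, 0.75, 0.85, 0.95]
treated = [0, 0, 0, 1, 0, 1, 1, 0, 1, 1]

strata = assign_strata(scores)
print(strata)  # [0, 0, 1, 1, 2, 2, 3, 3, 4, 4]
print(stratum_balance(scores, treated, strata))
```

In this toy data the lowest stratum contains only control individuals and the highest only treated individuals, so neither yields a comparison, which is the situation in which a stratum contributes nothing to the overall estimate.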

After the stratification matching is performed and balance is satisfied, researchers calculate the overall impact as a weighted average of the treatment-control mean differences in the outcome variable within each stratum, with weights proportional to the number of treated individuals in each stratum. If the treated individuals in a particular stratum cannot be matched with any control individuals, those individuals are automatically omitted, because that stratum receives a weight of zero.

1.2.2.4 Full Matching

Full matching is a particular type of stratification matching that forms the subclasses in an optimal way (Rosenbaum, 2002; Hansen, 2004). In full matching, researchers group individuals into a series of non-overlapping matched sets, using all available individuals in the data. Each matched set contains either one treated individual and multiple control individuals, or one control individual and multiple treated individuals. To achieve this, researchers measure the distance $d_{ij} = |P_i - P_j|$ between a given individual i in the treatment group and a given individual j in the control group. Then, they divide the full sample into non-overlapping matched sets. Finally, they minimize the sum of the distances over all pairs of treated and control individuals within each matched set, across all matched sets:

    \sum_{i \in T} \sum_{j \in C} d_{ij}

2. Research Overview and Research Questions

In the present study, several propensity score matching techniques are examined using data from a simulated analysis to show the potential advantages and disadvantages of the different techniques, while also varying sample size. The most effective method should include as many individuals as possible in the treatment and control groups and should obtain statistical significance after matching, because a significant post-matching difference was simulated.

To achieve this goal, this thesis used the mvtnorm package in R to generate covariate data (Alan, et al., 2012). Propensity score matching was then performed with the MatchIt package in R (Ho, Stuart, Imai, & King, 2011), which implements Nearest Neighbor Matching (1:1, 2:1), Caliper Matching, Stratification Matching, and Full Matching. To analyze the matching results, an ANCOVA was conducted on the data before matching as a basis of comparison, and t-tests were used to compare the different matching methods. In an ideal situation, for the outcome variable, ANCOVA should not detect a significant difference in the data before matching, but after matching the t test should detect a significant difference, because the simulation was created for a situation in which individuals in treatment should have increased outcomes. Finally, the average treatment effect for the treated (ATT) was calculated for each propensity score matching method. Thus, as discussed above, the research questions are:

a. Which propensity matching method achieves the highest percent improvement?
b. Which propensity matching method achieves the most balanced covariates?
c. Are there differences in a) and b) by sample size?

For each simulation run, samples are randomly generated and assigned to treatment and control groups. Data are compared before and after matching to show the differences among matching methods and the percent improvement achieved by each.

3. Methods

3.1 Variables

In the present study, data were simulated using an example from education research assessing the effectiveness of an intervention in a quasi-experimental setting. For this simulated example, the dependent variable was the post-test (PosTe) score. The Individual Academic Program (IAP) serves as the dichotomous treatment variable, where a value of 1 indicates that the individual received an Individual Academic Program to improve their academic skills. Other variables included in this study were age (Age), gender (Gender), Socioeconomic Status (SES), Individual Study Strategy (ISS), Intelligence Quotient Test Score (IQTe), and Pre-test Score (PreTe). Each of these variables is described below.

For the treatment IAP assessed in this simulation study, each individual was assigned to either the treatment or the control group using an indicator related to all other variables; for details, see the R code in Appendix B.

Socioeconomic Status (SES) is a measure of an individual's social and economic position in relation to others. The higher the SES score, the more likely students are to have an Individual Study Strategy (representing selection bias) and to score higher on PreTe and PosTe. Socioeconomic status typically falls into three categories, high SES, middle SES, and low SES, which were coded in this study as 1, 0, and -1 respectively.

Individual Study Strategy (ISS) was a dichotomous variable showing whether a student has their own study strategy after school. A value of 1 indicated that the student has their own study strategy, while a value of 0 indicated no individual study strategy. Generally, students with an individual study strategy get higher test scores on PreTe and PosTe.

Age (Age), Gender (Gender), and Intelligence Quotient Test Score (IQTe) were also taken into consideration.

The correlation matrix used for the study was as follows:

            Age    Gender  SES    ISS    IQTe   PreTe
    Age     1
    Gender  0      1
    SES     0.3    0.2     1
    ISS     0.6    0.1     0.5    1
    IQTe    0.2    0.3     0.5    0.2    1
    PreTe   0.5    0.1     0.7    0.4    0.9    1

These values were based on presumed relationships in an example of a real-life setting. Age was relatively highly correlated with ISS and PreTe, because as Age increases, students develop their own ISS and, since they have more knowledge, achieve a higher PreTe score than younger students. Gender has lower correlations with the other variables. SES should have a relatively high correlation with PreTe, because students from high SES families have more access to resources, which is also reflected in IQTe.
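One common way to draw normally distributed covariates with a target correlation structure is to multiply independent standard normal draws by the Cholesky factor of the correlation matrix. The sketch below illustrates that idea in plain Python for the Age, IQTe, and PreTe entries of the matrix above. It is a stand-in for illustration only and is not the mvtnorm call used in the thesis; the helper names, the seed, and the number of draws are assumptions.

```python
import math
import random

def cholesky(matrix):
    """Lower-triangular L with L * L^T == matrix (requires positive definiteness)."""
    n = len(matrix)
    lower = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            partial = sum(lower[i][k] * lower[j][k] for k in range(j))
            if i == j:
                lower[i][j] = math.sqrt(matrix[i][i] - partial)
            else:
                lower[i][j] = (matrix[i][j] - partial) / lower[j][j]
    return lower

def draw_correlated(lower, rng):
    """One draw of standard normal variables correlated according to `lower`."""
    z = [rng.gauss(0.0, 1.0) for _ in lower]
    return [sum(lower[i][k] * z[k] for k in range(i + 1)) for i in range(len(lower))]

def pearson(xs, ys):
    """Sample Pearson correlation of two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Age, IQTe, PreTe entries copied from the correlation matrix in the text.
CORR = [
    [1.0, 0.2, 0.5],
    [0.2, 1.0, 0.9],
    [0.5, 0.9, 1.0],
]

rng = random.Random(2013)  # arbitrary seed, for reproducibility only
L = cholesky(CORR)
sample = [draw_correlated(L, rng) for _ in range(5000)]
iqte = [row[1] for row in sample]
prete = [row[2] for row in sample]

# Should land near the 0.9 target, up to sampling noise.
print(round(pearson(iqte, prete), 2))
```

Categorical variables such as SES and ISS would still need to be derived from the continuous draws, for example by cutting at thresholds; that step is not shown here.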

3.2 Data Generation

I generated the data using the mvtnorm package in R. I then defined an indicator to predict whether an individual should be assigned to the treatment group. This indicator correlates with all other variables through the vector:

            Indicator
    Age     0.4
    Gender  0.2
    SES     0.8
    ISS     0.6
    IQTe    0.9
    PreTe   0.6

This indicator, together with a random draw, was then used as a cue to assign each student to the treatment or control group, creating selection bias. An individual whose indicator value was greater than the 75th percentile of the indicator across all individuals had a higher probability of being assigned to the treatment group (IAP = 1); otherwise, the individual had a higher probability of being in the control group (IAP = 0).

After the assignment of IAP, the outcome variable PosTe was calculated by taking all variables into consideration, including a true effect for IAP and an error term. The correlation, before error, between PosTe and the other variables was:

            PosTe
    Age     0.5
    Gender  0.1
    SES     0.7
    ISS     0.4
    IQTe    0.9
    PreTe   0.5
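The assignment rule can be sketched as follows. The text states only that individuals above the 75th percentile of the indicator had a higher probability of treatment; the specific probabilities used here (0.8 above the cutoff, 0.1 below) and the function name are hypothetical choices for illustration, not values taken from the thesis.

```python
import random
from statistics import quantiles

def assign_treatment(indicator, p_above=0.8, p_below=0.1, seed=0):
    """Assign IAP = 1 with higher probability above the 75th percentile.

    p_above and p_below are assumed values for this sketch.
    Returns the list of 0/1 flags and the cutoff that was used.
    """
    rng = random.Random(seed)
    cutoff = quantiles(indicator, n=4)[2]  # third quartile = 75th percentile
    flags = []
    for value in indicator:
        p = p_above if value > cutoff else p_below
        flags.append(1 if rng.random() < p else 0)
    return flags, cutoff

# Hypothetical indicator values standing in for the covariate-based indicator.
rng = random.Random(1)
indicator = [rng.gauss(0.0, 1.0) for _ in range(2000)]
iap, cutoff = assign_treatment(indicator)

above = [t for v, t in zip(indicator, iap) if v > cutoff]
below = [t for v, t in zip(indicator, iap) if v <= cutoff]
# Share treated above the cutoff versus below it.
print(sum(above) / len(above), sum(below) / len(below))
```

Because treatment depends on an indicator that is itself built from the covariates, the treated and control groups differ systematically before any treatment effect is applied, which is the selection bias the matching methods are meant to remove.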

The true effect of IAP was calculated by multiplying IAP by a true effect index, which in this study is 0.8, representing a significant effect. As a result, the true effect of IAP is $0.8 \times IAP$. Finally, an error term was drawn from a normal distribution with mean 0 and standard deviation 0.5. The outcome variable PosTe was therefore calculated as:

    PosTe = \sum_k \alpha_k V_k + 0.8 \times IAP + error

where $\sum_k \alpha_k V_k$ is the linear combination of the other variables. The data were created to have means and standard deviations fitting the following pattern:

    M_{PreTe} < M_{PosTe}
    SD_{PreTe} > SD_{PosTe}

The difference (Diff) between the pre-test (PreTe) and the post-test (PosTe) was used to create the growth z-score (ZGrowth) for each individual based on their post-test after treatment. The mean and standard deviation were still calculated for PreTe and PosTe.

For each sample size, a small sample size (n = 1000), a medium sample size (n = 2500), and a large sample size (n = 5000), 1000 simulation runs were conducted. After all simulations, the average statistics were extracted with a user-defined function called mean.list(). In order to produce differences among the matching methods, I set the ratio of treatment to control group to 1:3, meaning 1/4 of Sample.Size was in the treatment group and 3/4 of Sample.Size was in the control group.
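The outcome equation and the growth z-score can be illustrated with a short Python sketch. The true effect (0.8) and the error standard deviation (0.5) come from the text; the covariate weights, the example student, and the use of the sample standard deviation for the z-score are assumptions made only for this example.

```python
import random
from statistics import mean, stdev

TRUE_EFFECT = 0.8  # true effect index stated in the text
ERROR_SD = 0.5     # standard deviation of the error term stated in the text

def post_test(covariates, weights, iap, rng, error_sd=ERROR_SD):
    """PosTe = sum_k alpha_k * V_k + 0.8 * IAP + error."""
    linear_part = sum(weights[name] * covariates[name] for name in weights)
    return linear_part + TRUE_EFFECT * iap + rng.gauss(0.0, error_sd)

def z_growth(pre, post):
    """Standardize Diff = PosTe - PreTe (sample SD is an assumption here)."""
    diff = [b - a for a, b in zip(pre, post)]
    m, s = mean(diff), stdev(diff)
    return [(d - m) / s for d in diff]

# Hypothetical weights and one hypothetical student.
weights = {"IQTe": 0.6, "PreTe": 0.3, "SES": 0.1}
student = {"IQTe": 1.2, "PreTe": 0.4, "SES": 1.0}

rng = random.Random(7)  # arbitrary seed
print(post_test(student, weights, iap=1, rng=rng))
print(post_test(student, weights, iap=0, rng=rng))
print(z_growth([1, 2, 3], [2, 4, 6]))  # [-1.0, 0.0, 1.0]
```

With the error term switched off (`error_sd=0`), the same student's outcome under IAP = 1 exceeds the outcome under IAP = 0 by exactly the true effect, which is the quantity the matching methods should recover on average.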

3.3 Propensity Score Matching

After all the data were generated, matching was conducted with each of the five propensity score matching methods, Nearest Neighbor Matching (1:1 and 2:1), Caliper Matching, Stratification Matching, and Full Matching, using the MatchIt package in R.

3.4 Analysis of Covariance for Data Before Matching

For each generated dataset, an ANCOVA was conducted before applying the matching methods to see whether there was a significant difference in the outcome variable of the generated data. In this particular research setting, the ANCOVA results should not be significant. The ANCOVA model was:

$$\mathrm{PreTe}_i = \mu + \mathrm{IAP}_i + \alpha_i V_i + \mathrm{error}_i$$

where $\mu$ is the overall mean, $\mathrm{IAP}_i$ is the treatment effect, and $\alpha_i V_i$ is the linear regression for the other variables.

For the data matched under each method, an independent-samples t test was conducted on the z-score growth (ZGrowth) to determine whether the matching result was statistically significant. The ANCOVA served as a basis of comparison for the implemented propensity score methods. Then, the average effect of the treatment on the treated (ATT) was calculated for each propensity score matching method. Finally, I calculated the percentage of bias reduced to show the balance improvement achieved by each method, by subtracting the standardized mean difference of the matched data from that of the unmatched data and dividing the result by the standardized mean difference of the unmatched data:

$$RB = \frac{MD_{\mathrm{unmatched}} - MD_{\mathrm{matched}}}{MD_{\mathrm{unmatched}}} \times 100\%$$

where RB represents the reduced bias and MD represents the mean difference. Together, the ANCOVA, the t tests, and the confidence intervals made it possible to examine whether any of the propensity score matching methods differed substantially from the others, and thus to select the best method.

3.5 Selecting the Appropriate Method

Each analysis was simulated 1000 times and the results were synthesized. Based on the results of the ANCOVA and the independent t test, the appropriate propensity score matching method could be determined by examining the sample size, the average mean difference, the balance improvement on covariates, and the percent improvement. An appropriate propensity score matching method should yield a high balance improvement on covariates and a high percent reduction in selection bias. Following Rubin's (2001) suggestion, I examined whether the mean difference for each propensity score matching method was less than .25. If it was greater than .25 under a given sample size, the method was not considered appropriate. Then, I used the percent improvement of the matched data to examine whether any of the methods substantially improved the data. The methods that reduced the largest percentage of bias were considered the best.

To present the results, I first describe the simulation output, including descriptive characteristics of the variables in the datasets, the results of the ANCOVA on the unmatched data, and the matching results for each matching method. I then explain the results of the independent t test in detail for the different matching methods. The entire code for this study is provided in Appendix D.

4. Results

4.1 Understand the Generated Data

It is important to first understand the generated sample. Table 1 shows the standardized summary of the generated data with sample size 1000. The variable Age had a mean of 18 and a standard deviation of 2. According to Table 1, the youngest individual was 12 years old and the oldest was 24 years old. For the binary variable Gender, a positive value (Gender ≥ 0) represents male, while a negative value (Gender < 0) represents female. The mean of Gender was 0.11 and the median was 0.09, showing a slight difference between the numbers of males and females. The variable SES was separated into three categories: 1/3 of the sample with Low SES, 1/3 with Medium SES, and the remaining 1/3 with High SES. Like Gender, the variable ISS was treated as dichotomous: a positive value (ISS ≥ 0) represents students who had an Individual Study Strategy, and a negative value (ISS < 0) represents students who did not. From Table 1, the ratio of students with ISS to students without ISS was near 1:1. The variables IQTe, PreTe, and PosTe were all normally distributed. PosTe had a higher mean and a narrower range than PreTe, because the data were generated to replicate an educational setting where people prefer not only a higher mean score but also a narrower standard deviation. Diff was calculated by the formula:

$$\mathrm{Diff} = \mathrm{PosTe} - \mathrm{PreTe}$$

ZGrowth was calculated based on the variable Diff.

At baseline, the generated dataset contains six subcategories:

    High SES students with ISS
    High SES students without ISS
    Medium SES students with ISS
    Medium SES students without ISS
    Low SES students with ISS
    Low SES students without ISS

The average percentage of the 1000-individual sample falling in each category, across all 1000 simulation runs, is shown in Table 2.

4.2 Reducing Distance Using Propensity Scores

After generating the data, propensity scores were estimated based on Age, Gender, SES, ISS, IQTe, and PreTe using the following model:

    IAP ~ Age + Gender + SES + ISS + IQTe + PreTe

Before matching, the standardized mean difference is 0.011 (Table 3), a large distance that should result in high selection bias. Propensity score matching methods are intended to reduce this mean difference as much as possible; the more it is reduced, the better the matching method. After applying propensity score matching, the standardized mean difference decreases to near 0 for Nearest Neighbor (1:1), Nearest Neighbor (2:1), Caliper, and Full matching. Caliper matching restricts the variability of the propensity score because, unlike Nearest Neighbor (1:1) and Nearest Neighbor (2:1), it limits the distance between the estimated propensity scores of matched individuals. That is why the standard deviation of the control group is smaller for caliper matching than for all other methods. Table 3 provides the average standardized mean difference in propensity score between the treatment and control groups across all simulation runs, for the unmatched data and for the data matched by each method. According to Table 3, all matching methods other than Stratification Matching reduced the standardized mean difference relative to the unmatched data.

4.3 Results for Different Matching Methods

I applied nearest neighbor (1:1), nearest neighbor (2:1), caliper with 0.25 SD, stratification with 5 subclasses, and full matching to the generated data to obtain treated individuals better balanced with control individuals. Table 4 and Figure 2 show the average matching results for the different methods. Table 4 shows that, for the generated data with sample size 1000, there are on average 690 individuals in the control group and 310 in the treatment group. For nearest neighbor (1:1) and nearest neighbor (2:1), all treated individuals were matched with control individuals; this gives 620 individuals involved in matching for nearest neighbor (1:1) and 930 for nearest neighbor (2:1) (Table 4). For caliper matching, not all individuals in the treatment group were matched with the control group, because the caliper imposes a maximum distance. With this method, 510 individuals are involved in matching.
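To make the difference between nearest neighbor and caliper matching concrete, the following is a simplified sketch of greedy nearest-neighbor matching on a propensity score, written in Python. It is an illustration under stated assumptions, not MatchIt's implementation: the function name `greedy_match`, the processing order of treated units, and the use of the pooled standard deviation of all scores to scale the caliper are choices made here for illustration, and the propensity scores are made-up values.

```python
import statistics


def greedy_match(treated, control, ratio=1, caliper_sd=None):
    """Greedy nearest-neighbor matching on a propensity score, without replacement.

    treated, control: dicts mapping a unit id to its propensity score.
    ratio: number of controls to seek per treated unit.
    caliper_sd: if set, reject pairs farther apart than caliper_sd times the
        standard deviation of all propensity scores (an assumption of this sketch).
    Returns a dict: treated id -> list of matched control ids.
    """
    limit = None
    if caliper_sd is not None:
        all_scores = list(treated.values()) + list(control.values())
        limit = caliper_sd * statistics.pstdev(all_scores)
    available = dict(control)
    matches = {}
    for _ in range(ratio):  # one round per requested control
        for t_id, t_score in treated.items():
            if not available:
                break
            c_id = min(available, key=lambda c: abs(available[c] - t_score))
            if limit is not None and abs(available[c_id] - t_score) > limit:
                continue  # nearest control lies outside the caliper
            matches.setdefault(t_id, []).append(c_id)
            del available[c_id]
    return matches


# Hypothetical propensity scores
treated = {"t1": 0.30, "t2": 0.35, "t3": 0.90}
control = {"c1": 0.31, "c2": 0.38, "c3": 0.20, "c4": 0.45, "c5": 0.60}

one_to_one = greedy_match(treated, control)
two_to_one = greedy_match(treated, control, ratio=2)
with_caliper = greedy_match(treated, control, caliper_sd=0.25)
```

In this toy example every treated unit finds a partner under plain 1:1 matching, but `t3` has no control within 0.25 SD and is therefore left unmatched once the caliper is applied. This mirrors the pattern in Table 4, where only caliper matching leaves treated individuals unmatched.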

For Stratification and Full matching, all individuals are included in the matching. It is important to note how many individuals are involved in matching, because the more individuals involved, the less bias. Finally, no individuals were discarded by any of the matching methods, meaning that all individuals in the treatment and control groups could be used in matching in this particular study. In a real data analysis, however, the number of discarded individuals will not necessarily be zero.

Figure 3 is a jitter plot of the distribution of propensity scores for each matching method, showing how individuals in the treatment and control groups were matched with each other. Nearest neighbor (1:1), nearest neighbor (2:1), and caliper matching leave some control individuals unmatched, and as a result the extreme values in the groups are eliminated. Because not all individuals are involved in matching, these methods carry more bias but are more precise. For stratification and full matching, all individuals were included, so the results have less bias but lower precision. It is very important to keep this trade-off between bias and precision in mind, because it is an essential factor in selecting the most appropriate matching method.

Figure 4 shows histograms of the propensity score for matched and unmatched individuals in both the treatment and control groups. The distributions of the propensity score for the treatment group are identical in the matched and unmatched data for all matching methods except caliper matching, because all individuals in the treatment group are involved in those methods. For caliper matching, the distribution differs slightly between the matched and unmatched data, because not all individuals are involved in matching. Another important point is that the distributions of the propensity score for the control group are identical in the matched and unmatched data for stratification and full matching, because all individuals are used in those methods. For nearest neighbor (1:1), nearest neighbor (2:1), and caliper matching, the distributions differ.

To select the best matching method, I examined whether the standardized mean difference for each individual covariate and for the propensity score was balanced. Table 5 shows the average standardized mean difference for each of the covariates and the propensity score for all matching methods. According to Table 5, before matching, the mean difference between the treatment and control groups is 0.0063 under sample size 1000; after nearest neighbor (1:1) matching, the mean difference is only 0.0002. Clearly, there is more bias without propensity score matching. On average across the 1000 simulation runs, the percent improvement is 96%, 50%, 100%, 86%, and 98% for Nearest Neighbor (1:1), Nearest Neighbor (2:1), Caliper, Stratification, and Full matching, respectively. Thus, the order from best to worst is:

    Caliper > Full > Nearest Neighbor (1:1) > Stratification > Nearest Neighbor (2:1)

Before determining the best method, the statistical significance of the difference must be examined, so I ran an independent-samples t test.
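The percent improvement figures follow the reduced-bias formula RB = (MD unmatched − MD matched) / MD unmatched × 100%. As a worked illustration, the Python sketch below applies it to the rounded averages reported for nearest neighbor (1:1) at n = 1000. The function name is hypothetical, and taking absolute values of the mean differences is an assumption added here so that a sign change in the matched difference is not counted as an improvement beyond 100%.

```python
def percent_bias_reduction(md_unmatched, md_matched):
    """Percent reduction in the absolute mean difference after matching."""
    if md_unmatched == 0:
        raise ValueError("unmatched mean difference must be nonzero")
    # Absolute values are an assumption of this sketch, not part of the stated formula
    return (abs(md_unmatched) - abs(md_matched)) / abs(md_unmatched) * 100


# Rounded average mean differences for nearest neighbor (1:1), n = 1000
rb = percent_bias_reduction(0.0063, 0.0002)
print(round(rb, 2))  # 96.83
```

The result, 96.83%, is close to but not identical with the 96.32% shown in Table 5. Presumably the table averages the per-run percentages over the 1000 simulation runs, which need not equal the percentage computed from the rounded average mean differences.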

4.4 Results for Independent t Test

For the matched data, an independent t test was conducted to help decide whether there is a significant mean difference between the data before matching and the data after matching. Table 6 provides the results of the independent t test across all simulation runs. For each matching method, the degrees of freedom are greater than 120, and for a two-tailed t test the critical t value is 1.980. The independent t test shows a statistically significant difference between the data before and after matching. These results are reasonable, since propensity score matching methods aim to decrease selection bias.

4.5 Result for ANCOVA on Data Before Matching

Before matching, an analysis of covariance (ANCOVA) was conducted to see whether there is a significant difference between the means of the treatment and control groups:

$$\mathrm{PreTe}_i = \mu + \mathrm{IAP}_i + \alpha_i V_i + \mathrm{error}_i$$

where $\mu$ is the overall mean, $\mathrm{IAP}_i$ is the treatment effect, and $\alpha_i V_i$ is the linear regression for the other variables. The ANCOVA results across all simulation runs are shown in Table 7. According to Table 7, the ANCOVA failed to detect a significant difference in the data before matching, which differs from the t test results. This is reasonable, since after applying propensity score matching, a significant difference between treated and control individuals on the outcome variable is expected.
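The comparison of a t value against the critical value of 1.980 can be illustrated with a small sketch. The Python code below computes an independent two-sample t statistic with pooled variance and checks it against that threshold. The two samples are hypothetical values generated for illustration (they are not ZGrowth data from the study), and the pooled-variance form is an assumption, since the text does not state which variant of the t test was used.

```python
import math
import random
import statistics


def pooled_t(sample_a, sample_b):
    """Return (t, df) for an independent two-sample t test with pooled variance."""
    n_a, n_b = len(sample_a), len(sample_b)
    mean_a, mean_b = statistics.mean(sample_a), statistics.mean(sample_b)
    var_a, var_b = statistics.variance(sample_a), statistics.variance(sample_b)
    df = n_a + n_b - 2
    pooled_var = ((n_a - 1) * var_a + (n_b - 1) * var_b) / df
    std_err = math.sqrt(pooled_var * (1 / n_a + 1 / n_b))
    return (mean_a - mean_b) / std_err, df


CRITICAL_T = 1.980  # two-tailed critical value quoted in the text (alpha = 0.05)

# Hypothetical outcome values for two groups of 100, so that df exceeds 120
rng = random.Random(1)
group_a = [rng.gauss(0.8, 0.5) for _ in range(100)]
group_b = [rng.gauss(0.0, 0.5) for _ in range(100)]

t_value, df = pooled_t(group_a, group_b)
significant = abs(t_value) > CRITICAL_T
```

With 198 degrees of freedom the exact critical value is slightly below 1.980, so using 1.980 as the threshold is marginally conservative.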

It should also be pointed out that, as sample size increases, the percent improvement increases for all propensity score matching methods other than Full matching, and the mean difference decreases. More individuals in the treatment and control groups result in better matching.

5. Conclusion

In this study, I set out to select the best and most appropriate propensity score matching method among several methods under different sample sizes, and to determine whether any of those methods improves balance substantially. Table 8 provides a comparison of percent improvement, mean difference, and t-value for all matching methods under different sample sizes. From Table 8, it is clear that all of the matching methods achieve statistical significance after matching under every sample size. The rankings are:

    Percent improvement: Caliper > Full > Nearest Neighbor (1:1) > Stratification > Nearest Neighbor (2:1)
    Mean difference: Stratification > Nearest Neighbor (2:1) > Nearest Neighbor (1:1) > Full ≥ Caliper

It is essential to notice that as the sample size increases, the percent improvement for each propensity score matching method increases while the mean difference decreases, except for full matching. This is reasonable because, as sample size increases, the number of individuals in both the treatment and control groups increases, and the more individuals are involved in matching, the lower the bias and the higher the precision. On the other hand, the percent improvement for full matching decreases as sample size increases. Future study is needed to determine why this happens.

According to my study, caliper matching yields the highest percent improvement under all given sample sizes and the most balanced covariates. Sample size does not change the ranking of the propensity score matching methods.

In conclusion, first, propensity score matching methods do reduce selection bias and create comparable matched groups. Next, this study suggests that the larger the sample size, the more accurate the matching results will be. I therefore recommend propensity score matching as a practical approach to matching, especially caliper matching and full matching: caliper matching was the most precise method in this study and reduced the most bias, while full matching used the most individuals. If the sample size is too small for caliper matching, full matching is highly recommended.

6. Discussion

In educational research, experimental designs with random sampling and random assignment provide the strongest evidence for making causal inferences. However, randomization is impractical in many real settings. Research in educational settings is therefore often based on observational data or quasi-experiments. Propensity score matching methods can serve as a useful tool for educational researchers to draw causal inferences without randomly assigned data.

In my simulation, all of the considered propensity score matching methods worked well across the different sample sizes and resulted in an improved and balanced set of matched data. This balance is supported by both a nonsignificant ANCOVA result on the covariate data before matching and a significant t test result on the data after matching. My study suggests that caliper matching performs best in minimizing the propensity score distance under small (n = 1000), medium (n = 2500), and large (n = 5000) sample sizes, but it always involves the fewest individuals, which increases its bias. Additionally, Rubin and Rosenbaum's (1985) suggestion to use 0.25 SD as the caliper was followed. However, later researchers argue that Rubin and Rosenbaum arrived at 0.25 SD by considering matching on the Mahalanobis distance, which is not directly related to the propensity score. Pawel (2011) slightly modified the caliper mechanism to link the choice of caliper to the estimated propensity scores:

$$C(P_i) = \min_j \lvert P_i - P_j \rvert < \delta P_i \;\Rightarrow\; \delta > \min_j \frac{\lvert P_i - P_j \rvert}{P_i}$$

As Pawel claimed, directly linking the caliper to the estimated propensity score yields better matches from the control group. Future study is needed to replace 0.25 SD with this modified mechanism and see whether it results in better matching.

Another limitation of this study concerns the correlation matrix. I defined a reasonable correlation matrix based on experience and knowledge. In a real data analysis, if the correlation matrix is not the same as or similar to my generated correlation matrix, the results for the different matching methods may vary. Future studies should derive correlation matrices from existing educational datasets and possibly vary the relationships in the data to cover a wider array of potential settings.

When applying matching methods, the ideal result would be to match as many control individuals as possible with treated individuals on all relevant covariates. In practical settings, however, this goal is difficult to achieve, because achieving it reduces the sample size substantially. Selecting a proper propensity score matching method provides a beneficial compromise. I had hypothesized the following ranking of the propensity score matching methods:

    Caliper > Nearest Neighbor (1:1) > Full > Stratification > Nearest Neighbor (2:1)

However, the ranking that resulted from this study is:

    Caliper > Full > Nearest Neighbor (1:1) > Stratification > Nearest Neighbor (2:1)

Finally, matching on the propensity score results in very similar groups, while the unmatched data form very dissimilar groups. To examine dissimilarity, I conducted an analysis of covariance (ANCOVA) on the data before matching. The ANCOVA approach assumes that the groups involved are similar on all covariates before matching, or differ only on irrelevant variables. In this particular study, the covariate distributions were similar before and after propensity score matching. With all propensity score matching methods, a significant difference in the outcome variable was detected, while the ANCOVA failed to detect one. These simulation results suggest that propensity score matching methods achieve more accurate results than ANCOVA.

Besides these findings, as in many studies, some questions remain. It would be of interest to compare these propensity score matching methods with other existing methods, such as weighting adjustments, Mahalanobis metric matching, and optimal matching. Even within propensity score matching, it is important to examine how much covariate balance matters across the methods. Such future studies could provide more evidence about propensity score matching methods and about how the different methods balance individual covariates. It should also be explored why the percent improvement for full matching decreases as sample size increases.

7. References

Abadie, A., & Imbens, G. (2006). Large Sample Properties of Matching Estimators for Average Treatment Effects. Econometrica, 74, 235-267.
Alan, G., Frank, B., Tetsuhisa, M., Xuefei, M., L, F., Fabian, S., et al. (2012, 12 10). mvtnorm: Multivariate Normal and t Distributions. Retrieved 1 19, 2013, from The R Project for Statistical Computing: http://cran.r-project.org/web/packages/mvtnorm/index.html
Alberto, A., & Guido, W. I. (2009). Matching on the Estimated Propensity Score. NBER Working Paper Series No. 15301, 99-112.
Althauser, R., & Rubin, B. D. (1970). The Computerized Construction of a Matched Sample. American Journal of Sociology, 76, 325-346.
Austin, P. C., Grootendorst, P., & Anderson, G. M. (2007). A Comparison of the Ability of Different Propensity Score Models to Balance Measured Variables Between Treated and Untreated Subjects: A Monte Carlo Study. Statistics in Medicine, 26, 734-753.
Bagley, S. C., White, H., & Golomb, B. A. (2001). Logistic Regression in the Medical Literature: Standards for Use and Reporting, with Particular Attention on the Medical Domain. Journal of Clinical Epidemiology, 54, 979-985.
Brookhart, M. A., Schneeweiss, S., Rothman, K. J., Glynn, R. J., Avorn, J., & Sturmer, T. (2006). Variable Selection for Propensity Score Models. American Journal of Epidemiology, 163 (12), 1149-1156.
Caliendo, M., & Kopeinig, S. (2005). Some Practical Guidance for the Implementation of Propensity Score Matching.
Christakis, N. A., & Iwashyna, T. I. (2003). The Health Impact of Health Care on Families: A matched cohort study of hospice use by decedents and mortality outcomes in surviving, widowed spouses. Social Science and Medicine, 57 (3), 465-475.
Cochran, W., & Rubin, D. B. (1973). Controlling Bias in Observational Studies. Sankhya, 35, 417-446.
Cochrane, W., & Chambers, S. (1965). The Planning of Observational Studies of Human Population. Journal of the Royal Statistical Society, 128, 234-266.
Couper, M. P. (2000). Web Surveys: A Review of Issues and Approaches. Public Opinion Quarterly, 64, 464-494.
D'Agostino, R. B., Jr. (1998). Tutorial in Biostatistics: Propensity Score Methods for Bias Reduction in Comparison of a Treatment to a Non-randomized Control Group. Statistics in Medicine, 17, 2265-2281.
Dehejia, R., & Wahba, S. (1999). Causal Effects in Non-experimental Studies: Reevaluation of the Evaluation of Training Programs. Journal of the American Statistical Association, 94, 1043-1062.
Dehejia, R., & Wahba, S. (2002). Propensity Score Matching Methods for Nonexperimental Causal Studies. Review of Economics and Statistics, 84, 151-164.
Fan, X., & Nowell, D. L. (2011). Using Propensity Score Matching in Educational Research. Gifted Child Quarterly, 55-74.
Feng, J., & Kai, X. (1983). A Comparison of Propensity Score Methods for Evaluating the Effects of Programs with Multiple Versions.
Gu, X., & Rosenbaum, P. R. (1993). Comparison of Multivariate Matching Methods: Structures, Distances, and Algorithms. Journal of Computational and Graphical Statistics, 2, 405-420.
Hansen, B. B. (2004). Full Matching in an Observational Study of Coaching for the SAT. Journal of the American Statistical Association, 99, 609-618.
Heckman, J. (1979). Sample Selection Bias as a Specification Error. Econometrica, 47, 153-161.
Heckman, J. J., Ichimura, H., Smith, J., & Todd, P. (1998). Characterizing Selection Bias Using Experimental Data. Econometrica, 66 (5), 1017-1098.
Herron, M., & Wand, J. (2007). Assessing Partisan Bias in Voting Technology: The Case of the 2004 New Hampshire Recount. Electoral Studies, 26 (2), 247-261.
Hirano, K., & Imbens, G. W. (2001). Estimation of Causal Effects Using Propensity Score Weighting: An Application to Data on Right Heart Catheterization. Health Services and Outcomes Research Methodology, 2, 259-278.
Ho, D., Stuart, E., Imai, K., & King, G. (2011, 10 24). MatchIt: MatchIt. Retrieved 1 19, 2013, from The R Project for Statistical Computing: http://cran.r-project.org/web/packages/matchit/index.html
Imbens, G. W. (2000). The role of the propensity score in estimating dose-response functions. Biometrika, 87, 706-710.
Jasjeet, S. S. (2011). Multivariate and Propensity Score Matching Software with Automated Balance Optimization: The Matching Package for R. Journal of Statistical Software, 42 (7).
LaLonde, R. J. (1986). Evaluating the econometric evaluations of training programs. American Economic Review, 76, 604-620.
Marco, C., & Sabine, K. (2005). Some Practical Guidance for the Implementation of Propensity Score Matching. Discussion Paper Series.
Masafumi, F. (2011). Effects of Variables in a Response Propensity Score Model for Survey Data Adjustment: A Simulation Study. Behaviormetrika, 38 (1), 33-61.
Morgan, S. L., & Harding, D. J. (2006). Matching Estimators of Causal Effects: Prospects and Pitfalls in Theory and Practice. Sociological Methods and Research, 35 (1), 3-60.
Needleman, S., & Wunsch, C. (1970). A general method applicable to the search for similarities in the amino acid sequence of two proteins. Journal of Molecular Biology, 48 (3), 443-453.
Onur, B. (2006). Too Much Ado about Propensity Score Models? Comparing Methods of Propensity Score Matching. International Society for Pharmacoeconomics and Outcomes Research, 377-385.
Parsons, L. S. (2001). Reducing bias in a propensity score matched-pair sample using greedy matching techniques. In SAS SUGI 26, 214-226.
Pawel, S. (2011). Dynamic Caliper Matching. Central European Journal of Economic Modeling and Econometrics, 97-110.
Perkins, S. M., Tu, W., Underhill, M. G., Zhou, X. H., & Murray, M. D. (2000). The use of propensity scores in pharmacoepidemiological research. Pharmacoepidemiology and Drug Safety, 9, 93-101.
Rosenbaum, P. R. (1991). A characterization of optimal designs for observational studies. Journal of the Royal Statistical Society, 53 (3), 597-610.
Rosenbaum, P. R. (2002). Observational Studies, 2nd Edition. New York, NY, United States: Springer Verlag.
Rubin, D. B. (1973). Matching to Remove Bias in Observational Studies. Biometrics, 29, 159-184.
Rubin, D. B. (1974). Estimating Causal Effects of Treatment in Randomized and Nonrandomized Studies. Journal of Educational Psychology, 66 (5), 689.
Rubin, D. B. (1977). Assignment to a Treatment Group on the Basis of a Covariate. Journal of Educational Statistics, 2, 1-26.
Rubin, D. B. (1997). Estimating causal effects from large data sets using propensity scores. Annals of Internal Medicine, 127, 757-763.
Rubin, D. B. (2001). Using Propensity Scores to Help Design Observational Studies: Application to the Tobacco Litigation. Health Services and Outcome Research Methodology, 2 (1), 169-188.
Rubin, D. B. (2006). Matched Sampling for Causal Effects. New York: Cambridge University Press.
Rubin, D. B., & Rosenbaum, P. R. (1983). The central role of the propensity score in observational studies for causal effects. Biometrika, 70, 41-55.
Rubin, D. B., & Rosenbaum, P. R. (1984). Reducing bias in observational studies using sub-classification on the propensity score. Journal of the American Statistical Association, 79, 516-524.
Rubin, D. B., & Rosenbaum, P. R. (1985). Constructing Control Group Using Multivariate Matching Sampling Methods that Incorporate Propensity Score. The American Statistician, 39 (1), 33-38.
Smith, J., & Todd, P. (2005). Does Matching Overcome LaLonde's Critique of Nonexperimental Estimators? Journal of Econometrics, 125 (1-2), 305-353.
Susanne, S. (2012, October 29). Propensity Score Based Data Analysis. Retrieved Feb 19, 2013, from http://cran.r-project.org/web/packages/nonrandom/vignettes/nonrandom.pdf
Thomas, N., & Rubin, D. B. (1996). Matching Using Estimated Propensity Scores: Relating Theory to Practice. Biometrics, 52 (1), 249-264.
Victor, M. (2011). What Is Selection and Endogeneity Bias and How Can We Address It? University of Washington, Seattle.
York, R. O. (1998). Conducting Social Work Research. Boston: Allyn and Bacon.

Figure 1: Different Propensity Matching Methods Considered In This Paper
[Diagram: Propensity Score Matching Methods, branching into Nearest Neighbor Matching (1:1), Nearest Neighbor Matching (2:1), Caliper Matching, Stratified Matching, and Full Matching]

Appendix A1: Results for 1000 Sample Size

Table 1: Standardized Summary of Generated Data with 1000 Sample Size

            Age      Gender   SES      ISS      IQTe
Min.       -0.35    -0.304   -0.33    -0.298   -0.3
1st Qu.    -0.07    -0.069   -0.07    -0.07    -0.07
Median      0        0.002   -0.01    -0.004    0
Mean        0        0        0       -0.003    0
3rd Qu.     0.07     0.066    0.07     0.063    0.07
Max.        0.34     0.276    0.4      0.313    0.33

            PreTe    IAP      PosTe    Diff     ZGrowth
Min.       -0.3      0       -1.04    -0.74    -0.56
1st Qu.    -0.07     0       -0.1     -0.06    -0.18
Median      0        0        0.16     0.12    -0.07
Mean        0        0.31     0.25     0.25     0
3rd Qu.     0.07     1        0.62     0.65     0.22
Max.        0.33     1        1.62     1.39     0.64

Table 2: Average Proportions for SES and ISS

              With ISS   Without ISS
High SES      0.103      0.079
Medium SES    0.324      0.271
Low SES       0.086      0.137

Table 3: Standardized Mean Difference on Propensity Score with 1000 Sample Size Data after applying matching methods across all simulation runs

                          Means Treated   Means Control   SD Control   Mean Diff
Before Matching           0.318           0.307           0.046        0.011
Nearest Neighbor (1:1)    0.318           0.317           0.045        0.007
Nearest Neighbor (2:1)    0.318           0.311           0.044        0.007
Caliper Matching          0.312           0.312           0.038        0.006
Stratification            0.318           0.307                        0.011
Full Matching             0.318           0.318                        0.007

Table 4: Matching Results for Different Matching Methods under 1000 Sample Size

Methods                             Status      Control   Treated
Nearest Neighbor (1:1)              All         690       310
                                    Matched     310       310
                                    Unmatched   380       0
                                    Discarded   0         0
Nearest Neighbor (2:1)              All         690       310
                                    Matched     620       310
                                    Unmatched   70        0
                                    Discarded   0         0
Caliper with 0.25 SD                All         690       310
                                    Matched     255       255
                                    Unmatched   435       55
                                    Discarded   0         0
Stratification with 5 Subclasses    All         690       310
                                    Matched     690       310
                                    Unmatched   0         0
                                    Discarded   0         0
Full                                All         690       310
                                    Matched     690       310
                                    Unmatched   0         0
                                    Discarded   0         0

Table 5: Average Standardized Mean Difference for covariates and propensity score of different matching methods under different sample size.

                                    Sample   Before Matching                After Matching                 Percent
Methods                             Size     Treated   Control   Mean Diff  Treated   Control   Mean Diff  Improvement
Nearest Neighbor (1:1)              1000     0.314     0.3077    0.0063     0.314     0.3138    0.0002     96.3248
                                    2500     0.309     0.3067    0.0023     0.309     0.309     0.0001     97.8767
                                    5000     0.3112    0.31      0.0013     0.3112    0.3112    0          98.3897
Nearest Neighbor (2:1)              1000     0.314     0.3077    0.0063     0.314     0.3111    0.0029     49.1766
                                    2500     0.309     0.3067    0.0023     0.309     0.3083    0.0007     69.4695
                                    5000     0.3112    0.31      0.0013     0.3112    0.3109    0.0003     72.5936
Caliper with 0.25 SD                1000     0.314     0.3077    0.0063     0.311     0.311     0.0001     99.3839
                                    2500     0.309     0.3067    0.0023     0.3081    0.3081    0.0001     99.3514
                                    5000     0.3112    0.31      0.0013     0.3108    0.3108    0          99.3515
Stratification with 5 Subclasses    1000     0.314     0.3077    0.0063     0.314     0.3077    0.0063     86.3333
                                    2500     0.309     0.3067    0.0023     0.309     0.3067    0.0023     86.5977
                                    5000     0.3112    0.31      0.0013     0.3112    0.31      0.0013     87.0667
Full                                1000     0.314     0.3077    0.0063     0.314     0.314     -0.0001    98.9063
                                    2500     0.309     0.3067    0.0023     0.309     0.309     0.0001     98.4755
                                    5000     0.3112    0.31      0.0013     0.3112    0.3112    0          96.413

(Treated and Control columns are group means.)

Table 6: Results for Independent t test for different matching methods with 1000 sample size.

                 Nearest Neighbor (1:1)   Nearest Neighbor (2:1)   Caliper   Stratification   Full
t-value          5.836                    6.874                    5.481     6.934            6.934
p-value          0.0151                   0.0105                   0.0144    0.0117           0.0117
α-level          0.05                     0.05                     0.05      0.05             0.05
95% CI Upper     0.5996                   0.5806                   0.6213    0.5746           0.5746
95% CI Lower     0.2974                   0.3222                   0.2924    0.3205           0.3205

H0: The difference between the means is 0, μ(Before Matching) = μ(After Matching)
H1: The difference between the means is not 0, μ(Before Matching) ≠ μ(After Matching)

Table 7: Overall ANCOVA Results with 1000 sample size.

             Df     Sum Sq     Mean Sq    F value    Pr(>F)
Age          1      246.647    246.647    690.971    5.42E-86
Gender       1      79.029     79.029     221.778    9.61E-30
SES          1      237.394    237.394    664.517    8.60E-95
ISS          1      21.01      21.011     58.784     3.35E-10
IQTe         1      58.454     58.454     163.707    8.96E-25
PreTe        1      0.585      0.585      1.626      0.4
Residuals    993    355.881    0.358

Table 8: Combined Overall Comparison for Propensity Score Matching Methods under Different Sample Size.

Methods                             Sample Size   Mean Diff   Percent Improvement   t-value
Nearest Neighbor (1:1)              1000          0.0067      96.3248               5.836
                                    2500          0.0002      97.8767               9.1417
                                    5000          0           98.3897               13.1287
Nearest Neighbor (2:1)              1000          0.0068      49.1766               6.874
                                    2500          0.0007      69.4695               10.869
                                    5000          0.0003      72.5936               15.3948
Caliper with 0.25 SD                1000          0.0059      99.3839               5.481
                                    2500          0.0001      99.3514               8.7813
                                    5000          0           99.3515               12.7886
Stratification with 5 Subclasses    1000          0.0109      86.3333               6.934
                                    2500          0.0023      86.5977               11.076
                                    5000          0.0012      87.0667               15.5441
Full                                1000          0.0072      98.9063               6.934
                                    2500          0.0002      98.4755               11.076
                                    5000          0           96.413                15.5441

Table 9: Average Standardized Mean Difference for covariates and propensity score of different matching methods with 1000 sample size.

                Before Matching            After Matching             Percent
Variable    Treated  Control     Diff  Treated  Control     Diff  Improvement

Nearest Neighbor (1:1)
distance      0.314   0.3077   0.0063    0.314   0.3138   0.0002      96.3248
Age         -0.0089  -0.0064  -0.0025  -0.0089  -0.0073  -0.0016     -80.8081
Gender       0.0012  -0.0018    0.003   0.0012  -0.0019   0.0031    -373.6735
SES         -0.0094  -0.0013   -0.008  -0.0094  -0.0075  -0.0019    -181.4461
ISS         -0.0173  -0.0111  -0.0062  -0.0173  -0.0169  -0.0003     -70.4565
IQTe        -0.0012  -0.0015   0.0003  -0.0012   0.0009  -0.0021     -154.466
PreTe       -0.0063  -0.0025  -0.0038  -0.0063   -0.003  -0.0033    -143.6412

Nearest Neighbor (2:1)
distance      0.314   0.3077   0.0063    0.314   0.3111   0.0029      49.1766
Age         -0.0089  -0.0064  -0.0025  -0.0089  -0.0097   0.0008      24.6616
Gender       0.0012  -0.0018    0.003   0.0012   0.0021  -0.0009     -38.8343
SES         -0.0094  -0.0013   -0.008  -0.0094  -0.0074   -0.002     -71.5272
ISS         -0.0173  -0.0111  -0.0062  -0.0173  -0.0185   0.0012       1.3258
IQTe        -0.0012  -0.0015   0.0003  -0.0012   0.0078  -0.0012     -52.6631
PreTe       -0.0063  -0.0025  -0.0038  -0.0063  -0.0047  -0.0016       7.7562

Caliper with 0.25 SD
distance      0.314   0.3077   0.0063    0.311    0.311   0.0001      99.5839
Age         -0.0089  -0.0064  -0.0025  -0.0036  -0.0099   0.0063    -126.6774
Gender       0.0012  -0.0018    0.003   0.0005  -0.0002   0.0007    -439.1077
SES         -0.0094  -0.0013   -0.008  -0.0042  -0.0024  -0.0019    -264.9567
ISS         -0.0173  -0.0111  -0.0062   -0.013  -0.0187   0.0057    -235.0472
IQTe        -0.0012  -0.0015   0.0003   0.0027  -0.0026   0.0053    -258.9685
PreTe       -0.0063  -0.0025  -0.0038  -0.0002  -0.0047   0.0046    -135.5659

Stratification with 5 Subclasses
distance     0.3141   0.3077   0.0063    0.314   0.3077   0.0063      86.5977
Age         -0.0089  -0.0064  -0.0025  -0.0089  -0.0064  -0.0025      71.0325
Gender       0.0012  -0.0018    0.003   0.0012  -0.0018    0.003      17.5276
SES         -0.0094  -0.0013   -0.008  -0.0094  -0.0013   -0.008      64.1946
ISS         -0.0173  -0.0111  -0.0062  -0.0173  -0.0111  -0.0062      64.0571
IQTe        -0.0012  -0.0015   0.0003  -0.0012  -0.0015   0.0003        25.78
PreTe       -0.0063  -0.0025  -0.0038  -0.0063  -0.0025  -0.0038      60.4789

Full
distance      0.314   0.3077   0.0063    0.314    0.314  -0.0001      98.2063
Age         -0.0089  -0.0064  -0.0025  -0.0089  -0.0113   0.0024     -138.813
Gender       0.0012  -0.0018    0.003   0.0012  -0.0027   0.0039    -106.9591
SES         -0.0094  -0.0013   -0.008  -0.0094  -0.0108   0.0014    -109.9768
ISS         -0.0173  -0.0111  -0.0062  -0.0173  -0.0235   0.0062     -70.4173
IQTe        -0.0012  -0.0015   0.0003  -0.0012  -0.0022    0.001    -182.2413
PreTe       -0.0063  -0.0025  -0.0038  -0.0063  -0.0068   0.0005     -93.7985
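
The Percent Improvement column can be read against the definition of percent balance improvement given in the MatchIt documentation, 100 × (|a| − |b|) / |a|, where a is the mean difference before matching and b is the mean difference after matching. The following is a minimal base-R sketch of that definition only; the helper name `pct_improvement` is hypothetical and is not a MatchIt function. If, as the table title suggests, the entries are averages over simulation runs, applying the formula to the averaged means shown here need not reproduce the tabulated percentages exactly.

```r
# Percent balance improvement: 100 * (|before| - |after|) / |before|.
# `pct_improvement` is a hypothetical helper name used only for illustration.
pct_improvement <- function(diff_before, diff_after) {
  100 * (abs(diff_before) - abs(diff_after)) / abs(diff_before)
}

# Propensity score ("distance") row, Nearest Neighbor (1:1), Table 9:
# mean difference 0.0063 before matching and 0.0002 after matching.
pct_improvement(0.0063, 0.0002)

# IQTe row, Nearest Neighbor (1:1), Table 9: the absolute difference grows
# from 0.0003 to 0.0021, so the improvement is negative.
pct_improvement(0.0003, -0.0021)
```

Covariates whose pre-matching difference is already close to zero have a very small denominator, which is one plausible reason the covariate rows show large negative percentages even though the absolute differences remain small.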

Figure 2: Matching Results for Different Matching Methods with 1000 Sample Size

Figure 3: Jitter Plot of Distribution of Propensity Score for different methods with 1000 Sample Size.
Panels: Nearest Neighbor (1:1); Nearest Neighbor (2:1); Caliper with 0.25 SD; Stratification with 5 Subclasses; Full Matching

Figure 4: Average Histograms of Propensity Score for Matched and Unmatched Individuals in Both Treatment and Control Groups for All Runs of Simulation with 1000 sample size.
Panels: Nearest Neighbor (1:1); Nearest Neighbor (2:1); Caliper with 0.25 SD; Stratification with 5 Subclasses; Full Matching

Appendix A2: Results for 2500 Sample Size

Table 10: Standardized Summary of Generated Data with 2500 Sample Size

             Age  Gender     SES     ISS    IQTe
Min.       -0.38   -0.34   -0.34    -0.4   -0.31
1st Qu.    -0.06   -0.07   -0.07   -0.07   -0.07
Median      0.01   -0.01       0       0       0
Mean           0   -0.01       0       0       0
3rd Qu.     0.07    0.06    0.07    0.07    0.07
Max.        0.31    0.34    0.36    0.32    0.33

           PreTe     IAP   PosTe    Diff ZGrowth
Min.       -0.32       0    -0.8   -0.58   -0.46
1st Qu.    -0.07       0   -0.09   -0.06   -0.18
Median         0       0    0.17    0.11   -0.07
Mean           0    0.31    0.25    0.25       0
3rd Qu.     0.07       1     0.6    0.64    0.22
Max.        0.31       1    1.62    1.43    0.67

Table 11: Matching Results for Different Matching Methods under 2500 Sample Size

Methods                                        Control  Treated
Nearest Neighbor (1:1)             All            1726      774
                                   Matched         774      774
                                   Unmatched       952        0
                                   Discarded         0        0
Nearest Neighbor (2:1)             All            1726      774
                                   Matched        1548      774
                                   Unmatched       178        0
                                   Discarded         0        0
Caliper with 0.25 SD               All            1726      774
                                   Matched         685      685
                                   Unmatched      1041       89
                                   Discarded         0        0
Stratification with 5 Subclasses   All            1726      774
                                   Matched        1726      774
                                   Unmatched         0        0
                                   Discarded         0        0
Full                               All            1726      774
                                   Matched        1726      774
                                   Unmatched         0        0
                                   Discarded         0        0
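
The nearest neighbor counts in Table 11 follow simple bookkeeping when every treated unit receives its full set of matches and no control is reused. The base-R sketch below illustrates that arithmetic under the assumption of matching without replacement; `nn_counts` is a hypothetical helper written for this illustration and is not a MatchIt function.

```r
# Bookkeeping for k:1 nearest neighbor matching, assuming matching without
# replacement and that no treated unit is dropped.
# `nn_counts` is a hypothetical helper, not part of MatchIt.
nn_counts <- function(n_control, n_treated, ratio) {
  # Each treated unit claims `ratio` controls, capped by the control pool.
  matched_control <- min(n_control, ratio * n_treated)
  c(matched_control   = matched_control,
    unmatched_control = n_control - matched_control)
}

# Group sizes taken from the "All" rows of Table 11
nn_counts(n_control = 1726, n_treated = 774, ratio = 1)
nn_counts(n_control = 1726, n_treated = 774, ratio = 2)
```

With ratio 1 this gives 774 matched and 952 unmatched controls, and with ratio 2 it gives 1548 and 178, which agrees with the two nearest neighbor blocks of Table 11. The caliper block differs because treated units with no control inside the caliper are left unmatched (774 − 685 = 89).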

Table 12: Results for Independent t test for different matching methods with 2500 sample size.

Methods                 t-value  p-value  α-level  95% CI Upper  95% CI Lower
Nearest Neighbor (1:1)   9.1417   0.0001     0.05        0.5432        0.3512
Nearest Neighbor (2:1)   10.869   0.0001     0.05        0.5348        0.3712
Caliper                  8.7814   0.0002     0.05        0.5499        0.3487
Stratification          11.0759   0.0001     0.05        0.5331        0.3726
Full                    11.0759   0.0001     0.05        0.5331        0.3726

H0: The difference between the means is 0, μ(Before Matching) = μ(After Matching)
H1: The difference between the means is not 0, μ(Before Matching) ≠ μ(After Matching)

Table 13: Overall ANCOVA Results with 2500 sample size

             Df    Sum Sq   Mean Sq   F value     Pr(>F)
Age           1   604.764   604.764  1676.171  1.59E-228
Gender        1   202.126   202.126   560.075   1.88E-82
SES           1   589.417   589.417  1632.952  3.48E-235
ISS           1    51.504    51.504   142.706   5.79E-22
IQTe          1   148.737   148.737   411.645   6.48E-69
PreTe         1     1.136     1.136     3.163      0.242
Residuals  2493   901.316     0.362

Table 14: Average Standardized Mean Difference for covariates and propensity score of different matching methods with 2500 sample size.

                Before Matching            After Matching             Percent
Variable    Treated  Control     Diff  Treated  Control     Diff  Improvement

Nearest Neighbor (1:1)
distance      0.309   0.3067   0.0023    0.309    0.309   0.0001      97.8767
Age           0.003  -0.0002   0.0032    0.003   0.0071  -0.0042     -309.331
Gender      -0.0056  -0.0021  -0.0035  -0.0056  -0.0077   0.0021    -103.4416
SES         -0.0006   -0.003   0.0023  -0.0006  -0.0003  -0.0004   -1221.5644
ISS          0.0008  -0.0033   0.0041   0.0008   0.0031  -0.0022      -92.787
IQTe        -0.0055  -0.0035   -0.002  -0.0055  -0.0063   0.0008    -219.9626
PreTe       -0.0024   -0.003   0.0006  -0.0024  -0.0015  -0.0008     -79.8798

Nearest Neighbor (2:1)
distance      0.309   0.3067   0.0023    0.309   0.3083   0.0007      69.4695
Age           0.003  -0.0002   0.0032    0.003   0.0005   0.0024       7.8485
Gender      -0.0056  -0.0021  -0.0035  -0.0056  -0.0045  -0.0011      35.2922
SES         -0.0006   -0.003   0.0023  -0.0006  -0.0031   0.0025   -1875.5309
ISS          0.0008  -0.0033   0.0041   0.0008   -0.001   0.0019      21.7349
IQTe        -0.0055  -0.0035   -0.002  -0.0055  -0.0055   0.0001     -28.3838
PreTe       -0.0024   -0.003   0.0006  -0.0024   -0.004   0.0016      36.8011

Caliper with 0.25 SD
distance      0.309   0.3067   0.0023   0.3081   0.3081   0.0001      99.3514
Age           0.003  -0.0002   0.0032   0.0024   0.0049  -0.0024   -1834.8499
Gender      -0.0056  -0.0021  -0.0035  -0.0065  -0.0094    0.003     -68.0536
SES         -0.0006   -0.003   0.0023  -0.0037  -0.0042   0.0005   -2877.1986
ISS          0.0008  -0.0033   0.0041  -0.0009  -0.0008  -0.0001    -173.7175
IQTe        -0.0055  -0.0035   -0.002   -0.008  -0.0073  -0.0006    -315.2885
PreTe       -0.0024   -0.003   0.0006  -0.0051  -0.0036  -0.0015    -129.1921

Stratification with 5 Subclasses
distance      0.309   0.3067   0.0023    0.309   0.3067   0.0023      86.3333
Age           0.003  -0.0002   0.0032    0.003  -0.0002   0.0032     -34.1417
Gender      -0.0056  -0.0021  -0.0035  -0.0056  -0.0021  -0.0035      66.4047
SES         -0.0006   -0.003   0.0023  -0.0006   -0.003   0.0023      41.9139
ISS          0.0008  -0.0033   0.0041   0.0008  -0.0033   0.0041      66.5175
IQTe        -0.0055  -0.0035   -0.002  -0.0055  -0.0035   -0.002      73.7666
PreTe       -0.0024   -0.003   0.0006  -0.0024   -0.003   0.0006      80.7408

Full
distance      0.309   0.3067   0.0023    0.309    0.309   0.0001      98.4755
Age           0.003  -0.0002   0.0032    0.003   0.0021   0.0008    -179.7625
Gender      -0.0056  -0.0021  -0.0035  -0.0056  -0.0032  -0.0024    -130.6474
SES         -0.0006   -0.003   0.0023  -0.0006   0.0008  -0.0014    -122.3761
ISS          0.0008  -0.0033   0.0041   0.0008   0.0031  -0.0022    -103.5341
IQTe        -0.0055  -0.0035   -0.002  -0.0055  -0.0077   0.0022     -79.1162
PreTe       -0.0024   -0.003   0.0006  -0.0024  -0.0045   0.0021     -38.4984

Figure 5: Matching Results for Different Matching Methods with 2500 Sample Size

Figure 6: Jitter Plot of Distribution of Propensity Score for different methods with 2500 Sample Size
Panels: Nearest Neighbor (1:1); Nearest Neighbor (2:1); Caliper with 0.25 SD; Stratification with 5 Subclasses; Full Matching

Figure 7: Average Histograms of Propensity Score for Matched and Unmatched Individuals in Both Treatment and Control Groups for All Runs of Simulation with 2500 sample size.
Panels: Nearest Neighbor (1:1); Nearest Neighbor (2:1); Caliper with 0.25 SD; Stratification with 5 Subclasses; Full Matching

Appendix A3: Results for 5000 Sample Size

Table 15: Standardized Summary of Generated Data with 5000 Sample Size

             Age  Gender     SES     ISS    IQTe
Min.       -0.34   -0.37   -0.37   -0.35   -0.33
1st Qu.    -0.07   -0.07   -0.07   -0.07   -0.07
Median         0       0       0       0       0
Mean           0       0       0       0       0
3rd Qu.     0.07    0.07    0.06    0.07    0.07
Max.        0.38    0.34    0.36    0.46     0.3

           PreTe     IAP   PosTe    Diff ZGrowth
Min.       -0.35       0   -0.82   -0.54   -0.45
1st Qu.    -0.07       0   -0.09   -0.06   -0.17
Median         0       0    0.16    0.11   -0.08
Mean           0    0.31    0.25    0.25       0
3rd Qu.     0.07       1    0.61    0.65    0.23
Max.        0.34       1    1.61    1.34    0.61

Table 16: Matching Results for Different Matching Methods under 5000 Sample Size

Methods                                        Control  Treated
Nearest Neighbor (1:1)             All            3448     1552
                                   Matched        1552     1552
                                   Unmatched      1896        0
                                   Discarded         0        0
Nearest Neighbor (2:1)             All            3448     1552
                                   Matched        3103     1552
                                   Unmatched       345        0
                                   Discarded         0        0
Caliper with 0.25 SD               All            3448     1552
                                   Matched        1445     1445
                                   Unmatched      2003      107
                                   Discarded         0        0
Stratification with 5 Subclasses   All            3448     1552
                                   Matched        3448     1552
                                   Unmatched         0        0
                                   Discarded         0        0
Full                               All            3448     1552
                                   Matched        3448     1552
                                   Unmatched         0        0
                                   Discarded         0        0

Table 17: Results for Independent t test for different matching methods with 5000 sample size.

Methods                 t-value  p-value  α-level  95% CI Upper  95% CI Lower
Nearest Neighbor (1:1)  13.1287   0.0001     0.05        0.5197        0.3846
Nearest Neighbor (2:1)  15.3948   0.0001     0.05        0.5102        0.3949
Caliper                 12.7886   0.0001     0.05        0.5224        0.3833
Stratification          15.5441   0.0001     0.05        0.5055        0.3922
Full                    15.5441   0.0001     0.05        0.5055        0.3922

H0: The difference between the means is 0, μ(Before Matching) = μ(After Matching)
H1: The difference between the means is not 0, μ(Before Matching) ≠ μ(After Matching)

Table 18: Overall ANCOVA Results with 5000 sample size

             Df    Sum Sq   Mean Sq   F value     Pr(>F)
Age           1  1217.552  1217.552  3377.543          0
Gender        1    394.27    394.27  1093.506  3.88E-186
SES           1  1182.503  1182.503  3279.603          0
ISS           1   102.135   102.135   283.228   2.61E-48
IQTe          1   299.814   299.814   831.564  6.38E-147
PreTe         1     1.444     1.444     4.001          4
Residuals  4993  1801.282     0.361

Table 19: Average Standardized Mean Difference for covariates and propensity score of different matching methods with 5000 sample size.

                Before Matching            After Matching             Percent
Variable    Treated  Control     Diff  Treated  Control     Diff  Improvement

Nearest Neighbor (1:1)
distance     0.3112     0.31   0.0013   0.3112   0.3112        0      98.3897
Age         -0.0022   0.0014  -0.0035  -0.0022  -0.0022        0     -90.8191
Gender       0.0016  -0.0009   0.0024   0.0016   0.0014   0.0002     -46.6452
SES         -0.0074        0  -0.0074  -0.0074  -0.0064   -0.001      -11.372
ISS         -0.0002   0.0006  -0.0008  -0.0002  -0.0025   0.0023     -85.6743
IQTe        -0.0018   0.0003  -0.0021  -0.0018  -0.0012  -0.0006    -200.7105
PreTe       -0.0045   0.0008  -0.0053  -0.0045  -0.0037  -0.0008    -220.3229

Nearest Neighbor (2:1)
distance     0.3112     0.31   0.0013   0.3112   0.3109   0.0003      72.5936
Age         -0.0022   0.0014  -0.0035  -0.0022  -0.0026   0.0004      12.5727
Gender       0.0016  -0.0009   0.0024   0.0016   0.0008   0.0008      40.3382
SES         -0.0074        0  -0.0074  -0.0074  -0.0065  -0.0009      54.9241
ISS         -0.0002   0.0006  -0.0008  -0.0002  -0.0016   0.0014      30.4543
IQTe        -0.0018   0.0003  -0.0021  -0.0018   -0.001  -0.0007      -86.354
PreTe       -0.0045   0.0008  -0.0053  -0.0045  -0.0035  -0.0009      27.8238

Caliper with 0.25 SD
distance     0.3112     0.31   0.0013   0.3108   0.3108        0      99.3515
Age         -0.0022   0.0014  -0.0035  -0.0018  -0.0034   0.0016     -142.029
Gender       0.0016  -0.0009   0.0024    0.001    0.001        0    -137.8081
SES         -0.0074        0  -0.0074  -0.0044  -0.0052   0.0007     -29.6472
ISS         -0.0002   0.0006  -0.0008  -0.0003   0.0003  -0.0006     -133.473
IQTe        -0.0018   0.0003  -0.0021  -0.0007  -0.0006        0    -332.6149
PreTe       -0.0045   0.0008  -0.0053  -0.0026  -0.0034   0.0008     -82.2883

Stratification with 5 Subclasses
distance     0.3112     0.31   0.0013   0.3112     0.31   0.0013      87.0667
Age         -0.0022   0.0014  -0.0035  -0.0022   0.0014  -0.0035      68.1516
Gender       0.0016  -0.0009   0.0024   0.0016  -0.0009   0.0024      81.7804
SES         -0.0074        0  -0.0074  -0.0074        0  -0.0074      81.5948
ISS         -0.0002   0.0006  -0.0008  -0.0002   0.0006  -0.0008      71.2604
IQTe        -0.0018   0.0003  -0.0021  -0.0018   0.0003  -0.0021     -18.9422
PreTe       -0.0045   0.0008  -0.0053  -0.0045   0.0008  -0.0053      80.1985

Full
distance     0.3112     0.31   0.0013   0.3112   0.3112        0       98.493
Age         -0.0022   0.0014  -0.0035  -0.0022  -0.0029   0.0007    -193.9989
Gender       0.0016  -0.0009   0.0024   0.0016        0   0.0016     -51.6502
SES         -0.0074        0  -0.0074  -0.0074  -0.0073  -0.0002       0.5987
ISS         -0.0002   0.0006  -0.0008  -0.0002  -0.0024   0.0022    -109.6501
IQTe        -0.0018   0.0003  -0.0021  -0.0018  -0.0044   0.0027    -179.1916
PreTe       -0.0045   0.0008  -0.0053  -0.0045  -0.0063   0.0018     -77.4171

Figure 8: Matching Results for Different Matching Methods with 5000 Sample Size