A Comparison of Propensity Score Matching Methods in R with the MatchIt Package: A Simulation Study


A Comparison of Propensity Score Matching Methods in R with the MatchIt Package: A Simulation Study

A thesis submitted to the Graduate School of the University of Cincinnati in partial fulfillment of the requirements for the degree of Master of Arts in the Department of Quantitative and Mixed Methods Research Methodology of the College of Education, Criminal Justice, and Human Services

by Jiaqi Zhang
B.A. Chengdu College of University of Electronic Science and Technology of China
March 2013

Committee Chair: Christopher M. Swoboda, Ph.D.

Abstract

Propensity score matching (PSM) methods are becoming increasingly popular in non-experimental and observational studies to reduce selection bias by balancing measured covariates. This process has developed into a relatively systematic and scientific branch of matching methods. MatchIt is a package in the statistical programming software R that allows matching using several methods, including nearest neighbor, caliper, stratification, and full matching, in order to find cases balanced on the propensity score between the treatment and control groups and to support causal inference. Choosing which of those options to implement can be confusing for researchers. In the present study, these different methods are explained and a simulation study is conducted using example data to illustrate the differences between them. The generated data are assigned to groups based on a function of observed covariates and randomness, simulating selection bias, and analyzed to examine whether any of five popular propensity score matching methods performs more effectively in balancing covariates and reducing selection bias within a given sample size. This study shows that each of the propensity score matching methods considered, Nearest Neighbor (1:1), Nearest Neighbor (2:1), Caliper, Stratification, and Full matching, performs well, and all provide strong evidence for making causal inferences. This is particularly true for Caliper and Full matching. R code, detailed results, and suggestions for future study are also provided.

Keywords: propensity score, covariates, bias, simulation, causal inference


Acknowledgement

I would like to express my deepest appreciation to my committee chair, Professor Christopher M. Swoboda, for his guidance, monitoring, and constant encouragement throughout the whole process of this thesis. By doing article reviews with him continually for over a year, I began to focus on propensity score estimation. His erudite quantitative knowledge, interesting teaching, and encouragement all conveyed a spirit of adventure with regard to research. Without his guidance, encouragement, and persistent help, this thesis would not have been possible. I would also like to thank my committee member, Professor Marcus L. Johnson, whose courses and knowledge allowed me to build my skills in writing literature reviews. His knowledge of IRB and APA format has also helped me across all of my studies. In addition, a thank you to my former advisor, Professor Wei Pan of Duke University, who introduced me to SPSS and the statistical knowledge needed to finish this thesis.

Table of Contents

Abstract ... ii
Acknowledgement ... iv
List of Tables ... vii
List of Figures ... ix
1. Introduction
1.1 Rubin Causal Model
1.2 Propensity Score Matching Methods
Estimating Propensity Score
Matching Individuals Using the Propensity Score
Nearest Neighbor Matching
Caliper Matching
Stratification Matching
Full Matching
2. Research Overview and Research Questions
3. Methods
3.1 Variables
3.2 Data Generation
3.3 Propensity Score Matching
3.4 Analysis of Covariance for Data Before Matching
3.5 Selecting the Appropriate Method
4. Results
4.1 Understand the Generated Data

4.2 Reducing Distance Using Propensity Scores
4.3 Results for Different Matching Methods
4.4 Results for Independent t-test
4.5 Result for ANCOVA on Data Before Matching
5. Conclusion
6. Discussion
7. References
Appendix A1: Results for 1000 Sample Size
Appendix A2: Results for 2500 Sample Size
Appendix A3: Results for 5000 Sample Size
Appendix B: R Codes

List of Tables

Table 1: Standardized Summary of Generated Data with 1000 Sample Size.
Table 2: Average Proportions for SES and ISS.
Table 3: Standardized Mean Difference on Propensity Score with 1000 Sample Size Data after Applying Matching Methods across All Simulation Runs.
Table 4: Matching Results for Different Matching Methods under 1000 Sample Size.
Table 5: Average Standardized Mean Difference for Covariates and Propensity Score of Different Matching Methods under Different Sample Size.
Table 6: Results for Independent t-test for Different Matching Methods with 1000 Sample Size.
Table 7: Overall ANCOVA Results with 1000 Sample Size.
Table 8: Combined Overall Comparison for Propensity Score Matching Methods under Different Sample Size.
Table 9: Average Standardized Mean Difference for Covariates and Propensity Score of Different Matching Methods with 1000 Sample Size.
Table 10: Standardized Summary of Generated Data with 2500 Sample Size.
Table 11: Matching Results for Different Matching Methods under 2500 Sample Size.
Table 12: Results for Independent t-test for Different Matching Methods with 2500 Sample Size.
Table 13: Overall ANCOVA Results with 2500 Sample Size.
Table 14: Average Standardized Mean Difference for Covariates and Propensity Score of Different Matching Methods with 2500 Sample Size.
Table 15: Standardized Summary of Generated Data with 5000 Sample Size.

Table 16: Matching Results for Different Matching Methods under 5000 Sample Size.
Table 17: Results for Independent t-test for Different Matching Methods with 5000 Sample Size.
Table 18: Overall ANCOVA Results with 5000 Sample Size.
Table 19: Average Standardized Mean Difference for Covariates and Propensity Score of Different Matching Methods with 5000 Sample Size.

List of Figures

Figure 1: Different Propensity Matching Methods Considered in This Paper.
Figure 2: Matching Results for Different Matching Methods with 1000 Sample Size.
Figure 3: Jitter Plot of Distribution of Propensity Score for Different Methods with 1000 Sample Size.
Figure 4: Average Histograms of Propensity Score for Matched and Unmatched Individuals in Both Treatment and Control Groups for All Runs of Simulation with 1000 Sample Size.
Figure 5: Matching Results for Different Matching Methods with 2500 Sample Size.
Figure 6: Jitter Plot of Distribution of Propensity Score for Different Methods with 2500 Sample Size.
Figure 7: Average Histograms of Propensity Score for Matched and Unmatched Individuals in Both Treatment and Control Groups for All Runs of Simulation with 2500 Sample Size.
Figure 8: Matching Results for Different Matching Methods with 5000 Sample Size.
Figure 9: Jitter Plot of Distribution of Propensity Score for Different Methods with 5000 Sample Size.
Figure 10: Average Histograms of Propensity Score for Matched and Unmatched Individuals in Both Treatment and Control Groups for All Runs of Simulation with 5000 Sample Size.

1. Introduction

In educational research, designs with random samples and random assignment provide the strongest evidence and confidence for researchers to make a causal inference about a treatment's effectiveness. However, in most cases, experimental designs are challenging to achieve. Researchers need to account for selection bias, defined as "any characteristic of a sample that is believed to make it different from the study population in some important way" (York, 1998, p. 239), arising from the lack of randomization in assigning individuals to either treatment or control groups, because those groups are often neither equivalent nor comparable. As a solution to the problem of making causal inferences about treatment effectiveness in the presence of selection bias, propensity score matching methods are employed to balance treatment and control groups in many fields, including statistics (Rubin, 2006), medicine (Christakis & Iwashyna, 2003), economics (Abadie & Imbens, 2006), political science (Herron & Wand, 2007), sociology (Morgan & Harding, 2006), and even law (Rubin, 2001). Propensity Scores (PS) (Rubin & Rosenbaum, 1984), originally introduced to reduce selection bias and estimate treatment effects in observational studies, have become an important tool for achieving causal inference with quasi-experimental data. Propensity Score Matching (PSM) is a statistical matching technique that attempts to estimate the effect of treatment while accounting for the covariates that predict whether an individual received the treatment. However, due to the many different propensity matching

methods, researchers can be challenged in deciding which propensity score matching method should be used in their data analysis. This paper compares five major propensity score matching methods, Nearest Neighbor Matching (1:1), Nearest Neighbor Matching (2:1), Caliper Matching, Stratification Matching, and Full Matching, under different given sample sizes by generating simulated data and then applying PSM with the MatchIt package in the statistical software R. Differences in the effectiveness of these methods on the simulated example data can be used to guide researchers in their choice of matching mechanism. Further, understanding the impact of varying sample sizes in this context can further illuminate the choice of matching mechanism in different settings. In the rest of this paper, I introduce the different methods, present the simulation study used to compare those methods, summarize the results, and present the conclusions and suggestions for future work.

1.1 Rubin Causal Model

The idea behind the Rubin Causal Model is based on potential outcomes and the assignment mechanism, such that an individual has different potential outcomes depending on the condition assigned. Rubin (1974) defines a causal effect as:

"Intuitively, the causal effect of one treatment E, over another C, for a particular individual and an interval of time from $t_1$ to $t_2$ is the difference between what would have happened at time $t_2$, if the individual had been exposed to E initiated at $t_1$ and what would have happened at $t_2$, if the individual had been exposed to C initiated at $t_1$." (p. 689)

According to Rubin's definition,

$$\tau_i = Y_{i1} - Y_{i0}$$

where $\tau_i$ denotes the treatment effect for individual $i$, $Y_{i1}$ denotes the potential outcome for individual $i$ in the treatment group, and $Y_{i0}$ denotes the potential outcome for individual $i$ in the control group. The observed outcome for individual $i$ can be obtained by:

$$Y_i = T_i Y_{i1} + (1 - T_i) Y_{i0}$$

where

$$T_i = \begin{cases} 1, & i \in \{\text{Treatment}\} \\ 0, & i \in \{\text{Control}\} \end{cases}$$

Causal inference can also be thought of as a missing data problem, because for each individual $i$ only one potential outcome, either $Y_{i1}$ or $Y_{i0}$, can be observed at one time. For example, there may be a true causal effect on income of obtaining a Master's degree rather than not. In order to evaluate the causal effect on income of getting a Master's degree versus not, researchers would want to look at the income outcome for the same individual in both the treatment (with a Master's degree) and control (without a Master's degree) conditions. However, an individual can only be assigned to either the treatment group or the control group. Therefore, if an individual is pursuing a Master's degree, researchers will fail to observe the potential outcome for that same individual not getting a Master's degree. Instead, the unit of analysis is changed to the group level, such that causal inferences are drawn through group mean differences. In the ideal case, if randomization is applied, observed and unobserved potential outcomes are balanced across treatment and control groups with high probability as the sample size increases, because random assignment makes treatment assignment independent of the outcomes $Y_{i1}$ and $Y_{i0}$, written as $\{Y_{i0}, Y_{i1}\} \perp T_i$. That is, for $j = 0, 1$,

$$E(Y_{ij} \mid T_i = 1) = E(Y_{ij} \mid T_i = 0) = E(Y_i \mid T_i = j)$$

Thus, the average treatment effect (ATE) is estimated as:

$$\tau = E(Y_{i1} \mid T_i = 1) - E(Y_{i0} \mid T_i = 0) = E(Y_i \mid T_i = 1) - E(Y_i \mid T_i = 0)$$

Unfortunately, in most situations randomization is not available and, as such, covariates are almost never balanced across treatment and control groups (Jasjeet, 2011). The average treatment effect for the treated (ATT) is then of interest, due to the imbalance between the treatment and control groups:

$$\tau_{(T=1)} = E(Y_{i1} \mid T_i = 1) - E(Y_{i0} \mid T_i = 1)$$

When randomization is achieved, we can estimate the average treatment effect (ATE) directly; otherwise, we need to take the average treatment effect for the treated (ATT) into consideration. When randomization is unavailable, individuals are selected into treatment and control groups in ways related to the observable covariates $X$, which results in selection bias, but such selection bias can be reduced by the presumption of independence: $E(Y_{ij} \mid T_i, X_i) = E(Y_{ij} \mid X_i)$ (Heckman, Ichimura, Smith, & Todd, 1998). Following Rubin (1977), for $j = 0, 1$:

$$E(Y_{ij} \mid X_i, T_i = 1) = E(Y_{ij} \mid X_i, T_i = 0) = E(Y_{ij} \mid X_i, T_i = j)$$

where $X_i$ are the observed covariates for each individual $i$ in the treatment and control groups. Thus, the average treatment effect for the treated (ATT) is estimated as

$$\tau_{(T=1)} = E\{E(Y_i \mid X_i, T_i = 1) - E(Y_i \mid X_i, T_i = 0) \mid T_i = 1\}$$

One route to achieve this comparison is through propensity score matching.

1.2 Propensity Score Matching Methods

The idea behind propensity score matching is to match individuals in unbalanced treatment and control groups based on their probability of receiving treatment, as estimated by the propensity score. When using propensity score matching, there are three main steps involved in the application of statistical matching: (i) estimating the propensity score; (ii) matching individuals using the propensity score to create balance in observed covariates across treatment and control groups; and (iii) evaluating the quality of the balance for the matched data.

Estimating Propensity Score

The Propensity Score (PS) is defined as the conditional probability of receiving a given treatment given a vector of observed covariates for individual $i$ and treatment $T_i$ (Rubin & Rosenbaum, 1984). The propensity score is the estimated probability $ps(i) = \Pr(T_i = 1 \mid X_i)$ of assignment to the treatment given a set of observable covariates $X_i$. Usually, researchers estimate the propensity score using logistic regression:

$$\ln\left(\frac{ps(i)}{1 - ps(i)}\right) = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_n X_n$$

$$ps(i) = \frac{\exp(\beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_n X_n)}{1 + \exp(\beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_n X_n)}$$

where $\beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_n X_n$ is a linear combination of the observed covariates. To apply this method, there are two assumptions, referred to as strong ignorability, as stated in the Rubin Causal Model:

1) $(Y_{i0}, Y_{i1}) \perp T_i \mid X_i$, meaning potential outcomes are independent of treatment conditional on the observable covariates $X_i$.

2) $0 \le ps(i) \le 1$, meaning the propensity score is a probability that is no less than 0 and no greater than 1.
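As a rough illustration, the propensity score model above can be fit in R with a standard logistic regression. This is only a minimal sketch under assumed names (a data frame dat containing the treatment indicator IAP and the covariates used later in this study); the exact code appears in Appendix B.

    # Minimal sketch (assumes a data frame 'dat' with treatment IAP and covariates);
    # not the thesis code from Appendix B.
    ps.model <- glm(IAP ~ Age + Gender + SES + ISS + IQTe + PreTe,
                    family = binomial(link = "logit"), data = dat)

    # Estimated propensity score ps(i) = Pr(IAP = 1 | X) for every individual
    dat$ps <- predict(ps.model, type = "response")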

A general tendency when using propensity score matching is to keep the control group as large as possible, to increase the likelihood of finding better matches for individuals in the treatment group. Generally, the rule of thumb is to choose a control group no more than nine times as large as the treatment group (Bagley, White, & Golomb, 2001). Another essential issue is variable selection for the propensity score model. If selection is captured in the propensity score model, then after the matching strategy is employed, selection bias should be eliminated (Victor, 2011). Omitting important variables would result in remaining selection bias and biased values for the ATT (Marco & Sabine, 2005). There are two basic guides for variable selection (Caliendo & Kopeinig, 2005):

Include all variables related to the outcome variable; including only known confounders will yield biased and unbalanced results.

Variables related to treatment but not related to the outcome variable should not be included.

Matching Individuals Using the Propensity Score

After estimating the propensity score for each individual, researchers should select a propensity score matching method to perform the actual matching process. Figure 1 shows the five main propensity score matching methods: Nearest Neighbor (1:1) Matching, Nearest Neighbor (2:1) Matching, Caliper Matching, Stratification Matching, and

Full Matching (Rubin, 1973; Althauser & Rubin, 1970; Rosenbaum & Rubin, 1984; Rosenbaum, 1991). Most of the matching methods incorporate the caliper idea to improve the quality of matching, because by setting an acceptable maximum distance on the propensity score, the matching methods cannot match individuals with distant propensity scores.

Nearest Neighbor Matching

Nearest Neighbor Matching is the most straightforward matching method. In this method, an individual $j$ from the control group is chosen as the match for a treated individual $i$ by minimizing the difference between their estimated propensity scores, regardless of how large that distance between treatment and control turns out to be:

$$C(P_i) = \min_j \lvert P_i - P_j \rvert$$

where $i \in \{\text{Treatment Group}\}$, $j \in \{\text{Control Group}\}$, $C(P_i)$ is the selected matching set of control subjects $j$ matched to treated individual $i$, $P_i$ is the propensity score for treated individual $i$, and $P_j$ is the propensity score for control individual $j$. In other words, nearest neighbor matching aims to find the closest $j$ in the control group for each treated $i$. When using the nearest neighbor matching method, there is a trade-off between precision and bias. Precision is the accuracy across matching methods, while bias comes from matching errors. There are two main choices that drive this trade-off:

a) Replacement: whether matching is done with replacement or without replacement. In the former case, an individual can be used more than once as

a match; that is, each individual $i$ in the treatment group is matched with its nearest individual $j$ in the control group. As a result, the bias is reduced and the average quality of matching is increased, but fewer individuals will be selected, which decreases the precision, because more than one treated individual $i$ may be matched with the same control individual $j$. In the latter case, an individual can be used only once, meaning each individual $i$ in the treatment group is matched with its nearest unmatched individual $j$ in the control group, which may not be the closest one overall.

b) Ratio: how many control individuals are matched to each treated individual. This directly affects how many individuals are included in the study. For example, suppose a study has 100 individuals in total, 30 in the treatment group and 70 in the control group. Using nearest neighbor matching with a 1:1 ratio, at most 60 individuals are taken into consideration; if the ratio is 2:1, the maximum number of individuals involved could be 90. As a result, a lower ratio will reduce the bias and increase the average quality of matching but lower the precision, while a higher ratio will raise the precision but decrease the average quality of matching and increase the bias.

Caliper Matching

Caliper matching (Cochran & Rubin, 1973) is a matching method developed from the Nearest Neighbor Matching method for cases in which the matched individual $j$ is still statistically far away from the treated individual $i$. By setting a caliper, which is a

tolerance level $\delta$ on the maximum propensity score distance $\lvert P_i - P_j \rvert$, researchers can avoid this problem. Formally:

$$C(P_i) = \min_j \lvert P_i - P_j \rvert < \delta$$

where $i \in \{\text{Treatment Group}\}$, $j \in \{\text{Control Group}\}$, $C(P_i)$ is the selected matching set of control subjects $j$ matched to treated individual $i$, $P_i$ is the propensity score for treated individual $i$, $P_j$ is the propensity score for control individual $j$, and $\delta$ is the pre-specified tolerance (caliper). Performing caliper matching means that the individual $j$ selected from the control group is the closest to individual $i$ in the treatment group in terms of the propensity score, but within the caliper. Thus, it is important to select an appropriate caliper $\delta$, since an inappropriate caliper, especially a small $\delta$, may result in few individuals being matched. Cochran and Rubin (1973) suggested guidance for determining an appropriate $\delta$ for each matching; however, later researchers noted that a possible drawback of caliper matching is that it is difficult to know what choice of tolerance level is reasonable (Smith & Todd, 2005). Considering matching on the Mahalanobis distance rather than the propensity score, Rubin and Rosenbaum (1985) suggested that a caliper of 0.2 standard deviations removes 98% of the bias if the variance in the treatment group is twice as large as that of the control group, and they generally suggest a caliper of 0.25 standard deviations of the propensity score. In this study, I followed their suggestion and used a caliper with tolerance $\delta = 0.25\,SD$.
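In MatchIt, the nearest neighbor variants and the caliper restriction described above are selected through arguments to matchit(). The sketch below is illustrative rather than the exact code in Appendix B; it assumes a data frame dat containing the treatment IAP and the covariates, with the model formula used later in Section 4.2.

    library(MatchIt)

    # Assumed data frame 'dat'; the formula follows the propensity score model in Section 4.2
    f <- IAP ~ Age + Gender + SES + ISS + IQTe + PreTe

    m.nn1 <- matchit(f, data = dat, method = "nearest", ratio = 1)  # Nearest Neighbor (1:1)
    m.nn2 <- matchit(f, data = dat, method = "nearest", ratio = 2)  # Nearest Neighbor (2:1)

    # Caliper matching: nearest neighbor restricted to |P_i - P_j| < 0.25 SD of the propensity score
    m.cal <- matchit(f, data = dat, method = "nearest", caliper = 0.25)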

Stratification Matching

Whereas the matching methods mentioned above treat all individuals as one pool, the stratification matching method (or interval matching) categorizes both the treatment and control groups based on their individual propensity scores, and then groups and matches individuals across $k$ strata. For example, one group may consist of the lowest stratum of the propensity score, containing the treated and control individuals with the lowest observed scores. Thus, in stratification matching, the number of individuals differs from one stratum to another. Even though the numbers may vary, the average propensity scores of the treatment and control groups within each stratum should not be systematically different from each other. To confirm this, the balance of the stratification matching should be examined with a standard t-test. One question that needs to be answered when conducting stratification matching is how many strata should be used. In most situations, researchers use the quintile scale to stratify individuals into 5 different strata (Susanne, 2012). Five subclasses are often enough to remove 95% of the bias associated with one single covariate (Cochrane & Chambers, 1965); because almost all of the bias is associated with the propensity score, under normality five strata remove most of the bias associated with all covariates. In an individual analysis, checking the balance of the covariates within the different groups can help researchers justify the number of strata: "Most of the algorithms can be described in the following way: First, check if within a stratum the propensity score is balanced. If not, strata are too large and need to be split. If, conditional on the propensity score being balanced, the covariates are unbalanced, the specification of the propensity score is not adequate and has to be respecified." (Marco & Sabine, 2005)
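A stratification match with five subclasses can be requested through the same matchit() interface, as sketched below (again assuming the data frame dat and formula f from the earlier sketch, not the exact Appendix B code). Full matching, described in the next subsection, uses method = "full" and relies on the optmatch package.

    # Stratification (subclassification) on the propensity score into 5 strata
    m.strat <- matchit(f, data = dat, method = "subclass", subclass = 5)

    # Full matching (see the next subsection); requires the optmatch package to be installed
    m.full <- matchit(f, data = dat, method = "full")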

After the stratification matching is performed and balance is satisfied, researchers use the mean difference in the outcome variable between the treatment and control groups within each stratum to calculate the overall impact, with weights proportional to the number of treated individuals in each stratum. If treated individuals in a particular stratum have no control matches, those individuals are automatically omitted, because that stratum receives zero weight.

Full Matching

Full matching is a particular type of stratification matching that forms the subclasses in an optimal way (Rosenbaum, 2002; Hansen, 2004). In full matching, researchers group individuals into a series of non-overlapping matched sets, using all available individuals in the data. Each matched set contains one treated individual and multiple control individuals, or one control individual and multiple treated individuals. To achieve this, researchers still measure the distance $d_{ij} = \lvert P_i - P_j \rvert$ between a given individual $i$ in the treatment group and a given individual $j$ in the control group. Then, they divide the full sample into non-overlapping matched sets. Finally, they minimize the sum of the distances over all pairs of treated and control individuals within each matched set, across all matched sets:

$$\sum_{i \in T} \sum_{j \in C} d_{ij}$$

2. Research Overview and Research Questions

In the present study, several propensity score matching techniques are examined using data from a simulated analysis to show potential advantages and

disadvantages of those different techniques, while also varying the sample size. The most effective method should include as many individuals as possible from the treatment and control groups and obtain statistical significance for the matching result, because a significant difference post-matching was built into the simulation. To achieve this goal, this thesis used the mvtnorm package in R to generate covariate data (Alan et al., 2012). After this, the propensity score matching methods were performed with the MatchIt package in R (Ho, Stuart, Imai, & King, 2011), which implements propensity score matching methods including Nearest Neighbor Matching (1:1, 2:1), Caliper Matching, Stratification Matching, and Full Matching. To analyze the matching results, an ANCOVA was conducted on the unmatched data, and independent t-tests were used to evaluate the matched data produced by the different matching methods. In an ideal situation, for the outcome variable, the ANCOVA should not detect a significant difference in the data before matching, but after matching the t-test should detect a significant difference, because the simulation was created so that individuals in the treatment group have increased outcomes. Finally, the average treatment effect for the treated (ATT) was calculated for the different propensity score matching methods. Thus, as discussed above, the research questions are:

a. Which propensity matching method achieves the highest percent improvement?
b. Which propensity matching method achieves the most balanced covariates?
c. Are there differences in a) and b) by sample size?

For each simulation run, samples are randomly generated and assigned into treatment and control groups. Data are compared before and after matching to show the differences between matching methods and the percentage of bias reduced by each.

3. Methods

3.1 Variables

In the present study, data were simulated using an example of education research assessing the effectiveness of an intervention in a quasi-experimental setting. For this simulated example, the dependent variable was the post-test (PosTe) score. The Individual Academic Program (IAP) serves as the dichotomous treatment variable in this study, where a value of 1 indicates that the individual received an Individual Academic Program to improve their academic skills. Other variables included in this study were age (Age), gender (Gender), Socioeconomic Status (SES), Individual Study Strategy (ISS), Intelligence Quotient Test Score (IQTe), and Pre-test Score (PreTe). Each of these variables is described in detail below.

For the treatment IAP being assessed in this simulation study, each individual was assigned into either the treatment or the control group by using an indicator related to all of the other variables; for details, see the R code in Appendix B.

Socioeconomic Status (SES) Score is a measure of an individual's social and economic position in relation to others. The higher the Socioeconomic Status score, the more likely a student is to have an Individual Study Strategy (representing selection bias) and to obtain higher scores on PreTe and PosTe. Socioeconomic status

typically falls into three categories, high SES, middle SES, and low SES, which were indicated in this study with 1, 0, and -1, respectively.

Individual Study Strategy (ISS) was a dichotomous variable showing whether a student has their own study strategy after school. A value of 1 indicated that students have their own study strategy, while a value of 0 showed there was no individual study strategy. Generally, students with an individual study strategy will get higher test scores on PreTe and PosTe.

Variables such as Age (Age), Gender (Gender), and Intelligence Quotient Test Score (IQTe) were also taken into consideration. The correlation matrix used for the study was as follows:

        Age  Gender  SES  ISS  IQTe  PreTe
Age      1
Gender   0     1
SES                  1
ISS                        1
IQTe                             1
PreTe                                  1

Those values were based on presumed relationships in an example of a real-life setting. The variable Age was relatively highly correlated with ISS and PreTe, because as Age increases, students develop their own ISS and, having more knowledge, will have higher PreTe scores than younger students. The variable Gender has low correlations with the other variables, while SES should have a relatively high correlation with PreTe, because students from high-SES families have more access to resources; similar reasoning applies to IQTe.
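For concreteness, a correlation matrix of this form can be written down directly in R and later passed to the data generation step. The values below are illustrative placeholders only, not the values used in this thesis (which are set in the R code appendix); the snippet simply shows the structure of the object.

    # Illustrative placeholders only; the actual correlations are set in the thesis R code.
    vars  <- c("Age", "Gender", "SES", "ISS", "IQTe", "PreTe")
    sigma <- diag(length(vars))                     # identity matrix: 1s on the diagonal
    dimnames(sigma) <- list(vars, vars)
    sigma["Age", "ISS"]   <- sigma["ISS", "Age"]   <- 0.5   # placeholder value
    sigma["Age", "PreTe"] <- sigma["PreTe", "Age"] <- 0.5   # placeholder value
    sigma["SES", "PreTe"] <- sigma["PreTe", "SES"] <- 0.6   # placeholder value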

3.2 Data Generation

Using the mvtnorm package in R, I generated the covariate data. I then defined an indicator to predict whether an individual should be assigned to the treatment group. This indicator correlates with all of the other variables through the following vector:

Variable    Correlation with Indicator
Age         0.4
Gender      0.2
SES         0.8
ISS         0.6
IQTe        0.9
PreTe       0.6

This indicator for each individual, together with a random draw, was then used as a cue to assign students to the treatment or control group, creating selection bias. An individual whose indicator value was greater than the 75th percentile of the indicator across all individuals had a higher probability of being assigned to the treatment group with IAP = 1; otherwise, the individual had a higher probability of being in the control group with IAP = 0. After the assignment of IAP, the outcome variable PosTe was calculated by taking all variables, including a true effect for IAP and an error term, into consideration. The correlations (before error) between PosTe and the other variables were:

Variable    Correlation with PosTe
Age         0.5
Gender      0.1
SES         0.7
ISS         0.4
IQTe        0.9
PreTe

The true effect of IAP was calculated by multiplying IAP by a true effect index, which in this study is 0.8, representing a significant effect. As a result, the true effect of IAP is $0.8 \cdot IAP$. Finally, an error term was drawn from a normal distribution with mean 0 and standard deviation 0.5. Therefore, the outcome variable PosTe was calculated as:

$$PosTe = \sum \alpha_i V_i + 0.8 \cdot IAP + error$$

where $\sum \alpha_i V_i$ is a linear combination of the other variables $V_i$ with weights $\alpha_i$. The data were created to have means and standard deviations fitting the following pattern:

$$M_{PreTe} < M_{PosTe}, \qquad SD_{PreTe} > SD_{PosTe}$$

The difference (Diff) between the pre-test (PreTe) and post-test (PosTe) was used to create the growth z-score (ZGrowth) for each individual based on their post-test after treatment. The mean and standard deviation were still calculated for PreTe and PosTe. For each of three sample sizes, a small sample size (n = 1000), a medium sample size (n = 2500), and a large sample size (n = 5000), 1000 simulation runs were conducted. After all simulations, the average statistics were extracted by a user-defined function called mean.list(). In order to create differences among the matching methods, I set the ratio of treatment to control group at 1:3, meaning 1/4 of Sample.Size was in the treatment group and 3/4 of Sample.Size was in the control group.
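A compressed sketch of this generation scheme is shown below. It is not the thesis code from the appendix: the correlation matrix sigma is the illustrative one defined earlier, the indicator weights follow the vector reported above (with a placeholder weight where the PosTe correlation with PreTe is not shown), and the probabilistic assignment around the 75th percentile is simplified to a deterministic cut.

    library(mvtnorm)
    set.seed(1)

    n <- 1000
    # Covariates drawn from a multivariate normal with the illustrative matrix 'sigma'
    X <- as.data.frame(rmvnorm(n, mean = rep(0, length(vars)), sigma = sigma))
    names(X) <- vars

    # Selection indicator built from the weight vector reported in the text, plus noise
    ind <- with(X, 0.4 * Age + 0.2 * Gender + 0.8 * SES + 0.6 * ISS +
                   0.9 * IQTe + 0.6 * PreTe) + rnorm(n)

    # Simplified assignment: above the 75th percentile of the indicator -> treated (IAP = 1)
    X$IAP <- as.numeric(ind > quantile(ind, 0.75))

    # Outcome with a true treatment effect of 0.8 and N(0, 0.5) error
    # (covariate weights are illustrative; the PreTe weight is a placeholder)
    X$PosTe <- with(X, 0.5 * Age + 0.1 * Gender + 0.7 * SES + 0.4 * ISS +
                       0.9 * IQTe + 0.5 * PreTe) + 0.8 * X$IAP + rnorm(n, 0, 0.5)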

3.3 Propensity Score Matching

After all the data were generated, matching was conducted with each of the five propensity score matching methods, Nearest Neighbor Matching (1:1, 2:1), Caliper Matching, Stratification Matching, and Full Matching, using the MatchIt package in R.

3.4 Analysis of Covariance for Data Before Matching

For each generated dataset, an ANCOVA was conducted before applying the matching methods in order to see whether there is a significant difference in the outcome variable of the generated data. In this particular research setting, the ANCOVA results should not be significant. The ANCOVA model was:

$$PosTe_i = \mu + IAP_i + \alpha_i V_i + error_i$$

where $\mu$ is the overall mean, $IAP_i$ is the treatment effect, and $\alpha_i V_i$ is the linear regression on the other variables. For the matched data under the different matching methods, an independent sample t-test was conducted on the z-score growth (ZGrowth) to determine whether the matching result was statistically significant. The ANCOVA served as a basis of comparison for the implemented propensity score methods. Then, the average effect of the treatment on the treated (ATT) was calculated for each propensity score matching method. Finally, I calculated the percentage of bias reduced, to show the balance improvement achieved by the different methods, by subtracting the standardized mean difference of the matched data from that of the unmatched data and dividing by the standardized mean difference of the unmatched data:

$$RB = \frac{MD_{unmatched} - MD_{matched}}{MD_{unmatched}} \times 100\%$$

where $RB$ represents the reduction in bias and $MD$ represents the mean difference. All of these analyses, including the ANCOVA, t-tests, and confidence intervals, made it possible to examine whether any of the propensity score matching methods differed substantially from the others, so that the best method could be selected.

3.5 Selecting the Appropriate Method

These analyses were simulated 1000 times and the results synthesized. Based on the results from the ANCOVA and the independent t-tests, the appropriate propensity score matching method could be determined by examining the sample size, average mean difference, balance improvement on covariates, and percent improvement. An appropriate propensity score matching method should result in a high balance improvement on covariates and a high percent improvement in selection bias. Following Rubin's (2001) suggestion, I examined whether the mean difference for the different propensity matching methods was less than .25. If it was greater than .25 under a given sample size, such a method was not appropriate. Then, I used the percent improvement of the matched data to examine whether any of those methods significantly improved the data. The methods that reduced the largest percentage of bias were considered the best.

In order to present and explain the results, I first explain the results of the simulation, including descriptive characteristics of the variables in the datasets, the

results of the ANCOVA analysis of the unmatched data, and the matching results for the related matching methods. Then I explain the results of the independent t-tests in detail for the different matching methods. The entire code for this study is provided in Appendix B.

4. Results

4.1 Understand the Generated Data

It is important to first understand the generated sample. Table 1 shows the standardized summary of the generated data with sample size 1000. The variable Age had a mean of 18 with a standard deviation of 2. From Table 1, the youngest individual was 12 years old, while the oldest was 24 years old. For the binomial variable Gender, a positive value (Gender ≥ 0) represents male, while a negative value (Gender < 0) stands for female. The mean for Gender was 0.11 and the median was 0.09, showing there was only a slight difference in the numbers of males and females. The variable SES separated into three categories: 1/3 of the sample with low SES, 1/3 with medium SES, and the remaining 1/3 with high SES. Like Gender, the variable ISS was considered dichotomously, where a positive value (ISS ≥ 0) represents students who had an Individual Study Strategy, and a negative value (ISS < 0) stands for not having an Individual Study Strategy. From Table 1, the ratio of students who have an ISS to those who do not was near 1:1. The variables IQTe, PreTe, and PosTe were all normally distributed. Comparing PosTe and PreTe, PosTe has a higher mean and a narrower range, because the data were generated to replicate an education setting where people prefer not only a higher

mean score but also a narrower standard deviation. Diff was calculated by the formula Diff = PosTe − PreTe, and ZGrowth was calculated based on the variable Diff.

At the baseline observation, there are six small subcategories in the generated dataset:

High SES students with ISS
High SES students without ISS
Medium SES students with ISS
Medium SES students without ISS
Low SES students with ISS
Low SES students without ISS

The average proportion of the sample falling in each category across all 1000 simulation runs is shown in Table 2.

4.2 Reducing Distance Using Propensity Scores

After generating the data, propensity scores were estimated based on Age, Gender, SES, ISS, IQTe, and PreTe using the following model:

IAP ~ Age + Gender + SES + ISS + IQTe + PreTe

Before matching, the standardized mean difference between the groups is large, a distance that should result in high selection bias. Propensity score matching methods are intended to reduce this mean difference as much as possible; the more it is reduced, the better the propensity score matching method. After applying propensity score matching, the standardized mean difference decreases to close to 0 for Nearest Neighbor (1:1), Nearest Neighbor (2:1), Caliper, and Full matching. For the Caliper matching method, the variability of the propensity scores is restricted

because, unlike Nearest Neighbor (1:1) and Nearest Neighbor (2:1), there is a restriction on the distance between estimated propensity scores. That is the reason why, for the Caliper matching method, the standard deviation of the control group is smaller than for all other methods. Table 3 provides the average standardized mean difference in the propensity score between the treatment group and the control group across all simulation runs for the different matching methods, for both unmatched and matched data. According to Table 3, it is clear that, compared with the unmatched data, all matching methods other than Stratification Matching performed well in reducing the standardized mean difference.

4.3 Results for Different Matching Methods

I applied the nearest neighbor (1:1), nearest neighbor (2:1), caliper with 0.25 SD, stratification with 5 subclasses, and full matching methods to the generated data to obtain treated individuals better balanced with control individuals. Table 4 and Figure 2 show the average matching results for the different matching methods. From Table 4, it is clear that, for the generated data of size 1000, there are on average 690 individuals in the control group and 310 individuals in the treatment group. For nearest neighbor (1:1) and nearest neighbor (2:1), all treated individuals were matched with the control group; as a result, a total of 620 individuals are involved in the matching. For caliper matching, not all individuals in the treatment group were matched with the control group, because the caliper imposes a maximum distance. In this matching method, 510 individuals are involved in the matching.

For Stratification and Full matching, all individuals are included in the matching. It is important to take into account how many individuals are involved in the matching, because the more individuals involved, the less bias. Finally, for all of the matching methods there were no discarded individuals, meaning that all individuals in the treatment and control groups could be used by the matching methods in this particular study. For a real data analysis, however, the number of discarded individuals will not necessarily be zero.

Figure 3 is a jitter plot of the distribution of propensity scores for the different matching methods, which shows how individuals in the treatment and control groups matched with each other. Nearest neighbor (1:1), nearest neighbor (2:1), and caliper matching have more individuals in the control group, and as a result the extreme values in the different groups are eliminated. Because not all individuals are involved in the matching, there is more bias, though it is more precise. For stratification matching and full matching, all individuals were included in the matching; such results have less bias but lower precision. It is very important to keep the trade-off between bias and precision in mind, because it is an essential factor in selecting the best and most appropriate matching method.

Figure 4 shows the histograms of the propensity score for both matched and unmatched individuals in both the treatment and control groups. The distributions of the propensity score for the treatment group are identical for matched and unmatched data for all of these matching methods except caliper matching, because all individuals in the treatment group are involved in those matching

methods. For the caliper matching method, the distribution is slightly different between the matched and unmatched data, because not all treated individuals are involved in the matching. Another important point is that the distributions of the propensity score for the control group are identical for matched and unmatched data for the stratification and full matching methods, because all individuals are used in those matching methods. For nearest neighbor (1:1), nearest neighbor (2:1), and caliper matching, however, the distributions are different.

In order to select the best matching method, I examined whether the standardized mean difference for each individual covariate and for the propensity score was balanced. Table 5 shows the average standardized mean difference for each of the covariates and the propensity score for all matching methods. According to Table 5, before matching, the mean difference between the treatment and control groups for nearest neighbor (1:1) under sample size 1000 is large, while after matching the mean difference is only a small fraction of that. It is clear that without applying propensity score matching methods there is more bias. There is a 96%, 50%, 100%, 86%, and 98% improvement, respectively, after applying the different matching methods, on average across the 1000 simulation runs. Thus, the order from best to worst is:

Caliper > Full > Nearest Neighbor (1:1) > Stratification > Nearest Neighbor (2:1)

Before determining the best method, the statistical significance of the difference must be examined, so I ran independent sample t-tests to examine it.
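The jitter plots and histograms referred to above (Figures 3 and 4), and the balance summaries behind Tables 3 through 5, are standard MatchIt outputs. A hedged sketch, assuming a fitted matchit object such as m.cal from the earlier examples:

    # Balance summary: standardized mean differences for the covariates and the propensity
    # score ("distance") before and after matching, plus percent balance improvement
    summary(m.cal)

    # Jitter plot of propensity scores for matched and unmatched treated and control units
    plot(m.cal, type = "jitter")

    # Histograms of propensity scores for matched and unmatched units in both groups
    plot(m.cal, type = "hist")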

4.4 Results for Independent t-test

For the matched data, an independent t-test was conducted to help decide whether there is a significant mean difference between the data before matching and the data after matching. Table 6 provides the results of the independent t-tests across all simulation runs. For each matching method, the degrees of freedom are greater than 120, and the observed t-values exceed the two-tailed critical value. It is clear that the independent t-tests show a statistically significant difference between the data before and after matching. These results are reasonable, since propensity score matching methods aim to decrease the selection bias.

4.5 Result for ANCOVA on Data Before Matching

Before matching, an analysis of covariance (ANCOVA) was conducted to see whether there is a significant difference between the treatment and control groups before matching:

$$PosTe_i = \mu + IAP_i + \alpha_i V_i + error_i$$

where $\mu$ is the overall mean, $IAP_i$ is the treatment effect, and $\alpha_i V_i$ is the linear regression on the other variables. The results of the ANCOVA across all simulation runs are shown in Table 7. According to Table 7, the ANCOVA failed to detect a significant difference in the data before matching, which differs from the t-test results. This is reasonable, since after applying propensity score matching methods there is supposed to be a significant difference between treated and control individuals on the outcome variable.
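A minimal sketch of how these two checks might be run in R is given below. It assumes the simulated data frame dat (with ZGrowth computed from Diff) and a fitted matchit object, and uses MatchIt's match.data() to extract the matched sample; it is illustrative and not the exact code from Appendix B, where the thesis's precise comparisons are defined.

    # ANCOVA on the unmatched data: outcome regressed on the treatment and covariates
    summary(aov(PosTe ~ Age + Gender + SES + ISS + IQTe + PreTe + IAP, data = dat))

    # Extract the matched sample from a fitted matchit object
    matched <- match.data(m.cal)

    # Independent t-test on the growth z-score in the matched sample (illustrative comparison)
    t.test(ZGrowth ~ IAP, data = matched)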

It also needs to be pointed out that, as the sample size increases, the percent improvement increases for all propensity score matching methods other than Full matching, and the mean difference decreases. More individuals in the treatment and control groups result in a better matching result.

5. Conclusion

In the present study, I intended to make an overall decision selecting the best and most appropriate propensity score matching method among several matching methods under different sample sizes, and to determine whether any of those matching methods improve balance substantially. Table 8 provides the comparison of percent improvement, mean difference, and t-value for all matching methods under the different sample sizes. From Table 8, it is clear that all of the matching methods achieve statistical significance after matching under the different sample sizes.

Percent improvement: Caliper > Full > Nearest Neighbor (1:1) > Stratification > Nearest Neighbor (2:1)

Mean difference: Stratification > Nearest Neighbor (2:1) > Nearest Neighbor (1:1) > Full > Caliper

It is essential to notice that, as the sample size increases, the percent improvement for each individual propensity matching method increases while the mean difference decreases, except for the full matching method. This is reasonable because as the sample size increases, the numbers of individuals in both the treatment and control groups increase, and the more individuals are involved in matching, the more the bias decreases and

precision increases. On the other hand, the percent improvement for the full matching method decreases as the sample size increases. Future study is needed to determine why this happens. According to my study, the caliper matching method results in the highest percent improvement under all of the given sample sizes and the most balanced covariates. Sample size does not affect the ordering of the propensity score matching methods.

In conclusion, first, propensity score matching methods really do reduce selection bias and create comparable matched groups. Next, this study suggests that the larger the sample size involved, the more accurate the matching results will be. As a result, I recommend using propensity score matching methods as a practical approach when conducting matching, especially caliper matching and full matching, because caliper matching is the most precise method in this particular study and reduces the most bias, while full matching uses the most individuals. If the sample size is too small to use the caliper matching method, full matching is highly recommended.

6. Discussion

In educational research, experimental designs with random sampling and random assignment provide the strongest evidence for making causal inference decisions. However, it is impractical in many real settings to achieve randomization. Therefore, research in educational settings is often based strongly on observational data or quasi-experiments. Propensity score matching methods can serve as a useful tool for educational researchers to obtain causal inference without randomly assigned data.

In my simulation, all of the considered propensity score matching methods worked well across the different sample sizes and resulted in an improved and balanced set of matched data. This balance is supported both by a non-significant ANCOVA result on the data before matching and by a significant t-test result on the data after matching. My study suggests that the caliper matching method performs best in minimizing the propensity score distance under small (n = 1000), medium (n = 2500), and large (n = 5000) sample sizes, but it always involves the smallest number of individuals, which increases its bias. Additionally, Rubin and Rosenbaum's (1985) suggestion to use 0.25 SD as the caliper was followed. However, later researchers argue that Rubin and Rosenbaum arrived at 0.25 SD by considering matching on the Mahalanobis distance, which is not directly related to the propensity score. Pawel (2011) made a slight modification to the caliper mechanism so that the selection of the caliper is linked with the estimated propensity scores:

$$C(P_i) = \min_j \lvert P_i - P_j \rvert < \delta P_i, \qquad \delta > \min_j \frac{\lvert P_i - P_j \rvert}{P_i}$$

As Pawel claims, by directly linking the caliper with the estimated propensity score, the modified mechanism will result in better matches from the control group. Future study is needed to replace 0.25 SD with this modified mechanism to see whether it results in better matching.

Another limitation of this study concerns the correlation matrix. I defined a reasonable correlation matrix based on experience and knowledge. In a real data

analysis, if the correlation matrix is not the same as or similar to my generated correlation matrix, the results for the different matching methods may vary. Future studies should derive correlation matrices from existing educational datasets and possibly vary the relationships in the data to cover a wider array of potential settings.

When applying matching methods, the ideal result would be matching as many control individuals as possible with individuals in the treatment group on all relevant covariates. However, in practical settings it is difficult to achieve this goal, because if it were achieved, the sample size would be reduced substantially. Selecting a proper propensity score matching method provides a beneficial compromise. I had hypothesized the ranking of the propensity score matching methods as:

Caliper > Nearest Neighbor (1:1) > Full > Stratification > Nearest Neighbor (2:1)

However, after this study, the resulting ranking is:

Caliper > Full > Nearest Neighbor (1:1) > Stratification > Nearest Neighbor (2:1)

Finally, matching on the propensity score results in very similar groups, while the unmatched data form very dissimilar groups. To examine this dissimilarity, I conducted an analysis of covariance (ANCOVA) on the data before matching. The ANCOVA approach rests on the assumption that the groups involved are either similar on all covariates other than the treatment, or differ only on irrelevant variables. In this particular study, covariate distributions were similar before and after the propensity score matching methods. In all propensity score

matching methods, a significant difference in the outcome variable was detected, while the ANCOVA failed to detect a significant difference. These simulation results suggest that the propensity score matching methods achieve more accurate results than ANCOVA. Beyond these findings, as in many studies, some questions still remain. It would also be of interest to compare these propensity score matching methods with other existing matching methods, such as weighting adjustments, Mahalanobis metric matching, and optimal matching. Even within propensity score matching methods, it is also important to examine how much the balance of covariates matters among all those methods. Such future studies could provide more evidence for propensity score matching methods and for how different propensity score matching methods balance individual covariates. It should also be explored why, for the full matching method, the percent improvement decreases while the sample size increases.

7. References

Abadie, A., & Imbens, G. (2006). Large Sample Properties of Matching Estimators for Average Treatment Effects. Econometrica, 74.

Alan, G., Frank, B., Tetsuhisa, M., Xuefei, M., L, F., Fabian, S., et al. (2012, December 10). mvtnorm: Multivariate Normal and t Distributions. Retrieved January 19, 2013, from The R Project for Statistical Computing.

Alberto, A., & Guido, W. I. (2009). Matching on the Estimated Propensity Score. NBER Working Paper Series.

Althauser, R., & Rubin, D. B. (1970). The Computerized Construction of a Matched Sample. American Journal of Sociology, 76.

Austin, P. C., Grootendorst, P., & Anderson, G. M. (2007). A Comparison of the Ability of Different Propensity Score Models to Balance Measured Variables Between Treated and Untreated Subjects: A Monte Carlo Study. Statistics in Medicine, 26.

Bagley, S. C., White, H., & Golomb, B. A. (2001). Logistic Regression in the Medical Literature: Standards for Use and Reporting, with Particular Attention on the Medical Domain. Journal of Clinical Epidemiology, 54.

Brookhart, M. A., Schneeweiss, S., Rothman, K. J., Glynn, R. J., Avorn, J., & Sturmer, T. (2006). Variable Selection for Propensity Score Models. American Journal of Epidemiology, 163(12).

Caliendo, M., & Kopeinig, S. (2005). Some Practical Guidance for the Implementation of Propensity Score Matching.

Christakis, N. A., & Iwashyna, T. I. (2003). The Health Impact of Health Care on Families: A Matched Cohort Study of Hospice Use by Decedents and Mortality Outcomes in Surviving, Widowed Spouses. Social Science and Medicine, 57(3).

Cochran, W., & Rubin, D. B. (1973). Controlling Bias in Observational Studies. Sankhya, 35.

Cochrane, W., & Chambers, S. (1965). The Planning of Observational Studies of Human Populations. Journal of the Royal Statistical Society, 128.

Couper, M. P. (2000). Web Surveys: A Review of Issues and Approaches. Public Opinion Quarterly, 64.

D'Agostino, R. B., Jr. (1998). Tutorial in Biostatistics: Propensity Score Methods for Bias Reduction in Comparison of a Treatment to a Non-randomized Control Group. Statistics in Medicine, 17.

Dehejia, R., & Wahba, S. (1999). Causal Effects in Non-experimental Studies: Reevaluation of the Evaluation of Training Programs. Journal of the American Statistical Association, 94.

Dehejia, R., & Wahba, S. (2002). Propensity Score Matching Methods for Nonexperimental Causal Studies. Review of Economics and Statistics, 84.

Fan, X., & Nowell, D. L. (2011). Using Propensity Score Matching in Educational Research. Gifted Child Quarterly.

Feng, J., & Kai, X. (1983). A Comparison of Propensity Score Methods for Evaluating the Effects of Programs with Multiple Versions.

Gu, X., & Rosenbaum, P. R. (1993). Comparison of Multivariate Matching Methods: Structures, Distances, and Algorithms. Journal of Computational and Graphical Statistics, 2.

Hansen, B. B. (2004). Full Matching in an Observational Study of Coaching for the SAT. Journal of the American Statistical Association, 99.

Heckman, J. (1979). Sample Selection Bias as a Specification Error. Econometrica, 47.

Heckman, J. J., Ichimura, H., Smith, J., & Todd, P. (1998). Characterizing Selection Bias Using Experimental Data. Econometrica, 66(5).

Herron, M., & Wand, J. (2007). Assessing Partisan Bias in Voting Technology: The Case of the 2004 New Hampshire Recount. Electoral Studies, 26(2).

Hirano, K., & Imbens, G. W. (2001). Estimation of Causal Effects Using Propensity Score Weighting: An Application to Data on Right Heart Catheterization. Health Services and Outcomes Research Methodology, 2.

Ho, D., Stuart, E., Imai, K., & King, G. (2011, October 24). MatchIt. Retrieved January 19, 2013, from The R Project for Statistical Computing.

Imbens, G. W. (2000). The Role of the Propensity Score in Estimating Dose-Response Functions. Biometrika, 87.

Jasjeet, S. S. (2011). Multivariate and Propensity Score Matching Software with Automated Balance Optimization: The Matching Package for R. Journal of Statistical Software, 42(7).

LaLonde, R. J. (1986). Evaluating the Econometric Evaluations of Training Programs. American Economic Review, 76.

Marco, C., & Sabine, K. (2005). Some Practical Guidance for the Implementation of Propensity Score Matching. Discussion Paper Series.

Masafumi, F. (2011). Effects of Variables in a Response Propensity Score Model for Survey Data Adjustment: A Simulation Study. Behaviormetrika, 38(1).

Morgan, S. L., & Harding, D. J. (2006). Matching Estimators of Causal Effects: Prospects and Pitfalls in Theory and Practice. Sociological Methods and Research, 35(1).

Needleman, S., & Wunsch, C. (1970). A General Method Applicable to the Search for Similarities in the Amino Acid Sequence of Two Proteins. Journal of Molecular Biology, 48(3).

Onur, B. (2006). Too Much Ado about Propensity Score Models? Comparing Methods of Propensity Score Matching. International Society for Pharmacoeconomics and Outcomes Research.

Parsons, L. S. (2001). Reducing Bias in a Propensity Score Matched-Pair Sample Using Greedy Matching Techniques. In SAS SUGI 26.

Pawel, S. (2011). Dynamic Caliper Matching. Central European Journal of Economic Modeling and Econometrics.

Perkins, S. M., Tu, W., Underhill, M. G., Zhou, X. H., & Murray, M. D. (2000). The Use of Propensity Scores in Pharmacoepidemiological Research. Pharmacoepidemiology and Drug Safety, 9.

Rosenbaum, P. R. (1991). A Characterization of Optimal Designs for Observational Studies. Journal of the Royal Statistical Society, 53(3).

Rosenbaum, P. R. (2002). Observational Studies (2nd ed.). New York, NY: Springer-Verlag.

Rubin, D. B. (1977). Assignment to a Treatment Group on the Basis of a Covariate. Journal of Educational Statistics, 2.

Rubin, D. B. (1974). Estimating Causal Effects of Treatment in Randomized and Nonrandomized Studies. Journal of Educational Psychology, 66(5), 689.

Rubin, D. B. (1997). Estimating Causal Effects from Large Data Sets Using Propensity Scores. Annals of Internal Medicine, 127.

Rubin, D. B. (2006). Matched Sampling for Causal Effects. New York: Cambridge University Press.

Rubin, D. B. (1973). Matching to Remove Bias in Observational Studies. Biometrics, 29.

Rubin, D. B. (2001). Using Propensity Scores to Help Design Observational Studies: Application to the Tobacco Litigation. Health Services and Outcomes Research Methodology, 2(1).

Rubin, D. B., & Rosenbaum, P. R. (1985). Constructing a Control Group Using Multivariate Matched Sampling Methods That Incorporate the Propensity Score. The American Statistician, 39(1).

Rubin, D. B., & Rosenbaum, P. R. (1984). Reducing Bias in Observational Studies Using Subclassification on the Propensity Score. Journal of the American Statistical Association, 79.

Rubin, D. B., & Rosenbaum, P. R. (1983). The Central Role of the Propensity Score in Observational Studies for Causal Effects. Biometrika, 70.

Smith, J., & Todd, P. (2005). Does Matching Overcome LaLonde's Critique of Nonexperimental Estimators? Journal of Econometrics, 125(1-2).

Susanne, S. (2012, October 29). Propensity Score Based Data Analysis. Retrieved February 19, 2013.

Thomas, N., & Rubin, D. B. (1996). Matching Using Estimated Propensity Scores: Relating Theory to Practice. Biometrics, 52(1).

Victor, M. (2011). What Is Selection and Endogeneity Bias and How Can We Address It? University of Washington, Seattle.

York, R. O. (1998). Conducting Social Work Research. Boston: Allyn and Bacon.

Figure 1: Different Propensity Score Matching Methods Considered in This Paper (diagram of the five methods: Nearest Neighbor Matching (1:1), Nearest Neighbor Matching (2:1), Caliper Matching, Stratified Matching, and Full Matching).
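For illustration only, the five methods in Figure 1 can be requested through MatchIt roughly as follows; this is a sketch rather than the exact code used in this study, and the data frame name dat and treatment indicator treat are placeholders for the generated data and the assignment variable.

library(MatchIt)

# Propensity score model on the measured covariates used in this study;
# "treat" and "dat" are placeholder names, not objects defined in the thesis.
f <- treat ~ Age + Gender + SES + ISS + IQTe + PreTe

m.nn1   <- matchit(f, data = dat, method = "nearest", ratio = 1)       # Nearest Neighbor (1:1)
m.nn2   <- matchit(f, data = dat, method = "nearest", ratio = 2)       # Nearest Neighbor (2:1)
m.cal   <- matchit(f, data = dat, method = "nearest", caliper = 0.25)  # Caliper of 0.25 SD
m.strat <- matchit(f, data = dat, method = "subclass", subclass = 5)   # Stratification, 5 subclasses
m.full  <- matchit(f, data = dat, method = "full")                     # Full matching (requires optmatch)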

Appendix A1: Results for 1000 Sample Size

Table 1: Standardized Summary of Generated Data with 1000 Sample Size (Min., 1st Qu., Median, Mean, 3rd Qu., and Max. for Age, Gender, SES, ISS, IQTe, PreTe, IAP, PosTe, Diff, and ZGrowth).
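A column-wise summary of this form can be produced with base R; a minimal sketch, with dat again standing in for one generated data set of n = 1000:

# Six-number summary (Min., 1st Qu., Median, Mean, 3rd Qu., Max.) per variable
summary(dat[, c("Age", "Gender", "SES", "ISS", "IQTe",
                "PreTe", "IAP", "PosTe", "Diff", "ZGrowth")])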

Table 2: Average Proportions for SES and ISS (SES categories cross-tabulated with and without ISS).
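Proportions of this kind can be obtained by cross-tabulation; a sketch under the assumption that SES and ISS are stored as categorical columns of dat (how the thesis averaged these proportions over simulation runs is not reproduced here):

# Joint proportions of SES category by ISS status for one generated data set
round(prop.table(table(dat$SES, dat$ISS)), 3)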

Table 3: Standardized Mean Difference on the Propensity Score with 1000 Sample Size after Applying Matching Methods, Averaged across All Simulation Runs (columns: Means Treated, Means Control, SD Control, Mean Diff; rows: Before Matching, Nearest Neighbor (1:1), Nearest Neighbor (2:1), Caliper Matching, Stratification, Full Matching).
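The standardized mean difference on the propensity score can be computed directly from match.data() output, dividing by the control-group standard deviation as in the SD Control column; a sketch using the m.nn1 object and the placeholder treatment column treat from the example after Figure 1:

# MatchIt stores the estimated propensity score in the "distance" column
md  <- match.data(m.nn1)
smd <- (mean(md$distance[md$treat == 1]) - mean(md$distance[md$treat == 0])) /
        sd(md$distance[md$treat == 0])
smd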

Table 4: Matching Results for Different Matching Methods under 1000 Sample Size (counts of All, Matched, Unmatched, and Discarded units in the control and treated groups for Nearest Neighbor (1:1), Nearest Neighbor (2:1), Caliper with 0.25 SD, Stratification with 5 Subclasses, and Full Matching).
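The All / Matched / Unmatched / Discarded counts reported here correspond to the sample-size table printed by summary() for a matchit object; a sketch, assuming the objects from the example after Figure 1:

# Sample sizes by group; print(summary(m.nn1)) shows the same table,
# and the nn component is assumed to hold it as a matrix.
s <- summary(m.nn1)
s$nn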

Table 5: Average Standardized Mean Difference for Covariates and Propensity Score of Different Matching Methods under Different Sample Sizes (columns: Sample Size; Means Treated, Means Control, and Mean Diff before and after matching; Percent Improvement. Rows: Nearest Neighbor (1:1), Nearest Neighbor (2:1), Caliper with 0.25 SD, Stratification with 5 Subclasses, and Full Matching).
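Standardized mean differences before and after matching, from which the percent improvement figures are computed, can be requested through the standardize argument of MatchIt's summary method; a sketch:

# Balance statistics on the standardized scale for one matching solution
summary(m.nn1, standardize = TRUE)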

Table 6: Results of the Independent t Test for Different Matching Methods with 1000 Sample Size (t-value, p-value, α-level, and 95% CI upper and lower bounds for Nearest Neighbor (1:1), Nearest Neighbor (2:1), Caliper, Stratification, and Full Matching).
H0: the difference between the means is 0 (μ Before Matching = μ After Matching).
H1: the difference between the means is not 0 (μ Before Matching ≠ μ After Matching).
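A sketch of the corresponding test in R, where diff_before and diff_after are placeholder vectors holding the per-run mean differences for one method before and after matching (these names are not objects defined in the thesis):

# Independent two-sample t test with a 95% confidence interval
t.test(diff_before, diff_after, conf.level = 0.95)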

Table 7: Overall ANCOVA Results with 1000 Sample Size (Df, Sum Sq, Mean Sq, F value, and Pr(>F) for Age, Gender, SES, ISS, IQTe, PreTe, and Residuals).
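An overall ANCOVA with these terms can be fit with aov(); whether the model used PosTe, Diff, or ZGrowth as the response is an assumption here, and PosTe is used purely for illustration:

# ANCOVA-style model of the outcome on the measured covariates (unmatched data)
fit <- aov(PosTe ~ Age + Gender + SES + ISS + IQTe + PreTe, data = dat)
summary(fit)  # Df, Sum Sq, Mean Sq, F value, Pr(>F)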

Table 8: Combined Overall Comparison of Propensity Score Matching Methods under Different Sample Sizes (Mean Diff, Percent Improvement, and t-value for each of the five methods at sample sizes of 1000, 2500, and 5000).
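A hedged sketch of how the same matching pipeline can be repeated across the three sample sizes compared in Table 8; generate_data() is a hypothetical stand-in for the data generation step described in the Methods chapter, not a function provided by MatchIt or by the thesis:

sizes <- c(1000, 2500, 5000)
results <- lapply(sizes, function(n) {
  dat <- generate_data(n)  # hypothetical data-generating function
  m   <- matchit(treat ~ Age + Gender + SES + ISS + IQTe + PreTe,
                 data = dat, method = "nearest", caliper = 0.25)
  summary(m, standardize = TRUE)
})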

Table 9: Average Standardized Mean Difference for Covariates and Propensity Score of Different Matching Methods with 1000 Sample Size (Means Treated, Means Control, and Mean Diff before and after matching, and Percent Improvement, reported for the distance (propensity score), Age, Gender, SES, ISS, IQTe, and PreTe under each of the five matching methods).

Figure 2: Matching Results for Different Matching Methods with 1000 Sample Size.

Figure 3: Jitter Plot of the Distribution of the Propensity Score for Different Methods with 1000 Sample Size (panels: Nearest Neighbor (1:1), Nearest Neighbor (2:1), Caliper with 0.25 SD, Stratification with 5 Subclasses, and Full Matching).
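Jitter plots of this kind are available directly from MatchIt's plot method; a sketch using one of the objects from the example after Figure 1:

# Jittered propensity scores for matched and unmatched treated and control units
plot(m.nn1, type = "jitter")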

Figure 4: Average Histograms of the Propensity Score for Matched and Unmatched Individuals in Both Treatment and Control Groups across All Simulation Runs with 1000 Sample Size (panels: Nearest Neighbor (1:1), Nearest Neighbor (2:1), Caliper with 0.25 SD, Stratification with 5 Subclasses, and Full Matching).
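The histograms are likewise produced by the plot method; a sketch:

# Propensity score histograms for matched and unmatched units in both groups
plot(m.nn1, type = "hist")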

Appendix A2: Results for 2500 Sample Size

Table 10: Standardized Summary of Generated Data with 2500 Sample Size (Min., 1st Qu., Median, Mean, 3rd Qu., and Max. for Age, Gender, SES, ISS, IQTe, PreTe, IAP, PosTe, Diff, and ZGrowth).

Table 11: Matching Results for Different Matching Methods under 2500 Sample Size (counts of All, Matched, Unmatched, and Discarded units in the control and treated groups for each of the five matching methods).

Table 12: Results of the Independent t Test for Different Matching Methods with 2500 Sample Size (t-value, p-value, α-level, and 95% CI upper and lower bounds for each of the five matching methods, under the same null and alternative hypotheses as Table 6).

Table 13: Overall ANCOVA Results with 2500 Sample Size (Df, Sum Sq, Mean Sq, F value, and Pr(>F) for Age, Gender, SES, ISS, IQTe, PreTe, and Residuals).

Table 14: Average Standardized Mean Difference for Covariates and Propensity Score of Different Matching Methods with 2500 Sample Size (same layout as Table 9).

Figure 5: Matching Results for Different Matching Methods with 2500 Sample Size.

Figure 6: Jitter Plot of the Distribution of the Propensity Score for Different Methods with 2500 Sample Size (panels: Nearest Neighbor (1:1), Nearest Neighbor (2:1), Caliper with 0.25 SD, Stratification with 5 Subclasses, and Full Matching).

Figure 7: Average Histograms of the Propensity Score for Matched and Unmatched Individuals in Both Treatment and Control Groups across All Simulation Runs with 2500 Sample Size (panels as in Figure 4).

Appendix A3: Results for 5000 Sample Size

Table 15: Standardized Summary of Generated Data with 5000 Sample Size (Min., 1st Qu., Median, Mean, 3rd Qu., and Max. for Age, Gender, SES, ISS, IQTe, PreTe, IAP, PosTe, Diff, and ZGrowth).

Table 16: Matching Results for Different Matching Methods under 5000 Sample Size (counts of All, Matched, Unmatched, and Discarded units in the control and treated groups for each of the five matching methods).

Table 17: Results of the Independent t Test for Different Matching Methods with 5000 Sample Size (t-value, p-value, α-level, and 95% CI upper and lower bounds for each of the five matching methods, under the same null and alternative hypotheses as Table 6).

Table 18: Overall ANCOVA Results with 5000 Sample Size (Df, Sum Sq, Mean Sq, F value, and Pr(>F) for Age, Gender, SES, ISS, IQTe, PreTe, and Residuals).

Table 19: Average Standardized Mean Difference for Covariates and Propensity Score of Different Matching Methods with 5000 Sample Size (same layout as Table 9).

Figure 8: Matching Results for Different Matching Methods with 5000 Sample Size.