A Comparative Assessment of Disclosure Risk and Data Quality between MASSC and Other Statistical Disclosure Limitation Methods
By Feng Yu and Neeraja Sathe
RTI International is a trade name of Research Triangle Institute. www.rti.org

Objective of this paper
Compare three methods of statistical disclosure limitation (SDL):
  MASSC (an RTI product)
  Post Randomization Method (PRAM), using R
  Random swapping, using SAS

What is disclosure?
Disclosure refers to inappropriate attribution of information to a data subject, whether an individual or an organization.
Disclosure occurs when:
  A data subject is identified on a released file (identity disclosure).
  Sensitive information about a data subject is revealed through the released file (attribute disclosure).
  It becomes possible to determine the value of some characteristic of a data subject more accurately than would otherwise have been possible (inferential disclosure).

Examples of Identifying and Sensitive Data
Direct identifiers: Social Security numbers, addresses, names, etc.
Indirect identifiers: a combination of variables such as gender, race, and occupation (e.g., female, Asian, astronaut).
Sensitive data: substance use, criminal activity, health outcomes, income, etc.

Types of Intrusion
Outside intrusion occurs when an intruder tries to identify a sample record by matching it to an external database, without prior knowledge of who is in the sample.
Inside intrusion occurs when an unauthorized person tries to link a record in a microdata file to an identifiable respondent whom the intruder knows to be in the file.

What is statistical disclosure limitation (SDL)?
SDL comprises techniques applied to statistical data before release to minimize or limit the potential for identifying individuals.
Before releasing statistical tables or microdata files, federal agencies use a variety of statistical methods to protect their data and to ensure that the risk of disclosure is very small.
Beyond being ethical and necessary for maintaining adequate survey response rates, disclosure protection is required by law: the Confidential Information Protection and Statistical Efficiency Act (CIPSEA).

Current Common SDL Methods
For tabular data: cell suppression, controlled rounding, synthetic substitution.
For microdata: restrict data dissemination, strip off direct identifiers, topcode or bottomcode sensitive items, collapse categories, random swapping, perturbation, generate synthetic data, etc.

Details on Our Research
MASSC is an SDL method, developed at RTI, for treating microdata files.
We compare MASSC with two other SDL methods by examining the degree to which each affects data quality and lowers disclosure risk.
The other two methods are random swapping and the Post RAndomisation Method (PRAM).

Method 1: Random Swapping (using SAS)
Random swapping (Dalenius and Reiss, 1978) is an SDL technique used for categorical variables.
Values of variables containing sensitive information are swapped between records so that it is difficult for an intruder to definitively identify any individual.
Confidentiality is protected by introducing uncertainty about sensitive data values (SVs).
Consistency checks ensure that only logically consistent swaps are executed.
Certain statistical inferences are preserved because marginal distributions are retained.
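
To illustrate why swapping retains marginal distributions, here is a minimal Python sketch (not the SAS implementation used in this study). The function name random_swap, the toy records, and the variable names are hypothetical, and the consistency checks mentioned above are omitted for brevity.

```python
import random
from collections import Counter

def random_swap(records, swap_var, rate, seed=0):
    """Swap the values of swap_var between randomly paired records.

    rate is the approximate share of records to treat. Values are only
    exchanged, never altered, so the marginal distribution of swap_var
    is unchanged.
    """
    rng = random.Random(seed)
    treated = [dict(r) for r in records]      # copy; leave the input intact
    n_pairs = int(len(treated) * rate) // 2   # two records per swap
    chosen = rng.sample(range(len(treated)), 2 * n_pairs)
    for a, b in zip(chosen[0::2], chosen[1::2]):
        treated[a][swap_var], treated[b][swap_var] = (
            treated[b][swap_var],
            treated[a][swap_var],
        )
    return treated

# Hypothetical toy file: 20 records with one variable to swap.
toy = [{"id": i, "sex": "F" if i % 2 else "M", "region": i % 4} for i in range(20)]
swapped = random_swap(toy, "region", rate=0.2)

# The marginal counts of the swapped variable are identical before and after.
print(Counter(r["region"] for r in toy) == Counter(r["region"] for r in swapped))
```

A production design would restrict swap partners (for example, to records that agree on selected control variables) so that swaps stay logically consistent.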

Method 2: PRAM (using the R package sdcMicro)
PRAM (Gouweleeuw et al., 1998) is also an SDL technique for categorical variables.
It is analogous to adding noise to the values of continuous variables.
When applied to a categorical variable, PRAM changes each record's value on that variable according to a pre-selected probability mechanism.
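
The probability mechanism can be pictured as a transition matrix whose rows give, for each original category, the probability of each released category. The Python sketch below is only an illustration of that idea and does not reproduce the sdcMicro interface; the function pram, the matrix P, and the categories are assumptions made for the example.

```python
import random

def pram(values, transition, seed=0):
    """Draw a released category for each value from its row of transition.

    transition[c] maps every category to the probability that an original
    value c is released as that category (each row sums to 1).
    """
    rng = random.Random(seed)
    released = []
    for v in values:
        categories = list(transition[v].keys())
        probabilities = list(transition[v].values())
        released.append(rng.choices(categories, weights=probabilities, k=1)[0])
    return released

# Hypothetical transition matrix: each value is kept with probability 0.9.
P = {
    "low":  {"low": 0.90, "mid": 0.10, "high": 0.00},
    "mid":  {"low": 0.05, "mid": 0.90, "high": 0.05},
    "high": {"low": 0.00, "mid": 0.10, "high": 0.90},
}

original = ["low"] * 5 + ["mid"] * 5 + ["high"] * 5
treated = pram(original, P)
print(sum(o != t for o, t in zip(original, treated)), "of", len(original), "values changed")
```

Unlike swapping, this mechanism treats each record independently, so marginal counts are preserved only in expectation unless the matrix is chosen with that goal in mind.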

Method 3: MASSC
MASSC (Singh, 2002) consists of the following four major steps:
  Micro Agglomeration: partitions the data into risk strata based on a set of selected identifying variables (IVs).
  Substitution: replaces the IVs of randomly selected records with those of substitution donors, subject to a set of bias constraints.
  Subsampling: randomly deletes some of the records from the data, subject to a set of variance constraints.
  Calibration: adjusts the weights in the subsample to the original total weights in the full analytic file.
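
The last two steps can be pictured with a much-simplified Python sketch. It is not the MASSC algorithm itself: the bias and variance constraints are ignored, a simple ratio adjustment within each stratum stands in for calibration, and the function name, field names, and toy data are all hypothetical.

```python
import random
from collections import defaultdict

def subsample_and_calibrate(records, drop_rate, seed=0):
    """Randomly delete records within each stratum, then rescale weights.

    After deletion, the weights of the surviving records in a stratum are
    multiplied by a common factor so that they sum to the stratum's
    original weight total (a simple ratio adjustment).
    """
    rng = random.Random(seed)
    by_stratum = defaultdict(list)
    for r in records:
        by_stratum[r["stratum"]].append(r)

    kept = []
    for rows in by_stratum.values():
        n_drop = int(len(rows) * drop_rate)
        dropped = set(rng.sample(range(len(rows)), n_drop))
        survivors = [dict(r) for i, r in enumerate(rows) if i not in dropped]
        full_total = sum(r["weight"] for r in rows)
        kept_total = sum(r["weight"] for r in survivors)
        factor = full_total / kept_total
        for r in survivors:
            r["weight"] *= factor
        kept.extend(survivors)
    return kept

# Hypothetical analytic file: 30 records in 3 risk strata.
full_file = [{"stratum": i % 3, "weight": 100.0 + i} for i in range(30)]
calibrated = subsample_and_calibrate(full_file, drop_rate=0.2)

print(len(full_file), "records before,", len(calibrated), "after")
print(sum(r["weight"] for r in full_file), round(sum(r["weight"] for r in calibrated), 6))
```

In this sketch the weighted totals are restored exactly within each stratum; the actual method chooses deletion rates and calibration targets under the constraints listed above.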

Comparing MASSC with the Other Two Methods
We conducted simulations based on a random sample from the combined 2006 and 2007 National Survey on Drug Use and Health (NSDUH) public use files.
  Treatment rates: 10% and 20%
  Simulations: 100 per treatment rate
For risk assessment, we calculated matching rates, that is, the rates at which a record in the treated sample could be correctly linked to the corresponding record in the population.
For utility assessment, we compared the effects on estimated means and regression-model parameters.
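
As a rough illustration of a matching rate, the Python sketch below computes one possible exact-match rate. The definition is an assumption made for the example, not necessarily the one used in the study: a treated record counts as matched when its IV combination occurs exactly once in the original file and that occurrence is its own source record. The sketch also assumes the two files are aligned by position, so it does not handle subsampled files.

```python
def exact_match_rate(original, treated, ivs):
    """Percentage of treated records linked uniquely and correctly on the IVs."""
    # Index original records by their combination of identifying variables.
    positions_by_key = {}
    for pos, rec in enumerate(original):
        key = tuple(rec[v] for v in ivs)
        positions_by_key.setdefault(key, []).append(pos)

    hits = 0
    for pos, rec in enumerate(treated):
        key = tuple(rec[v] for v in ivs)
        # A hit requires a unique candidate that is the true source record.
        if positions_by_key.get(key) == [pos]:
            hits += 1
    return 100.0 * hits / len(treated)

# Hypothetical toy files with two identifying variables.
original = [
    {"sex": "F", "age": 1},
    {"sex": "M", "age": 1},
    {"sex": "F", "age": 2},
    {"sex": "F", "age": 2},
]
treated = [dict(r) for r in original]
treated[0]["sex"] = "M"  # one record perturbed

print(exact_match_rate(original, original, ["sex", "age"]))  # untreated file
print(exact_match_rate(original, treated, ["sex", "age"]))   # treated file
```

In the toy data, perturbing a single unique record halves the rate, which is the kind of reduction the risk assessment measures.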

Results - Risk Assessment
[Figure: Matching rates (%) from simulations (n=100) with respect to all IVs, 10% treatment rate. Bars compare MASSC, SWAP, and PRAM for three match types: exact match, probability match, and distance match. Vertical axis: matching rate (%).]

Results - Utility Assessment
Estimates comparisons: summary statistics of estimate ratios (n = 340 x 100), 10% treatment rate

            MASSC                  SWAPPING               PRAM
Statistic   Ratio_EST  Ratio_SE    Ratio_EST  Ratio_SE    Ratio_EST  Ratio_SE
Max         1.05       1.10        3.67       3.67        13.70      13.11
Min         0.93       0.93        0.94       0.95        0.90       0.88
Mean        1.00       1.02        1.01       1.01        1.07       1.06

where Ratio_EST = p_i / p_i0 and Ratio_SE = SE_i / SE_i0.
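
The summary statistics of such ratios can be computed as in the Python sketch below. It assumes that p_i is an estimate from the treated file and p_i0 the corresponding estimate from the original file (the slide does not spell this out); the function name and the numbers are made up for the example.

```python
def ratio_summary(treated_values, original_values):
    """Return the max, min, and mean of the treated-to-original ratios."""
    ratios = [t / o for t, o in zip(treated_values, original_values)]
    return {
        "max": max(ratios),
        "min": min(ratios),
        "mean": sum(ratios) / len(ratios),
    }

# Hypothetical estimated proportions before and after treatment.
p_original = [0.10, 0.20, 0.40]
p_treated = [0.11, 0.19, 0.40]

summary = ratio_summary(p_treated, p_original)
print({name: round(value, 3) for name, value in summary.items()})
```

The same function applies to standard errors: pass the treated and original SE values to obtain the Ratio_SE summaries. Ratios near 1 indicate that treatment changed the estimate very little.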

Results - Utility Assessment (cont'd)
Regression comparison: change of significance (n = 220 x 100), 10% treatment rate. Entries are averages, with ranges in parentheses.

Method      Sig. to Non-Sig.    Non-Sig. to Sig.
MASSC       3.29 (0-8)          1.99 (0-6)
Swapping    3.02 (0-8)          2.13 (0-8)
PRAM        6.26 (0-15)         3.26 (0-12)
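
A count of significance changes for one simulated file can be sketched in Python as follows. The 0.05 significance level, the function name, and the p-values are assumptions made for the example; the slides do not state the threshold used.

```python
def significance_changes(p_original, p_treated, alpha=0.05):
    """Count parameters whose significance status flips after treatment.

    Returns (sig_to_nonsig, nonsig_to_sig), comparing the p-value of each
    regression parameter in the original and treated fits against alpha.
    """
    pairs = list(zip(p_original, p_treated))
    sig_to_nonsig = sum(1 for po, pt in pairs if po < alpha <= pt)
    nonsig_to_sig = sum(1 for po, pt in pairs if pt < alpha <= po)
    return sig_to_nonsig, nonsig_to_sig

# Hypothetical p-values for four regression parameters.
p_before = [0.01, 0.20, 0.03, 0.60]
p_after = [0.08, 0.04, 0.02, 0.70]

changes = significance_changes(p_before, p_after)
print(changes)
```

Averaging these counts over the simulated files for a method gives figures comparable to the table entries.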

Summary of Simulation Results
All three methods provide a certain degree of confidentiality protection to the data; as the overall treatment rate increases, the matching rates decrease.
With all three methods, data quality decreases as the overall perturbation rate increases.
When random swapping is properly designed, its results are similar to those of MASSC.
PRAM appeared less appealing than the other two methods.

Summary of Simulation Results (cont'd)
MASSC has a strong theoretical foundation and protects data confidentiality and data quality simultaneously.
MASSC tends to provide more opportunities for better disclosure treatment, and the quality of the treated data is preserved on average.
Because MASSC involves a subsampling step, the suppressed records are guaranteed to have no disclosure risk. Thus, this method is better than the others at protecting against inside intrusion.
Because MASSC is interactive, it requires more labor and computer time than the other two methods.

Future Work
Develop or use other risk assessment methods to calculate disclosure risk.
Compare the three methods using other sets of survey data.
Develop other distance functions to be used in SAS.

References
Dalenius and Reiss (1978). Data-swapping: A technique for disclosure control (extended abstract). American Statistical Association, Proceedings of the Section on Survey Research Methods, Washington, DC, 191-194.
Duncan et al. (1993). Private Lives and Public Policies: Confidentiality and Accessibility of Government Statistics. Committee on National Statistics and the Social Science Research Council, National Academy Press, Washington, DC, 23-24.
Singh, A. C. (2002, 2006). Method for statistical disclosure limitation. US Patent Application Pub. No. US 2004/0049517A1. Patent granted June 2006, Patent No. US7058638B2.