A Comparative Assessment of Disclosure Risk and Data Quality between MASSC and Other Statistical Disclosure Limitation Methods

By Feng Yu and Neeraja Sathe
RTI International is a trade name of Research Triangle Institute. www.rti.org

Objective of this paper
Compare three methods of statistical disclosure limitation (SDL):
- MASSC (an RTI product)
- Post Randomization (PRAM), using R
- Random swapping, using SAS

What is disclosure?
Disclosure refers to the inappropriate attribution of information to a data subject, whether an individual or an organization. Disclosure occurs when:
- A data subject is identified on a released file (identity disclosure)
- Sensitive information about a data subject is revealed through the released file (attribute disclosure)
- It becomes possible to determine the value of some characteristic of a data subject more accurately than would otherwise have been possible (inferential disclosure)

Examples of Identifying and Sensitive Data
- Direct identifiers: Social Security numbers, addresses, names, etc.
- Indirect identifiers: a combination of variables such as gender, race, and occupation (e.g., female, Asian, astronaut)
- Sensitive data: substance use, criminal activity, health outcomes, income, etc.

Types of Intrusion
- Outside intrusion occurs when an intruder tries to identify a sample record by matching it to an external database, without prior knowledge of who is in the sample.
- Inside intrusion occurs when an unauthorized person tries to link a record in a microdata file to an identifiable respondent whom the intruder knows to be in the file.

What is statistical disclosure limitation (SDL)?
- SDL comprises techniques applied to released statistical data to minimize or limit the potential for individual identification.
- Before releasing statistical tables or microdata files, federal agencies use a variety of statistical methods to protect their data and to ensure that the risk of disclosure is very small.
- In addition to being ethical and necessary to ensure adequate survey response rates, it is the law: the Confidential Information Protection and Statistical Efficiency Act (CIPSEA).

Current Common SDL Methods
- For tabular data: cell suppression, controlled rounding, synthetic substitution
- For microdata: restrict data dissemination, strip off direct identifiers, topcode or bottomcode sensitive items, collapse categories, random swapping, perturbation, generate synthetic data, etc.
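
Two of the microdata methods listed above, topcoding and collapsing categories, are simple enough to sketch in a few lines. The following is only an illustration: the function names, the income cap, and the occupation mapping are hypothetical and do not come from the presentation.

```python
def topcode(values, cap):
    """Replace every value above `cap` with `cap` itself (topcoding)."""
    return [min(v, cap) for v in values]


def collapse(value, mapping, other="Other"):
    """Map a detailed category to a coarser one; unmapped values become `other`."""
    return mapping.get(value, other)


# Hypothetical example data, chosen only for illustration.
incomes = [32_000, 58_000, 1_250_000, 75_000]
print(topcode(incomes, cap=150_000))  # [32000, 58000, 150000, 75000]

occupation_groups = {"nurse": "Health", "physician": "Health", "teacher": "Education"}
print(collapse("astronaut", occupation_groups))  # Other
```

Both operations trade detail for protection: a rare extreme value or a rare category no longer singles out one record.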

Details on Our Research
- MASSC is an SDL method for treating microdata files, developed at RTI.
- We compare MASSC with two other SDL methods by examining the degree to which each affects data quality and lowers disclosure risk.
- The other methods are random swapping and the Post RAndomisation Method (PRAM).

Method 1: Random Swapping (using SAS)
- Random swapping (Dalenius and Reiss, 1978) is an SDL technique used for categorical variables.
- Data containing sensitive information are swapped so that it is difficult for an intruder to definitively identify any individual.
- Confidentiality is protected by introducing uncertainty about sensitive data values (SVs).
- Consistency checks ensure that only logical swaps are executed.
- Certain statistical inferences are preserved by retaining marginal distributions.
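
A minimal Python sketch of the swapping idea, assuming records are plain dictionaries and that partners are drawn purely at random. The presentation's SAS implementation is not shown, so the function name `random_swap` and its parameters are hypothetical, and the consistency checks mentioned above are omitted.

```python
import random
from collections import Counter


def random_swap(records, var, rate, seed=0):
    """Swap `var` between randomly paired records; roughly `rate` of records are treated.

    Returns new records; the input list is left unchanged.
    """
    rng = random.Random(seed)
    treated = [dict(r) for r in records]
    n_pairs = int(len(treated) * rate) // 2
    chosen = rng.sample(range(len(treated)), 2 * n_pairs)
    # Pair up the chosen positions and exchange the value of `var` within each pair.
    for a, b in zip(chosen[::2], chosen[1::2]):
        treated[a][var], treated[b][var] = treated[b][var], treated[a][var]
    return treated


# Hypothetical example data.
data = [{"id": i, "drug_use": "yes" if i % 4 == 0 else "no"} for i in range(40)]
swapped = random_swap(data, "drug_use", rate=0.10)

# Swapping only permutes values, so the marginal distribution is unchanged.
print(Counter(r["drug_use"] for r in data) == Counter(r["drug_use"] for r in swapped))  # True
```

Because values are only exchanged between records, one-way frequencies of the swapped variable are preserved exactly, while its relationship to the unswapped variables is perturbed.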

Method 2: PRAM (using the R package sdcMicro)
- PRAM (Gouweleeuw et al., 1998) is also an SDL technique for categorical variables.
- It is analogous to adding noise to the values of continuous variables.
- When applied to a categorical variable, PRAM alters each record's value of that variable according to a pre-selected probability mechanism.
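
The probability mechanism can be pictured as a transition matrix: each row gives, for one original category, the probabilities of publishing each possible category. The sketch below is a generic illustration in Python, not the sdcMicro implementation; the function name `pram`, the category labels, and the probabilities are all assumptions.

```python
import random


def pram(values, transition, seed=0):
    """Redraw each value from the row of `transition` that belongs to its current category."""
    rng = random.Random(seed)
    out = []
    for v in values:
        categories = list(transition[v])
        probs = [transition[v][c] for c in categories]
        out.append(rng.choices(categories, weights=probs, k=1)[0])
    return out


# Hypothetical transition matrix: each row sums to 1, with most mass on the diagonal.
P = {
    "never":   {"never": 0.90, "past": 0.05, "current": 0.05},
    "past":    {"never": 0.05, "past": 0.90, "current": 0.05},
    "current": {"never": 0.05, "past": 0.05, "current": 0.90},
}

original = ["never", "never", "past", "current", "never", "past"]
print(pram(original, P))
```

The further the diagonal entries fall below 1, the more records change category, which raises protection and lowers data quality.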

Method 3: MASSC
MASSC (Singh, 2002) consists of the following four major steps:
- Micro Agglomeration: partitions data into risk strata based on a set of selected identifying variables (IVs).
- Substitution: replaces IVs of the randomly selected records with those of substitution donors, subject to a set of bias constraints.
- Subsampling: randomly deletes some of the records from the data, subject to a set of variance constraints.
- Calibration: adjusts weights in the subsample to the original total weights in the full analytic file.
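
The last two steps can be illustrated with a deliberately simplified Python sketch. It assumes simple random deletion and a single ratio adjustment to the overall weight total; the actual MASSC procedure works within risk strata and under bias and variance constraints, none of which are modeled here. The function name and the `weight` field are hypothetical.

```python
import random


def subsample_and_calibrate(records, delete_rate, seed=0):
    """Randomly drop records, then rescale the kept weights to the original weight total."""
    rng = random.Random(seed)
    total = sum(r["weight"] for r in records)
    kept = [dict(r) for r in records if rng.random() >= delete_rate]
    if not kept:
        raise ValueError("subsample is empty; lower delete_rate")
    # One ratio adjustment: a stand-in for calibration to known totals.
    factor = total / sum(r["weight"] for r in kept)
    for r in kept:
        r["weight"] *= factor
    return kept


# Hypothetical example data.
sample = [{"id": i, "weight": 100.0 + i} for i in range(200)]
released = subsample_and_calibrate(sample, delete_rate=0.10)
print(len(released), round(sum(r["weight"] for r in released), 6))
```

After calibration the released file reproduces the original weighted total even though some records are gone, which is the property the Calibration step is described as restoring.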

Comparing MASSC with the Other Two Methods
- We conducted simulations based on a random sample from the combined 2006 and 2007 National Survey on Drug Use and Health (NSDUH) public use files.
- Treatment rates: 10% and 20%
- Simulations: 100 per treatment rate
- For risk assessment, we calculated matching rates: the rates at which a record in the treated sample could be correctly linked to the corresponding record in the population.
- For utility assessment, we compared the effects on estimated means and regression-model parameters.
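
One plausible way to compute an exact-match rate is sketched below in Python. The presentation does not spell out its matching rules, so this definition is an assumption: a treated record counts as matched when its combination of IVs points to exactly one original record and that record is its true counterpart (same list position). The function name is hypothetical.

```python
from collections import defaultdict


def exact_match_rate(original, treated, ivs):
    """Share of treated records uniquely and correctly linked back via their IV values."""
    index = defaultdict(list)
    for i, rec in enumerate(original):
        index[tuple(rec[v] for v in ivs)].append(i)
    hits = 0
    for i, rec in enumerate(treated):
        candidates = index.get(tuple(rec[v] for v in ivs), [])
        if candidates == [i]:  # exactly one candidate, and it is the right one
            hits += 1
    return hits / len(treated)


# Hypothetical example data: record 0 is unique on (sex, race); records 1 and 2 are not.
population = [
    {"sex": "F", "race": "Asian"},
    {"sex": "M", "race": "White"},
    {"sex": "M", "race": "White"},
]
print(exact_match_rate(population, population, ["sex", "race"]))
```

Treating the unique record, for example by swapping or substituting its IVs, breaks the link and lowers this rate.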

Results - Risk Assessment
[Figure: bar chart of matching rates (%) from simulations (n=100) with respect to all IVs at the 10% treatment rate, comparing MASSC, SWAP, and PRAM for exact match, probability match, and distance match.]

Results - Utility Assessment
Comparison of estimates: summary statistics of estimate ratios (n = 340 x 100), 10% treatment rate

Statistic   MASSC                  SWAPPING               PRAM
            Ratio_EST   Ratio_SE   Ratio_EST   Ratio_SE   Ratio_EST   Ratio_SE
Max         1.05        1.10       3.67        3.67       13.70       13.11
Min         0.93        0.93       0.94        0.95       0.90        0.88
Mean        1.00        1.02       1.01        1.01       1.07        1.06

where $\mathrm{Ratio\_EST} = p_i / p_{i0}$ and $\mathrm{Ratio\_SE} = SE_i / SE_{i0}$.
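
The Max, Min, and Mean rows can be produced by a summary like the following Python sketch. It assumes each ratio divides a treated-data quantity by its original-data counterpart, which is how the ratio definitions on this slide appear to read; the function name and the example numbers are hypothetical.

```python
def ratio_summary(treated, original):
    """Max, min, and mean of elementwise ratios treated[i] / original[i]."""
    ratios = [t / o for t, o in zip(treated, original)]
    return {
        "max": max(ratios),
        "min": min(ratios),
        "mean": sum(ratios) / len(ratios),
    }


# Hypothetical estimated proportions for three survey items.
p_original = [0.25, 0.50, 0.125]
p_treated = [0.25, 0.55, 0.120]
print(ratio_summary(p_treated, p_original))
```

A ratio near 1 means the treatment left that estimate, or its standard error, essentially unchanged.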

Results - Utility Assessment (cont'd)
Regression comparison: change of significance (n = 220 x 100), 10% treatment rate. Entries are average (range).

Method     Sig. to Non-Sig.   Non-Sig. to Sig.
MASSC      3.29 (0-8)         1.99 (0-6)
Swapping   3.02 (0-8)         2.13 (0-8)
PRAM       6.26 (0-15)        3.26 (0-12)
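
Counting changes of significance for one simulation run might look like the Python sketch below. The 0.05 threshold is an assumption, since the slide does not state the significance level used, and the function name and p-values are hypothetical.

```python
def significance_changes(p_original, p_treated, alpha=0.05):
    """Count coefficients whose significance status flips after treatment."""
    pairs = list(zip(p_original, p_treated))
    sig_to_non = sum(1 for po, pt in pairs if po < alpha <= pt)
    non_to_sig = sum(1 for po, pt in pairs if pt < alpha <= po)
    return sig_to_non, non_to_sig


# Hypothetical p-values for four regression coefficients.
before = [0.01, 0.20, 0.03, 0.50]
after = [0.08, 0.04, 0.02, 0.60]
print(significance_changes(before, after))  # (1, 1)
```

Averaging these counts over the simulation runs, and recording their minimum and maximum, gives entries of the form "average (range)".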

Summary of Simulation Results
- All three methods provide a certain degree of confidentiality protection to the data; as the overall treatment rate increases, the matching rates decrease.
- With all three methods, data quality decreases as the overall perturbation rate increases.
- When random swapping is properly designed, it performs similarly to MASSC.
- PRAM appeared to be less appealing than the other two methods.

Summary of Simulation Results (cont'd)
- MASSC has a strong theoretical background and provides simultaneous protection of data confidentiality and data quality.
- MASSC tends to provide more opportunities for better disclosure treatment, and the quality of the treated data is preserved on average.
- Since MASSC involves a subsampling step, the suppressed records are guaranteed to have no disclosure risk. Thus, this method is better than the others at protecting against inside intrusion.
- Due to the interactive features of MASSC, it needs more labor and computer time than the other two methods.

Future Work
- Develop or use other risk assessment methods to calculate disclosure risk.
- Compare the three methods using other sets of survey data.
- Develop other distance functions to be used in SAS.

References
- Dalenius and Reiss (1978). Data-swapping: A technique for disclosure control (extended abstract). American Statistical Association, Proceedings of the Section on Survey Research Methods, Washington, DC, 191-194.
- Duncan et al. (1993). Private Lives and Public Policies: Confidentiality and Accessibility of Government Statistics. Committee on National Statistics and the Social Science Research Council, National Academy Press, Washington, DC, 23-24.
- Singh, A. C. (2002, 2006). Method for statistical disclosure limitation. US Patent Application Pub. No. US 2004/0049517A1. Patent granted June 2006, Patent No. US7058638B2.