Private Surveys for the Public Good

Bianca Pham, Emmanuel Genene, Ethan Abramson, and Gastón P. Montemayor Olaizola

Abstract — Protecting the privacy of individuals' data collected while taking a survey has always been a challenging task. To date, survey data holders remove what is referred to as Personally Identifiable Information (PII). Despite these efforts, it has been repeatedly shown to be possible to uniquely re-identify people in the dataset and recover all of their data. The goal of our project is to create the first consumer-facing application that uses a controlled amount of randomness to induce anonymity in online surveys. This means that any individual may vote on a particular poll and his/her identity and associated responses will be probabilistically protected. To accomplish this, we built on the theoretical results associated with Differential Privacy. Differential Privacy is a statistical technique that allows data holders to guarantee the privacy of those in the dataset. It works by adding a calculated amount of random noise to the results of every query on the dataset. When the result of a query depends significantly on a very small subset of individuals in the dataset, the amount of noise added masks their responses, keeping the survey results anonymous. When the result of a query does not depend on a small subset of responses, the added noise becomes negligible, allowing us to learn about the population as a whole. We created a free online application where survey respondents should feel comfortable answering truthfully, which in turn allows survey creators to ask personal and/or sensitive questions to gather important insights.

I. PROBLEM STATEMENT

One of the most prominent problems of today is the need for guaranteed privacy. Some questions that come to an internet user's mind when asked to provide sensitive data include "Is my data really private?", "Can my information be traced back to me?", and "What is this data being used for?". One prominent setting for this concern is when individuals participate in an anonymous survey: they assume that the data is private and cannot be traced back to them. However, privacy is not guaranteed in these situations. Using machine learning techniques, the data can be de-anonymized by cross-referencing it with other data sets. This is a concern for corporations trying to protect their users' privacy, since their reputation can suffer if these concerns are not addressed. At the same time, users become less inclined to provide their data, which in turn limits the way corporations can build new products, form new policies, and improve upon current ideas using this feedback data. Thus, our focus is on data privacy in survey applications, and the research question guiding this work is: How can we better protect people's privacy for the benefit of the public good?

How we came upon this concern of consumer data privacy is in regard to the Netflix Prize. In 2007, Netflix was offering $1,000,000 to whoever could improve their movie recommendation algorithm. More specifically, the Netflix Prize sought to substantially improve the accuracy of predictions about how much someone is going to enjoy a movie based on their movie preferences.
As part of this contest, Netflix publicly released a large dataset containing Netflix subscribers' movie ratings. These movie ratings were considered anonymous by Netflix, as they removed each consumer's name and replaced it with an anonymous identifier. However, two computer scientists demonstrated that removing consumers' names, or personally identifiable information in general, does not do enough for privacy. They showed that cross-referencing public movie ratings from IMDb with Netflix's anonymized movie ratings is enough to identify people [1]. If you know a person's name and a few of their records, then you can identify that person in the other (private) database [2]. Thus, de-anonymization techniques expose the danger that public reviews published on IMDb can identify users within Netflix's private database. More generally, the research demonstrated that information a person believes to be private and anonymous can be used to identify them in other private databases, revealing their beliefs, their interests, and other sensitive data. This is a major concern for users of survey applications who provide data that corporations need.

Existing technology and current solutions in survey applications today involve survey creators enforcing their own privacy policies. Usually, survey creators or companies who provide surveys remove personally identifiable information (zip code, birthday, etc.). However, these current solutions still leave users vulnerable to de-anonymization using only a small amount of external information, as shown by the Netflix Prize example. Differential privacy, however, can help solve this issue and protect users' sensitive information. Differential privacy is essentially a promise from the data holder to the survey participant that their information will be protected. Mathematically, it protects the privacy of users by injecting a controlled amount of randomness into the results. Our goal is to design a consumer application using the concept of differential privacy so that survey creators can obtain accurate insights while survey participants feel their responses are actually being protected. Current applications of differential privacy are designed for complex queries, for extremely large data sets, and for complete pre-existing data sets, and they are not designed to share results publicly.

Our application makes differential privacy more accessible through an easy-to-use website. The website collects and analyzes results in a differentially private way, while keeping track of the amount of privacy used. Our contribution is a free and easy-to-use survey application that allows survey creators to obtain accurate insights while ensuring that survey respondents feel their responses are being kept private. We want individual answers to have no discernible impact on released results and for published results to give no new information about any individual user. The application allows for easy collection and analysis of survey results while maintaining differential privacy.

II. APPROACH

In order to solve this problem of anonymity when collecting data about a group of individuals, we chose to develop a consumer-facing survey web application. In this application, users can create a survey and release it to the public. Individuals can then feel safe responding to the survey, as their privacy is backed by an application of differential privacy.

A. Differential Privacy

As shown in Dwork's 2006 paper on differential privacy, a randomized function f gives ε-differential privacy if, for all data sets D_1 and D_2 differing on at most one element, and for all S ⊆ Range(f),

Pr[f(D_1) ∈ S] ≤ exp(ε) · Pr[f(D_2) ∈ S].

Differential privacy is a property of a mechanism f that is run on some dataset X to produce some output a = f(X) [3]. In our implementation, we focused on the query f that counts the number of votes for a certain response to a survey question. The purpose of applying differential privacy to this query on the dataset X is to protect the anonymity of the respondents. The privacy mechanism works by adding calculated random noise to the data. Specifically, a query is ε-differentially private if the amount of noise added is chosen as a function of the largest change that a single participant could have on the output of the query function [3]. In our case, the largest possible change from a single participant is one vote [4]. In our implementation, we use the Laplace Mechanism to preserve differential privacy [5], whose density is

f(x | μ, b) = (1 / (2b)) · exp(−|x − μ| / b)  [6].

[Figure: sample Laplace distributions]

Using this Laplace Mechanism, we perturb the query result with noise drawn from the Laplace distribution [7].

B. Creating a Survey

[Figure: layout of a single question on the survey creation page]

Whenever a user creates a survey, he/she must do the following:
1) Set an end date and time. This determines the number of times that our service will query the data set and add random noise to the data (see Section II.C, Addition of Noise, for more information). After the survey ends, no more users will be able to submit a vote to that particular survey.
2) Set a privacy value, ε, for each of the questions created. This privacy parameter determines the tradeoff between the accuracy of the results and the privacy of the users who vote on the question.

To ensure transparency between the survey creator and the respondent, the respondent sees the selected accuracy vs. privacy tradeoff for each question. [Figure: accuracy vs. privacy indicator shown to the respondent]

C. Addition of Noise

In our web application, our data is separated into two different tables in our database. The real table holds the actual number of votes for each question of the survey. The differentially private table holds the noisy votes, i.e. the results of the survey with the random noise added.
Whenever a user submits his/her votes to a survey, the database holding the real data will be updated according to what the user has voted for.
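To make the mechanism concrete, the following is a minimal sketch, in JavaScript (the language of our Node.js backend), of how a true vote count could be perturbed with Laplace noise before being written to the differentially private table. The function and variable names here are illustrative only and are not the identifiers used in our actual codebase.

    // Draw one sample from a Laplace(0, scale) distribution via inverse CDF sampling.
    function laplaceNoise(scale) {
      const u = Math.random() - 0.5; // uniform on (-0.5, 0.5)
      return -scale * Math.sign(u) * Math.log(1 - 2 * Math.abs(u));
    }

    // Release an epsilon-differentially private version of a vote count.
    // A single respondent changes a count by at most 1, so the sensitivity is 1
    // and the Laplace noise scale is sensitivity / epsilon = 1 / epsilon.
    function privateCount(trueCount, epsilon) {
      const noisy = trueCount + laplaceNoise(1 / epsilon);
      // Rounding and clamping are post-processing and do not weaken the guarantee.
      return Math.max(0, Math.round(noisy));
    }

    // Hypothetical use at a scheduled release time (table names are assumptions):
    // dpTable[questionId][answerId] = privateCount(realTable[questionId][answerId], epsilon);

The noisy counts produced this way are what the differentially private table stores and what viewers of the results see.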

Four times throughout the lifetime of the survey, our algorithm reads the table with the real data and adds random noise to these results according to the theory presented in Section II.A. These noisy votes are then written to the differentially private table. The purpose of updating the differentially private table only four times is to prevent users from reverse engineering our noise function: reloading the results page multiple times between two of the four update intervals will show the same noisy results.

D. Technology Used

In terms of programming languages, we used Node.js for our server-side JavaScript environment (i.e., our backend) and JavaScript, HTML5, and CSS3 for the frontend of our web app. To speed up development, we used Chart.js, a library built on top of HTML5 and JavaScript, for the pie chart visualizations. Our server is hosted on Amazon's Elastic Beanstalk, which integrates easily with our data storage service (AWS DynamoDB) and automatically hosts our server on an AWS EC2 instance. Amazon Web Services lets us scale our app easily and helps ensure that we offer a secure and reliable software product.

III. MEASUREMENTS

When users fill out the survey, the application produces graphical aggregate results, as pictured in the two figures below. These aggregate results are a key difference between existing survey applications and our differentially private survey application. The first figure shows the results with differential privacy, and the second shows the results without differential privacy (no noise added). In our application, figures without differential privacy are never released to the public. The displayed results show what a difference differential privacy can make.

A. Accuracy of Survey Results

The accuracy of the survey results depends on the number of votes for each answer to the question asked. In the figures above, you can see the difference between the true count (the actual number of votes for a survey response) and the differentially private results that we release to the public. Most users would be concerned about being the single voter who chose "Less than $30,000" and thus might not answer a sensitive question, due to a valid fear of being identified. With the differentially private results (the first figure), however, the added noise is large relative to this individual vote, so a viewer of the results cannot tell how many people actually voted for this answer. This individual can therefore feel safe knowing that his/her answer will be covered by the noise from differential privacy. In other words, for a small number of votes the added noise dominates, protecting the privacy of these individuals at the cost of some inaccuracy: survey responses that could lead to identification are drowned out in the random noise. For a larger number of votes, the noise is small relative to the count, so the released numbers are close to the results without differential privacy; these individuals are already protected by the large number of votes. Overall, the aggregate trends are the same in both figures, but the users' privacy is now protected and corporations can release the results. General trends in survey responses can be seen without identifying individual survey participants.
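As a rough numerical illustration (the ε and counts below are hypothetical, not taken from the figures): suppose a question uses ε = 0.5. A single vote changes a count by at most 1, so the Laplace noise has scale

    b = 1/ε = 2,    standard deviation = √2 · b ≈ 2.8,

and the probability that the added noise exceeds 5 votes in magnitude is

    Pr[ |noise| > 5 ] = exp(−5/b) = exp(−2.5) ≈ 0.08.

A released count is therefore within about ±5 votes of the truth roughly 92% of the time: this masks whether a true count is 0, 1, or a handful of votes, yet changes a true count of 500 by at most about one percent.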

B. Epsilon and Its Effect

The privacy parameter, ε, is a way for survey creators to control the amount of privacy for each question on the survey. This epsilon value determines the amount of noise, or randomness, added to the results of that question. Mathematically, the lower the epsilon, the more randomness is added and the less accurate the results are; the higher the epsilon, the less noise is added and the closer the results are to the actual number of votes. Concretely, the Laplace noise scale for our counting query is b = Δf/ε = 1/ε, so halving ε doubles the typical amount of noise added. This tool allows survey creators to add more noise to questions that are recognized to be more sensitive than others. Normally, many users may decline to answer sensitive questions. But because respondents can see the privacy parameter set for each question, they may feel more comfortable answering a sensitive question knowing their vote will be unidentifiable. This parameter is key not only to allowing corporations and survey makers to ask more sensitive questions and gather important data, but also to protecting users' data by letting them view the privacy parameter for each question. This transparency educates users on how their privacy is being protected and will hopefully encourage more of them to answer the sensitive questions that researchers and organizations need answered.

IV. ETHICAL & PRIVACY CONSIDERATIONS

A. End-to-End Privacy Note

The main focus of our project is to demonstrate the use of differential privacy to provide a formal guarantee of privacy preservation. As such, we omitted the implementation of end-to-end encryption. This means that our system (as of right now) is still vulnerable to some kinds of attacks. However, there is already a wealth of knowledge on how to protect data through encryption, so we instead focus on the novel use of differential privacy to protect the release of private data.

B. The Public Benefits of Private Data

One of the core beliefs that we held at the outset of this project is that there is utility in aggregate information about a population. We are not alone in this belief; open data movements have begun across the nation, and there has been increasing use of private data to improve products and services. We are bridging the gap between differential privacy and the study of administrative data through a project here at the University of Pennsylvania called Actionable Intelligence for Social Policy. The goal of that project is to link the databases of local, state, and national organizations to better understand the intermingled populations they serve, so that these organizations can work better, smarter, and faster to address their needs. Until now, the data has been closely guarded by the individual organizations that hold it. With differential privacy, we can open access to this data by giving a formal guarantee about the privacy of individuals in the dataset.

V. DISCUSSION

Aside from the encryption-related limitations discussed in the previous section, we have created an implementation of differential privacy that offers a novel insight into its use. We have created a system that can be used by social scientists, companies, individuals, and organizations to obtain accurate and honest insights. To further this goal, there are a couple of additions to the platform that we believe would be both useful and natural extensions of our framework.
A. Numerical Responses

One of the most direct additions to this project would be the implementation of numerical survey response inputs. Rather than asking a multiple-choice question about annual salary, the creator could ask the respondent to input their exact salary. When calculating results for such a question, we would first create bins for the responses and then count the number of respondents that fall into each bin.

B. Text-Based Responses

Another interesting type of input to support is text-based responses. To accomplish this, we would first need to make some assumptions about the language structure. Luckily, a vast amount of work has been done to model and understand language. Most of these approaches use statistics and simple queries to model the language (Markov chains, hidden Markov models, n-grams, etc.), so they could be implemented while preserving differential privacy. This method could allow us to generate text from our model of the language that is roughly independent of any individual's response, essentially creating one text response that is representative of the group. When implementing this, we would anticipate two major problems: 1) one would need a large training corpus, and text responses on surveys are often short, consisting of only one or two paragraphs; and 2) the language model might be imperfect, producing grammatically incorrect text or text that loses much of the original meaning of the responses.

ACKNOWLEDGMENT

This work would not have been possible without the advising of Aaron Roth & Andreas Haeberlen. We would also like to thank Dennis P. Culhane for his interest and work on the Actionable Intelligence for Social Policy project.

REFERENCES

[1] A. Narayanan and V. Shmatikov. Robust De-anonymization of Large Datasets (How to Break Anonymity of the Netflix Prize Dataset). In IEEE Symposium on Security and Privacy.
[2] Security Focus. Researchers reverse Netflix anonymization.
[3] C. Dwork. Differential Privacy. In Proceedings of the 5th International Conference on Theory and Applications of Models of Computation.

[4] Math Department, University of Alabama in Huntsville. The Laplace Distribution.
[5] Boost C++ Libraries. Laplace Distribution. 0/libs/math/doc/sf_and_dist/html/math_toolkit/dist/dist_ref/dists/laplace_dist.html.
[6] C. Dwork and A. Roth. The Algorithmic Foundations of Differential Privacy. Foundations and Trends in Theoretical Computer Science, vol. 9.
[7] Computer Science Department, Carnegie Mellon University. A Brief Tour of Differential Privacy.