We recently ran a series of shows on differential privacy on the podcast. It’s an especially salient topic given the rollout of the EU’s General Data Protection Regulation (GDPR), which becomes effective this month, not to mention scandals like the Facebook/Cambridge Analytica breach and other attacks on private data.
If you hadn’t previously (or haven’t yet) heard the term differential privacy, you’re not alone. The field is relatively new, only about ten years old. Differential privacy aims to let data holders make confidential data available for analysis, or for use via a data product, while preserving, and in fact mathematically guaranteeing, the privacy of the individuals whose data is included in the underlying database or data product.
Differential privacy is often introduced in contrast to data anonymization. While anonymization might seem to be a reasonable way to protect the privacy of those data subjects whose information is included in a data product, that information is vulnerable to numerous types of attack.
Consider the frequently cited example of the Netflix Prize. In support of a competition to see whether someone could build a better recommendation engine, Netflix released an anonymized movie-rating dataset to the public. A group of researchers, however, demonstrated a linkage attack that allowed large portions of the data to be de-anonymized by cross-referencing it with publicly available IMDB user data.
But what if we don’t want to publish data, but rather use it to create machine learning models that we allow others to query or incorporate into products? It turns out that machine learning models are vulnerable to privacy leakage as well. Consider, for example, a membership inference attack against a machine learning model. In this kind of attack, patterns in the model’s output are used to infer whether a particular record was part of the data the model was trained on. These attacks, powered by ‘shadow’ machine learning models, have been shown to be effective against black-box models trained in the cloud with the Google Prediction API and Amazon ML.
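The full shadow-model attack is involved, but the underlying leak is easy to demonstrate: an overfit model tends to be more confident on records it was trained on than on records it has never seen. Here’s a minimal, hypothetical sketch of that idea in Python — a simple confidence-threshold baseline of my own, not the shadow-model attack from the paper; the dataset, model, and 0.9 threshold are all illustrative assumptions:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic data standing in for sensitive records (illustrative only)
X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_train, X_out, y_train, y_out = train_test_split(X, y, test_size=0.5, random_state=0)

# A deliberately overfit "target" model, like one an adversary might query via an API
target = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_train, y_train)

def guesses_member(x, threshold=0.9):
    # The attacker guesses "in the training set" when top-class confidence is high
    return target.predict_proba(x.reshape(1, -1)).max() > threshold

in_rate = np.mean([guesses_member(x) for x in X_train])
out_rate = np.mean([guesses_member(x) for x in X_out])
print(f"flagged as members: {in_rate:.2f} of training records vs {out_rate:.2f} of unseen records")
```

The gap between those two rates is the signal a membership inference attack exploits; the shadow-model approach simply learns that decision rule automatically instead of relying on a hand-picked threshold.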
In another example, an attack called model inversion [pdf] was used to extract recognizable training images (i.e., faces) from cloud-based image recognition APIs. Because these APIs return a confidence score alongside the label of a face submitted for recognition, an adversary can systematically construct an input face that maximizes the API’s confidence in a given label, eventually recovering an image that resembles the person the model was trained on.
Differential privacy is an approach that provides mathematically guaranteed privacy bounds; it is not any one specific algorithm. For any given problem, there can be many different algorithms that provide differential privacy.
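For reference, the standard formulation of that guarantee: a randomized algorithm M is ε-differentially private if, for any two datasets D and D′ that differ in a single individual’s record, and for any set S of possible outputs,

Pr[M(D) ∈ S] ≤ exp(ε) · Pr[M(D′) ∈ S].

In other words, no single person’s presence or absence in the data can change the probability of any outcome by more than a factor of exp(ε), so whatever an analyst learns, they would have learned nearly the same thing without that person’s data.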
Aaron Roth provided a great example of a simple differentially private algorithm in our interview. In his example, a polling company wants to collect data about who will vote for Trump in the upcoming election, but is concerned about the privacy of the people it polls. Roth explains that the company could use a simple yet differentially private method of collecting the data. Instead of simply asking for each pollee’s voting intention, the company could instruct individuals to first flip a coin: if the coin comes up heads, answer the question honestly; if it comes up tails, give a random answer decided by a second coin flip.
Because the statistical characteristics of the coin flips are known, you can still make inferences about the wider population even though the collected data has been partially corrupted: half of the responses are honest and the rest are evenly split noise, so the true proportion can be recovered by inverting that known bias. At the same time, this method protects the individuals in your study with plausible deniability. If the data were exposed, there would be no way of knowing whether a given answer was honest or part of the injected noise.
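Here’s a minimal Python sketch of that randomized-response mechanism. The function names and the 40% “true” support rate are my own choices for the simulation, not part of Roth’s example:

```python
import random

def randomized_response(true_answer: bool) -> bool:
    """Report an answer using the coin-flip scheme described above."""
    if random.random() < 0.5:        # first flip is heads: answer honestly
        return true_answer
    return random.random() < 0.5     # tails: answer at random via a second flip

def estimate_support(responses) -> float:
    """Invert the known bias: P(reported yes) = 0.5 * p + 0.25."""
    observed = sum(responses) / len(responses)
    return 2 * observed - 0.5

# Simulate polling 100,000 people, 40% of whom would truly answer "yes"
population = [random.random() < 0.4 for _ in range(100_000)]
responses = [randomized_response(answer) for answer in population]
print(f"estimated support: {estimate_support(responses):.3f}")   # close to 0.40
```

Any individual’s reported answer could always have been produced by the second coin flip, yet the aggregate estimate converges on the real proportion. This particular scheme satisfies ε-differential privacy with ε = ln(3), since a true supporter reports “yes” with probability 0.75 and a non-supporter with probability 0.25.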
Some tech companies are already starting to reap the benefits of differential privacy. For example:
- Apple. Apple uses differentially private methods of capturing user data to gain insights about user behavior on a large scale. It currently uses this approach for applications as diverse as QuickType and Emoji suggestions, Lookup Hints in Notes, identifying crashing and energy-draining domains in Safari, autoplay intent in Safari, and more.
- Google. In addition to using differential privacy to help understand the effectiveness of search query suggestions in its Gboard keyboard, Google, along with other cloud providers, has a huge incentive to explore these methods due to the public nature of many of the machine learning models they offer. Google has published several papers on the topic so far, including RAPPOR: Randomized Aggregatable Privacy-Preserving Ordinal Response and Deep Learning with Differential Privacy.
- Bluecore. Bluecore offers software and services to help marketers find and retain their best customers through targeted email marketing. The company uses differential privacy techniques to pool data across companies to improve customer outcomes while preventing any individual customer from being able to gain any insights into competitors’ data. Be sure to check out my interview with Bluecore director of data science Zahi Karam.
- Uber. Uber uses differential privacy to protect sensitive data against both internal and external privacy risks. When company data analysts explore average trip distances in a city, for example, their queries go through an internal differential privacy system called Chorus, which rewrites them to enforce the privacy guarantee. A simplified sketch of the kind of noisy aggregate query this enables appears below.
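Chorus’s rewriting machinery is Uber’s own, but the basic move behind a differentially private aggregate is standard and easy to sketch: add Laplace noise calibrated to the query’s sensitivity before returning the result. The Python below is my own illustration, not Uber’s implementation; the sensitivity-1 counting query and ε = 0.1 are assumptions made for the sketch:

```python
import random

def laplace_noise(scale: float) -> float:
    """Sample Laplace(0, scale) noise as the difference of two exponentials."""
    return random.expovariate(1.0 / scale) - random.expovariate(1.0 / scale)

def dp_count(rows, predicate, epsilon: float) -> float:
    """Differentially private count: a counting query has sensitivity 1,
    so Laplace noise with scale 1/epsilon gives epsilon-DP."""
    true_count = sum(1 for row in rows if predicate(row))
    return true_count + laplace_noise(1.0 / epsilon)

# Toy example: how many trips in this (fake) table were longer than 5 miles?
trips = [{"city": "SF", "miles": random.uniform(0.5, 15)} for _ in range(10_000)]
noisy = dp_count(trips, lambda t: t["miles"] > 5, epsilon=0.1)
print(f"noisy count of long trips: {noisy:.0f}")
```

Scaling the noise to sensitivity divided by ε is the key step; a system like Chorus automates the sensitivity analysis and query rewriting so analysts don’t have to reason about it by hand.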
Open-source tools for differential privacy are beginning to emerge from both academic and commercial sources. A few examples include:
- Epsilon is a new differential privacy software system offered by Georgian Partners. Epsilon currently works for logistic regression and SVM models. At this time it’s only offered to the firm’s partners and portfolio companies, but the team behind that project plans to continue expanding the tool’s capabilities and availability. For more check out my interview with Georgian’s Chang Liu.
- SQL Elastic Privacy is an open source tool from Uber that can be used in an analytics pipeline to determine the level of privacy required by a given SQL query. This becomes a parameter that allows them to fine-tune their differential privacy algorithm.
- Diffpriv is an R package that aims to make differential privacy easy for data scientists. Diffpriv replaces theoretical sensitivity analysis with sensitivity sampling, helping to automate the creation of privacy assured statistics, models, and other structures.
- ARX is a more comprehensive open-source offering comprising a GUI-based tool and a Java library that implement a variety of approaches to privacy-preserving data analysis, including differential privacy.
As you might imagine, differential privacy continues to be an active research area. According to Roth, hot topics include the use of differential privacy to create and publish synthetic datasets, especially for medical use cases, as well as better understanding the ‘local’ method of differentially private data collection, in which noise is injected at the time of collection (as in the coin-flip example above) rather than after the fact.
Differential privacy isn’t a silver bullet capable of fixing all of our privacy concerns, but it’s an important emerging tool for helping organizations work with and publish sensitive data and data products in a privacy-preserving manner.
I really enjoyed producing this series and learned a ton. I’m eager to hear about what readers and listeners think about it, so please email or tweet over any comments or comment below.