How does LinkedIn allow its data scientists to access aggregate user data for exploratory analytics while maintaining its users’ privacy? That was the question at the heart of our recent conversation with Ryan Rogers, a senior software engineer in data science at the company.
The answer, it turns out, is differential privacy, a topic we’ve covered quite extensively on the show over the years. Differential privacy is a framework for sharing information about a dataset by describing patterns across groups within it, without revealing information about any individual in the dataset.
Ryan currently applies differential privacy at LinkedIn, but he has worked in the field, and on the related topic of federated learning, for quite some time. He was introduced to the subject as a PhD student at the University of Pennsylvania, where he worked closely with Aaron Roth, whom we had the pleasure of interviewing back in 2018.
Ryan later worked at Apple, where he focused on the local model of differential privacy, in which the privacy-preserving noise is applied on each user’s own device before the data is collected for analysis. (Apple uses this, for example, to better understand our favorite emojis 🤯 👍👏).
Not surprisingly, they do things a bit differently at LinkedIn. They utilize a central model, where the user’s actual data is stored in a central database, with differential privacy applied before the data is made available for analysis.
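To make the local versus central distinction concrete, here is a minimal Python sketch for a simple count query. It is an illustration under stated assumptions, not a description of Apple's or LinkedIn's systems: the local side uses classic randomized response, the central side uses the Laplace mechanism, and all function names are hypothetical.

```python
import math
import random


def laplace_noise(scale, rng):
    # The difference of two i.i.d. exponentials with mean `scale`
    # is distributed as Laplace(0, scale).
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def central_count(bits, epsilon, rng):
    """Central model: a trusted curator holds the raw bits and adds noise once."""
    # Changing one user's bit changes the count by at most 1 (sensitivity 1).
    return sum(bits) + laplace_noise(1.0 / epsilon, rng)


def randomize_bit(bit, epsilon, rng):
    """Local model: the device reports its true bit only with probability p."""
    p_truth = math.exp(epsilon) / (math.exp(epsilon) + 1.0)
    return bit if rng.random() < p_truth else 1 - bit


def local_count(bits, epsilon, rng):
    """Local model: the collector only sees randomized bits, then debiases the sum."""
    p = math.exp(epsilon) / (math.exp(epsilon) + 1.0)
    reports = [randomize_bit(b, epsilon, rng) for b in bits]
    n = len(bits)
    # E[sum(reports)] = p * c + (1 - p) * (n - c); solve for the true count c.
    return (sum(reports) - (1.0 - p) * n) / (2.0 * p - 1.0)


if __name__ == "__main__":
    rng = random.Random(7)
    bits = [1] * 3000 + [0] * 7000  # hypothetical data: 3,000 of 10,000 users say "yes"
    print("central estimate:", central_count(bits, 1.0, rng))
    print("local estimate:  ", local_count(bits, 1.0, rng))
```

At the same epsilon, the central estimate is typically much closer to the true count, because noise is added once to the total rather than once per user. The trade-off is that the central model requires trusting whoever holds the raw data.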
(Another interesting use case that Ryan mentioned in the interview: the U.S. Census Bureau has announced plans to publish 2020 census data using differential privacy.)
Ryan recently put together a research paper with his LinkedIn colleague, David Durfee, that they presented as a spotlight talk at NeurIPS in Vancouver. The title of the paper is a bit daunting, but we break it down in the interview. You can check out the paper here: Practical Differentially Private Top-k Selection with Pay-what-you-get Composition.
There are two major components to the paper. First, they wanted to offer practical algorithms that can be layered on top of existing systems to achieve differential privacy for a very common type of query: the “Top-k” query, which answers questions like “what are the top 10 articles that members are engaging with across LinkedIn?” Second, because the privacy guarantee weakens each time a user queries a differentially private system, Ryan’s team developed an innovative way to ensure that their systems accurately account for the information returned to users over the course of a session. It’s called Pay-what-you-get Composition.
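For context on why accounting across queries matters, the sketch below tracks a session's privacy budget under basic sequential composition, the standard baseline in which the epsilons of successive queries simply add up. This is not Pay-what-you-get Composition, whose more refined accounting is described in the paper and not reproduced here; the `PrivacyBudget` class and its method names are hypothetical.

```python
class PrivacyBudget:
    """Tracks cumulative epsilon under basic sequential composition (a baseline only)."""

    def __init__(self, total_epsilon):
        if total_epsilon <= 0:
            raise ValueError("total_epsilon must be positive")
        self.total_epsilon = total_epsilon
        self.spent = 0.0

    @property
    def remaining(self):
        return self.total_epsilon - self.spent

    def charge(self, epsilon):
        """Return True and record the cost if the query fits in the budget, else False."""
        if epsilon <= 0:
            raise ValueError("epsilon must be positive")
        if self.spent + epsilon > self.total_epsilon:
            return False  # answering would exceed the session's total budget
        self.spent += epsilon
        return True


if __name__ == "__main__":
    budget = PrivacyBudget(total_epsilon=1.0)
    for query in ["top articles", "top hashtags", "top skills"]:
        allowed = budget.charge(0.4)
        print(query, "->", "answered" if allowed else "refused", "| remaining:", budget.remaining)
```

Under this baseline every query is charged its full epsilon up front, which is the kind of conservative bookkeeping that tighter composition results aim to improve on.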
One of the big innovations of the paper is discovering the connection between a common algorithm for implementing differential privacy, the exponential mechanism, and Gumbel noise, which is commonly used in machine learning.
One of the really nice connections that we made in our paper was that actually the exponential mechanism can be implemented by adding something called Gumbel noise, rather than Laplace noise. Gumbel noise actually pops up in machine learning. It’s something that you would do to report the category that has the highest weight, [using what is] called the Gumbel Max Noise Trick. It turned out that we could use that with the exponential mechanism to get a differentially private algorithm. […] Typically, to solve top-k, you would use the exponential mechanism k different times; you can now do this in one shot by just adding Gumbel noise to [existing algorithms] and report the k values that are in the top […] which made it a lot more efficient and practical.
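The one-shot idea in the quote can be sketched in a few lines of Python: add independent Gumbel noise to every count, then report the k items with the largest noisy values. This is a minimal illustration, not the paper's implementation. The function names and the `noise_scale` parameter are hypothetical. For a single selection, the standard exponential mechanism with score sensitivity Δ corresponds to a Gumbel scale of 2Δ/ε; how to calibrate the scale and account for privacy in the top-k setting is covered in the paper and is left as an input here.

```python
import math
import random


def gumbel_noise(scale, rng):
    # Inverse-CDF sampling: for U ~ Uniform(0, 1),
    # -scale * log(-log(U)) is distributed as Gumbel(0, scale).
    u = rng.random()
    while u == 0.0:  # random() can return 0.0, and log(0) is undefined
        u = rng.random()
    return -scale * math.log(-math.log(u))


def noisy_top_k(counts, k, noise_scale, rng):
    """Add Gumbel noise to each count once, then return the k largest items in order."""
    noisy = {item: count + gumbel_noise(noise_scale, rng) for item, count in counts.items()}
    return sorted(noisy, key=noisy.get, reverse=True)[:k]


if __name__ == "__main__":
    rng = random.Random(42)
    # Hypothetical engagement counts per article.
    counts = {"article_a": 120, "article_b": 115, "article_c": 40, "article_d": 5}
    print(noisy_top_k(counts, k=2, noise_scale=4.0, rng=rng))
```

Items with close counts (here `article_a` and `article_b`) may swap places from run to run, while items far below the top are very unlikely to appear. That randomness near the boundary is what protects individual contributions.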
When asked what he was most excited about for the future of differential privacy, Ryan cited the progress in open source projects.
This is the future of private data analytics. It’s really important to be transparent with how you’re doing things, otherwise if you’re just touting that you’re private and you’re not revealing what it is, then is it really private?
He pointed out the open-source collaboration between Microsoft and Harvard’s Institute for Quantitative Social Science. The project aims to create an open-source platform that allows researchers to share datasets containing personal information while preserving the privacy of individuals. Ryan expects such efforts to bring more people to the field, encouraging applications of differential privacy that work in practice and at scale.
Listen to the interview with Ryan to get the full scope! And if you want to go deeper into differential privacy, check out our series of interviews on the topic from 2018.
Thanks to LinkedIn for sponsoring today’s show! LinkedIn Engineering solves complex problems at scale to create economic opportunity for every member of the global workforce. AI and ML are integral aspects of almost every product the company builds for its members and customers. LinkedIn’s highly structured dataset gives their data scientists and researchers the ability to conduct applied research to improve member experiences. To learn more about the work of LinkedIn Engineering, please visit engineering.linkedin.com/blog.
Connect with Ryan!
- Paper: Practical Differentially Private Top-k Selection with Pay-what-you-get Composition – [Slides](https://neurips.cc/media/Slides/nips/2019/westexhibitionhallc+b3(12-15-50)-12-16-50-15897-practical_diffe.pdf)
- #132 – Differential Privacy Theory in Practice with Aaron Roth
- TWIML Presents: Differential Privacy
- Deploying Differential Privacy for the 2020 Census of Population and Housing
- LinkedIn on the TWIML AI Podcast
- TWIML Presents: NeurIPS
“More On That Later” by Lee Rosevere, licensed under CC BY 4.0