Implement differential privacy to power up data sharing and cooperation

This is a partial article. If you’d like to read the full post, head on over to our friends at TechCrunch.

Traditionally, companies have relied upon data masking, sometimes called de-identification, to protect data privacy. The basic idea is to remove all personally identifiable information (PII) from each record. However, a number of high-profile incidents have shown that even supposedly de-identified data can compromise consumer privacy.

In 1996, an MIT researcher identified the then-governor of Massachusetts’ health records in a supposedly masked dataset by matching health records with public voter registration data. In 2006, UT Austin researchers re-identified movies watched by thousands of individuals in a supposedly anonymous dataset that Netflix had made public by combining it with data from IMDb.

In a 2022 Nature article, researchers used AI to fingerprint and re-identify more than half of the mobile phone records in a supposedly anonymous dataset. These examples all highlight how “side” information can be leveraged by attackers to re-identify supposedly masked data.

These failures led to differential privacy. Instead of sharing data, companies would share data processing results combined with random noise. The noise level is set so that the output does not tell a would-be attacker anything statistically significant about a target: The same output could have come from a database with the target or from the exact same database but without the target. The shared data processing results do not disclose information about anybody, hence preserving privacy for everybody.
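To make the idea concrete, here is a minimal sketch of a noisy counting query. The article does not name a specific noise mechanism; the Laplace mechanism used here is one common choice, and the function names, the toy records, and the `epsilon` value are all hypothetical. It is an illustration under those assumptions, not a production implementation (naive floating-point noise sampling has known weaknesses).

```python
import random


def laplace_noise(scale, rng):
    # The difference of two i.i.d. exponentials with mean `scale`
    # follows a Laplace distribution centered at 0 with that scale.
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def private_count(records, predicate, epsilon, rng=None):
    """Return a noisy count of the records matching `predicate`."""
    if rng is None:
        rng = random.Random()
    true_count = sum(1 for r in records if predicate(r))
    # Adding or removing one person changes a count by at most 1,
    # so the sensitivity is 1 and the noise scale is 1 / epsilon.
    return true_count + laplace_noise(1.0 / epsilon, rng)


# Hypothetical toy data: the same database with and without the target.
with_target = [{"name": "target", "has_condition": True}] + [
    {"name": f"p{i}", "has_condition": i % 3 == 0} for i in range(99)
]
without_target = with_target[1:]


def has_condition(record):
    return record["has_condition"]


if __name__ == "__main__":
    # The two noisy answers come from overlapping distributions, so a
    # single output does not reveal whether the target was included.
    print(private_count(with_target, has_condition, epsilon=0.5))
    print(private_count(without_target, has_condition, epsilon=0.5))
```

A smaller `epsilon` means more noise and stronger privacy; a larger one means more accurate but more revealing answers.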

Operationalizing differential privacy was a significant challenge in the early days. The first applications were primarily the province of organizations with large data science and engineering teams, such as Apple, Google or Microsoft. As the technology becomes more mature and its cost decreases, how can all organizations with modern data infrastructures leverage differential privacy in real-life applications?

Differential privacy applies to both aggregates and row-level data

When the analyst cannot access the data, it is common to use differential privacy to produce differentially private aggregates. The sensitive data is accessible through an API that only outputs privacy-preserving noisy results. This API may perform aggregations on the whole dataset, from simple SQL queries to complex machine learning training tasks.
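The sketch below shows what such an API could look like in miniature. Everything in it is an assumption made for illustration: the class name `PrivateAggregateAPI`, its methods, the Laplace noise, and the simple privacy budget that is spent query by query are not taken from the article or from any particular product.

```python
import random


class PrivateAggregateAPI:
    """Hypothetical gateway: holds raw rows, returns only noisy aggregates."""

    def __init__(self, rows, total_epsilon, seed=None):
        self._rows = list(rows)          # never returned to the analyst
        self._remaining = total_epsilon  # privacy budget left to spend
        self._rng = random.Random(seed)

    def _spend(self, epsilon):
        # Each query consumes part of the budget; refuse once it runs out.
        if epsilon <= 0:
            raise ValueError("epsilon must be positive")
        if epsilon > self._remaining:
            raise PermissionError("privacy budget exhausted")
        self._remaining -= epsilon

    def _laplace(self, scale):
        # Difference of two i.i.d. exponentials gives Laplace(0, scale).
        rate = 1.0 / scale
        return self._rng.expovariate(rate) - self._rng.expovariate(rate)

    def count(self, predicate, epsilon):
        self._spend(epsilon)
        true_count = sum(1 for row in self._rows if predicate(row))
        return true_count + self._laplace(1.0 / epsilon)

    def clamped_sum(self, column, lower, upper, epsilon):
        if lower >= upper:
            raise ValueError("lower must be less than upper")
        self._spend(epsilon)
        # Clamping bounds how much any single row can move the sum.
        total = sum(min(max(row[column], lower), upper) for row in self._rows)
        sensitivity = max(abs(lower), abs(upper))
        return total + self._laplace(sensitivity / epsilon)


# Hypothetical toy rows standing in for the sensitive dataset.
rows = [{"age": 20 + (7 * i) % 50} for i in range(200)]

if __name__ == "__main__":
    api = PrivateAggregateAPI(rows, total_epsilon=1.0)
    print(api.count(lambda r: r["age"] > 40, epsilon=0.5))
    print(api.clamped_sum("age", lower=0, upper=100, epsilon=0.5))
```

The analyst only ever sees the return values of `count` and `clamped_sum`. Tracking a total budget reflects the fact that every additional noisy answer reveals a little more, so the number of queries has to be limited.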

A typical setup for leveraging personal data with differential privacy guarantees. Image Credits: Sarus

More about the author

Tianhui Michael Li 

2XFounder & @Google DataScientist. Fmr @a16z @NASA. @Princeton PhD. Columnist at @TechCrunch @WSJ @HarvardBiz
