
The False Promise of Anonymization: Why Your Personal Data Might Still Be at Risk

Updated: May 17, 2023

Did you know that an estimated 87% of Americans can be uniquely identified by the combination of their ZIP code, birth date, and sex alone?

Even with anonymization techniques in place, sensitive data can still be vulnerable to re-identification attacks. It is important to be aware of the limitations of anonymization and consider more advanced privacy-preserving methods like Differential Privacy to protect individuals' personal information.

Anonymization removes personal identifiers, such as names, addresses, and Social Security numbers, from a dataset and then manipulates the remaining data to hide individuals' identities. The goal is to protect the privacy of the people whose data is included while still allowing researchers and analysts to use the dataset. However, anonymization techniques have several limitations that can compromise privacy and leave individuals vulnerable to re-identification attacks.

One limitation of anonymization techniques is that they cannot fully protect against linkage attacks. Linkage attacks combine anonymized data with other available data sources to identify individuals. For example, after the Netflix Prize Dataset (1) was released in 2006, researchers showed they could link anonymized movie ratings to individual users by comparing them with public ratings on the Internet Movie Database (IMDb). Similarly, when AOL released an anonymized dataset of its users' search queries in 2006, journalists were able to trace queries back to an individual user from the content of the searches themselves, which included details such as place names and surnames.(2)

Another example of a linkage attack involves the Massachusetts Group Insurance Commission (GIC) healthcare dataset. In the mid-1990s, the GIC released an anonymized dataset of healthcare records for state employees and their dependents. However, Latanya Sweeney, then an MIT graduate student, identified the health records of the governor of Massachusetts, William Weld, by cross-referencing the released dataset with the publicly available voter registration list for Cambridge, matching on ZIP code, birth date, and sex. (3)
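To make the mechanics concrete, a linkage attack is essentially a join on fields that two datasets happen to share. The following is a minimal sketch, not a reconstruction of any real attack: every record, name, and field name in it is invented for illustration, and it assumes the shared fields are the ZIP code, birth date, and sex trio mentioned at the top of this post.

```python
# Toy linkage attack. All records and names below are invented for illustration.

# "Anonymized" release: names removed, but quasi-identifiers left in place.
claims = [
    {"zip": "02138", "birth_date": "1950-03-14", "sex": "M", "diagnosis": "hypertension"},
    {"zip": "02138", "birth_date": "1962-11-02", "sex": "F", "diagnosis": "asthma"},
    {"zip": "02139", "birth_date": "1950-03-14", "sex": "M", "diagnosis": "diabetes"},
]

# Hypothetical public list that carries names alongside the same fields.
public_records = [
    {"name": "Alex Example", "zip": "02138", "birth_date": "1950-03-14", "sex": "M"},
    {"name": "Sam Sample", "zip": "02139", "birth_date": "1971-07-30", "sex": "F"},
]

QUASI_IDENTIFIERS = ("zip", "birth_date", "sex")


def key(record):
    """Build the join key from the shared quasi-identifier fields."""
    return tuple(record[field] for field in QUASI_IDENTIFIERS)


def link(claims, public_records):
    """Return (name, diagnosis) pairs for claims that match exactly one named person."""
    names_by_key = {}
    for person in public_records:
        names_by_key.setdefault(key(person), []).append(person["name"])

    matches = []
    for claim in claims:
        names = names_by_key.get(key(claim), [])
        if len(names) == 1:  # an unambiguous match re-identifies the record
            matches.append((names[0], claim["diagnosis"]))
    return matches


reidentified = link(claims, public_records)
print(reidentified)
```

No name ever appears in the released claims, yet the first record is tied to a named person because its combination of fields is unique in both tables.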

The New York City Taxi Dataset is another example of a dataset that was anonymized but still vulnerable to re-identification. In 2014, the New York City Taxi and Limousine Commission released, in response to a public records request, a dataset of roughly 173 million taxi trips made during 2013 (4). The taxi medallion and driver license numbers had been obscured with an unsalted MD5 hash, which researchers quickly reversed because the set of valid numbers is small enough to enumerate. Others showed that the pickup and drop-off times and locations could be matched against public sources, such as timestamped photos and news articles, to tie specific trips to specific people.

These examples show that removing personal identifiers does not, by itself, make a dataset safe to release. Even when data is intended for research or public policy purposes, publishers need to weigh the potential harms and take steps to mitigate those risks before releasing the data.
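One simple pre-release check is to measure how many records are unique on the fields an attacker could plausibly look up elsewhere. The sketch below is only an illustration under assumptions: the records are invented, the field names (`zip`, `birth_year`, `sex`) are hypothetical, and a low uniqueness score does not by itself prove a dataset is safe.

```python
from collections import Counter


def unique_fraction(records, fields):
    """Fraction of records whose combination of `fields` appears exactly once."""
    keys = [tuple(record[field] for field in fields) for record in records]
    counts = Counter(keys)
    unique = sum(1 for k in keys if counts[k] == 1)
    return unique / len(records)


# Invented toy records; field names are hypothetical.
records = [
    {"zip": "02138", "birth_year": 1950, "sex": "M"},
    {"zip": "02138", "birth_year": 1950, "sex": "M"},
    {"zip": "02138", "birth_year": 1962, "sex": "F"},
    {"zip": "02139", "birth_year": 1950, "sex": "M"},
]

fraction = unique_fraction(records, ("zip", "birth_year", "sex"))
print(f"{fraction:.0%} of records are unique on these fields")
```

Records that are unique on such fields are the ones a linkage attack can single out, so a high fraction is a signal to coarsen or drop fields before release.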

Overall, the limitations of anonymization techniques underscore the need for more advanced privacy-preserving methods. One promising approach is Differential Privacy, which adds carefully calibrated random noise to the results computed from a dataset, protecting individuals' privacy while still allowing for useful analysis. Because it provides a mathematical bound on how much any one person's data can influence what is released, differential privacy can help to mitigate the risks of re-identification and linkage attacks. As we continue to grapple with the challenges of balancing data access and privacy protection, it's clear that more robust privacy-preserving techniques like differential privacy will play an increasingly important role in safeguarding individuals' personal information.
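As a rough illustration of what adding noise means in practice, here is a minimal sketch of the Laplace mechanism applied to a counting query. It is an assumption-laden teaching example rather than a production implementation: the function names, the toy `ages` list, and the choice of `epsilon` are all hypothetical, and real deployments should rely on a vetted differential privacy library, since naive floating-point noise sampling like this has known weaknesses.

```python
import random


def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) noise as the difference of two exponential draws."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def private_count(records, predicate, epsilon, rng=None):
    """Noisy count of records matching `predicate` (Laplace mechanism sketch).

    Adding or removing one person changes a count by at most 1, so the
    sensitivity is 1 and the noise scale is 1 / epsilon.
    """
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    if rng is None:
        rng = random.Random()
    true_count = sum(1 for record in records if predicate(record))
    return true_count + laplace_noise(1.0 / epsilon, rng)


# Invented toy data: ages of eight hypothetical people.
ages = [23, 35, 41, 52, 67, 70, 29, 44]

# Smaller epsilon means more noise and stronger privacy.
noisy = private_count(ages, lambda age: age >= 40, epsilon=0.5)
print(f"Noisy count of people aged 40 or over: {noisy:.2f}")
```

Each call returns a different value scattered around the true count, so no single answer reveals with certainty whether a particular person is in the data.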



