A new study claims that it is quite easy to identify an individual from a database of anonymized information, even when personal details have been removed. The Silicon Valley Masters of the Universe claim that their use and potential sale of user data to advertisers protect privacy because the data is anonymized, a claim shaken by the new study.
The MIT Technology Review reports that the most common way that public agencies and tech companies protect personal user data is through anonymization, a process which strips away any identifiable information such as names, phone numbers, and email addresses. This sort of process is applied to medical records, tax records, and a multitude of other databases. There’s only one problem — it is apparently not as effective as many think.
A new study published in Nature Communications suggests that this data is far from anonymous. Researchers from Imperial College London and the University of Louvain created a machine-learning model that estimates how easy it would be to reidentify individuals from an anonymized data set. The researchers even developed a public tool where users can check their own score by entering their zip code, gender, and date of birth; it can be found here.
Using just those three attributes, researchers found that an individual in the United States could be identified in an “anonymized” database 81 percent of the time. Given 15 demographic attributes of an individual living in Massachusetts, researchers found a 99.98 percent chance of locating that individual in a supposedly anonymized database.
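The intuition behind those figures can be shown with a rough back-of-envelope calculation. The numbers below are approximate public estimates, not figures from the study itself: with only three attributes, the number of possible combinations already dwarfs the US population, so the average combination describes fewer than one person.

```python
# Rough estimate of why ZIP code, gender, and date of birth are nearly
# unique. All figures are approximate public estimates, not from the study.
us_population = 330_000_000   # approximate US population
zip_codes = 42_000            # roughly 42,000 ZIP codes in use
genders = 2
birth_dates = 365 * 100       # possible birthdates across ~100 years of ages

combinations = zip_codes * genders * birth_dates
people_per_combination = us_population / combinations

print(f"Possible attribute combinations: {combinations:,}")
print(f"Average people per combination:  {people_per_combination:.2f}")
```

Under this simplifying uniformity assumption, each combination of the three attributes maps to roughly a tenth of a person on average, which is why most combinations point to at most one individual.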
Yves-Alexandre de Montjoye, a researcher at Imperial College London and one of the study’s authors, commented on the study, stating: “As the information piles up, the chances it isn’t you decrease very quickly.” Researchers developed the tool by assembling 210 different data sets from five sources, including the US Census. This data was fed into a machine-learning model, which learned which combinations of attributes are nearly unique and which are not, assigning each a probability of correct identification.
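The core idea of measuring uniqueness can be sketched with a simple simulation. This is a brute-force illustration with made-up attribute ranges, not the statistical model the researchers actually used: generate a synthetic population, give each person a few demographic attributes, and count what fraction of attribute combinations single out exactly one person.

```python
import random
from collections import Counter

# Simplified sketch of uniqueness estimation: draw a synthetic population
# and measure how many attribute combinations identify exactly one person.
# The attribute ranges below are hypothetical, chosen only for illustration.
random.seed(0)

POPULATION = 100_000

def random_person():
    return (
        random.randrange(500),       # hypothetical ZIP-code bucket
        random.randrange(2),         # gender
        random.randrange(365 * 80),  # date of birth across ~80 years
    )

counts = Counter(random_person() for _ in range(POPULATION))
unique = sum(c for c in counts.values() if c == 1)
print(f"Fraction of people uniquely identified: {unique / POPULATION:.3f}")
```

Even with only three coarse attributes, the vast majority of this synthetic population is uniquely identifiable, which mirrors de Montjoye’s point that the chances a record isn’t you shrink quickly as attributes accumulate.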
de Montjoye explained why this is such an issue, stating: “The issue is that we think when data has been anonymized it’s safe. Organizations and companies tell us it’s safe, and this proves it is not.” Charlie Cabot, the research lead at privacy engineering firm Privitar, argues that companies should adopt differential privacy, a mathematical system that shares aggregate data about user habits between organizations while protecting an individual’s identity.
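The textbook building block of differential privacy is the Laplace mechanism: an aggregate statistic is released with calibrated random noise so that no single individual’s presence in the data can be inferred. The sketch below illustrates that general technique; it is not a description of Privitar’s product.

```python
import math
import random

def laplace_noise(scale, rng):
    # Inverse-CDF sampling of the Laplace distribution.
    u = rng.random() - 0.5
    return -scale * math.copysign(1.0, u) * math.log(1 - 2 * abs(u))

def private_count(true_count, epsilon, rng):
    # A count query has sensitivity 1: adding or removing one person changes
    # it by at most 1, so noise drawn from Laplace(1/epsilon) yields
    # epsilon-differential privacy for the released figure.
    return true_count + laplace_noise(1 / epsilon, rng)

rng = random.Random(42)
true_count = 1_337  # hypothetical: users in one city who visited a site
noisy = private_count(true_count, epsilon=0.5, rng=rng)
print(f"True count: {true_count}, released count: {noisy:.1f}")
```

Because the noise averages out over many records, organizations can still learn accurate aggregate habits while any single released number stays too fuzzy to reveal whether one particular person is in the database.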
Read the full study in Nature Communications here.