Junk Science Week: My expert analysis of the Facebook data of Cambridge Analytica is… it’s useless
After the 2016 U.S. presidential election, a company called Cambridge Analytica announced that its data-driven campaign had been instrumental in Donald Trump’s victory. The front page of the company’s website featured a montage of news clips showing the story of how it had used targeted online marketing and micro-level polling data to influence voters.
Cambridge Analytica (CA) claimed to have collected hundreds of millions of data points about large numbers of U.S. voters. CA also claimed it could use this data to provide a picture of the voters’ personalities that went beyond the traditional demographics of gender, age and income.
It’s a scary thought. Facebook’s data can be used to reveal our preferences. With such data, the candidate in a political campaign might focus on discrediting journalists and news agencies. Tailored messages would be delivered direct to individuals, providing them with propaganda that conformed to their already established world view.
So I decided to find out for myself how an approach to winning elections based on political personalities could work. Before we get carried away with the idea of right-wing politicians tapping into a 100-dimensional representation of America’s voters, we need to think about how accurately the dimensions inside a computer really represent us as people.
Well-designed algorithms work with rankings and probabilities. The Facebook personality model assigns an extrovert/introvert ranking to each user or gives the probability of a user being “single” or “in a relationship.” These models take a range of factors and produce a single number that is proportional to the probability of a particular fact being true about the person.
The most basic method used for converting large numbers of dimensions to a probability, or ranking, is known as regression.
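As a rough illustration of how a regression turns many dimensions into one probability, here is a minimal sketch of logistic regression, one common form of regression. It is not the model used by Cambridge Analytica or in the experiment described here; the weights, the bias and the function name `probability_democrat` are all invented for the example.

```python
import math

# Hypothetical weights, invented for illustration only: positive values push
# the prediction towards "Democrat", negative values towards "Republican".
WEIGHTS = {
    "Harry Potter": 0.8,
    "The Daily Show": 1.2,
    "country music": -0.9,
    "camping": -0.6,
}
BIAS = 0.0  # assumed baseline log-odds when no informative likes are present


def probability_democrat(likes):
    """Add up the weights of a user's likes and squash the sum into (0, 1)."""
    score = BIAS + sum(WEIGHTS.get(like, 0.0) for like in likes)
    return 1.0 / (1.0 + math.exp(-score))


print(round(probability_democrat(["Harry Potter", "The Daily Show"]), 2))  # 0.88
print(round(probability_democrat(["country music", "camping"]), 2))        # 0.18
```

In a real analysis the weights would be fitted to data rather than typed in by hand, and there would be one weight for each of thousands of possible likes.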
Statisticians have used regression models for over a century, with applications starting in biology and expanding to economics, the insurance industry, political science and sociology.
Cambridge Analytica and other modern data analytics companies use more or less the same statistical techniques as were used in the 1980s. The major difference between now and then is the data they have access to. Instead of relying on just age, class and gender to characterize us, it is now possible to feed Facebook “likes,” answers to online poll questions and data on the purchases we make into regression models.
Cambridge Analytica claimed to use these large data sets to establish an overall view of our personality and political standpoint.
In the past, when political scientists studied voters’ party preferences, they typically relied on socio-economic background. Cambridge claimed to “take into account the behavioural conditioning of each individual [voter] to create informed forecasts of future behaviour.”
To do such large-scale regression on our political personalities, Cambridge Analytica needed a lot of data. In 2014, psychologist Alex Kogan, a researcher at Cambridge University, was collecting data for his scientific studies through an online crowd-sourcing marketplace called Mechanical Turk. Kogan had found that it was, at that time, surprisingly easy for researchers to access data on the social network site. Eighty per cent of people volunteering for Kogan’s study provided access to their profile and their friends’ location. On average, each volunteer had 353 friends. With just 857 participants, Kogan and his co-workers gained access to a total of 287,739 people’s data.
As we now know, Kogan ended up using the technique to collect data for Cambridge Analytica. With a questionnaire from 200,000 U.S. citizens, Cambridge ended up with data for over 30 million people. This was a massive dataset that potentially gave a comprehensive picture of the political personality of many Americans.
Alexander Nix, CEO of Cambridge Analytica, talked about how, instead of targeting people on the basis of race, gender or socio-economic background, his company could “predict the personality of every single adult in the United States of America.” Highly neurotic and conscientious voters could be targeted with the message that the “second amendment was an insurance policy.” Traditional, agreeable voters might be told about how “the right to bear arms was important to hand down from father to son.” He claimed that he could use “hundreds and thousands of individual data points on our target audiences to understand exactly which messages are going to appeal to which audiences” and implied that the methods he had described were being used by the Trump campaign.
But when I focused on the details of the models used to predict voting patterns, I felt that one important ingredient was missing: the algorithm. I wanted to work out for myself whether Nix’s big claims could really hold up to scrutiny. I conducted my own Facebook data experiment.
The accuracy of a regression model based on Facebook data is very good. In eight out of nine attempts, the regression correctly identifies the political views of the Facebook user. The main group of likes that identify a Democrat were for Barack and Michelle Obama, National Public Radio, TED Talks, Harry Potter, the I Fucking Love Science webpage and liberal current affairs shows like The Colbert Report and The Daily Show. Republicans like George W. Bush, the Bible, country and western music, and camping. It isn’t too surprising that Democrats like the Obamas and The Colbert Report or that many Republicans like George W. Bush and the Bible.
So I tried to see if I could break the regression model by taking some of the obvious “likes” out of the model and performing a new regression. To my surprise, the model still worked with 85 per cent accuracy, only a slight reduction in performance. Now it used combinations of likes to determine political affiliations. For example, someone who liked Lady Gaga, Starbucks and country music was more likely to be a Republican, but a Lady Gaga fan who also liked Alicia Keys and Harry Potter was more likely to be a Democrat.
This type of information could be very useful to a political party. Instead of Democrats focusing a campaign purely around traditional liberal media, they could focus on getting the vote out among Harry Potter fans. Republicans could target people who drink Starbucks coffee and people who go camping. Lady Gaga fans should be treated with caution by both sides. Although it is difficult to make a direct comparison, the accuracy of a Facebook-based regression model seems to beat traditional methods.
So far so good for Alexander Nix and Cambridge Analytica. But before we get carried away, let’s look a bit more closely at the limitations. First of all, there is a fundamental limitation of regression models. We can’t expect a model to reveal your political views with 100 per cent certainty. There is no way that Cambridge Analytica, or anyone else for that matter, can look at your Facebook data and draw conclusions with guaranteed accuracy.
While regression models work very well for hardcore Democrats and Republicans (as I established earlier, the accuracy is around 85 per cent), predictions about these voters are not particularly useful in a political campaign. Known party supporters’ votes are more or less guaranteed, and they don’t need to be targeted. In fact, the regression model I fitted to Facebook data does not reveal anything about the 76 per cent of people who didn’t register their political allegiance.
While the data shows us that Democrats tend to like Harry Potter, it doesn’t necessarily tell us that other Harry Potter fans like the Democrats. This is the classic problem, inherent to all statistical analyses, of potentially confusing correlation with causation.
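The gap between “Democrats like Harry Potter” and “Harry Potter fans are Democrats” can be made concrete with Bayes’ rule. The sketch below uses made-up like rates, not figures from the experiment, and assumes a simple two-party split. It shows how a like that looks informative among users who declared an allegiance could say far less about users who did not.

```python
def p_democrat_given_like(p_like_dem, p_like_rep, p_dem=0.5):
    """Bayes' rule for a two-party population: P(Democrat | has the like)."""
    numerator = p_like_dem * p_dem
    return numerator / (numerator + p_like_rep * (1.0 - p_dem))


# Hypothetical like rates among users who declared an allegiance.
declared = p_democrat_given_like(p_like_dem=0.30, p_like_rep=0.10)

# Hypothetical like rates among undeclared users, where the two groups
# are assumed to differ much less in their tastes.
undeclared = p_democrat_given_like(p_like_dem=0.21, p_like_rep=0.19)

print(round(declared, 3))    # 0.75
print(round(undeclared, 3))  # 0.525
```

With these assumed numbers, the same like moves the estimate from a coin flip to 75 per cent for declared partisans, but barely above 50 per cent for everyone else.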
When I told Alex Kogan about my findings he started to open up. Kogan had reached similar conclusions. He didn’t believe that Cambridge Analytica, or anyone else, could produce an algorithm that effectively classified people’s personality. He was blunt about Alexander Nix. “Nix is trying to promote (the personality algorithm) because he has a strong financial incentive to tell a story about how Cambridge Analytica have a secret weapon.”
There is an important distinction to be made here between a scientific finding (that a certain set of “likes” on Facebook is related to the outcome of personality tests) and the implementation of a reliable algorithm based on this finding, creating an equation that correctly predicts what type of person you are. A scientific finding can be true and interesting, but unless the relationship is very strong (which it isn’t in the case of personality prediction) it doesn’t allow us to make particularly reliable predictions about an individual’s behaviour.
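To see why a real but weak relationship gives unreliable predictions about individuals, consider a simplified setting: a predicted score and a true trait that follow a bivariate normal distribution with correlation r. That distributional assumption and the values of r below are illustrative choices, not results from the studies discussed here. Under this assumption, the chance that the prediction lands on the correct side of average has a standard closed form.

```python
import math


def same_side_probability(r):
    """P(prediction and true trait fall on the same side of the mean),
    assuming both are bivariate normal with correlation r."""
    return 0.5 + math.asin(r) / math.pi


# Illustrative correlations only; 0.0 corresponds to pure guessing.
for r in (0.0, 0.3, 0.6, 0.9):
    print(r, round(same_side_probability(r), 2))
```

A correlation of 0.3 would be a publishable scientific finding, yet under these assumptions it classifies an individual correctly only about 60 per cent of the time, not far above a coin toss.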
In other words, the science is interesting, but there is no evidence yet that Facebook can determine and target your political personality.
We live in an exciting time, where we can use data to help us make better decisions and keep people informed about the issues that are important to them. But with this power comes the responsibility to carefully explain what we can and can’t do. It seems we have left this important job in the hands of industry consultants who are teaching data scientists how to spin their research to the greatest possible effect.
The Cambridge Analytica story is, in my view, primarily one about hyperbole. It is a story about a company seemingly exaggerating what it can do with data.
While whistleblower Christopher Wylie claimed that he and Alex Kogan had helped CA build a “psychological warfare” tool, the details of the effectiveness of this weapon were not revealed. The lack of a smoking gun squared with my own analysis, and with Alex Kogan’s assessment: Facebook data is not yet sufficiently detailed to support adverts targeted to people’s individual personalities, let alone their political personalities.
Published by Bloomsbury. David Sumpter is professor of applied mathematics at the University of Uppsala, Sweden.