Congress should mandate an unprecedented corporate data-sharing program to enable outside, independent researchers to conduct the kinds of analysis on social media platforms that firm insiders routinely perform.
This article was originally published at Brookings.
We appear to have reached an inflection point when it comes to concern about the harms of social media and the willingness of governments to do something about them. The recent revelations by Facebook whistleblower Frances Haugen have set off alarms around the world concerning everything from Instagram’s effect on teen mental health to Facebook’s responsibility for political violence. The revelations were explosive in their own right, but the reaction to them demonstrates how little outsiders know about what is happening inside these companies. The furor that has followed in the wake of these unprecedented disclosures makes it clear that outsider access to the data held by social media companies represents the critical first step to understanding the effect of these platforms on society and how regulation should be crafted to prevent harm.
When social media platforms first launched nearly two decades ago, they promised to bring people together and give the average person a megaphone to speak to the world. Social media enabled millions, and then billions, of people to connect and form communities online. They allowed masses of people to organize protest movements against the powerful, from the Arab Spring to Occupy Wall Street. For the academic research community, these new platforms—which automatically generated a digital record of their users’ behaviors—promised untold new opportunities to observe human behavior and attitudes, and, combined with technological advances around machine learning and AI, the opportunity for seemingly unlimited advancement of scientific knowledge.
This narrative in the United States shifted about five years ago, however, with growing awareness of the prevalence of mis- and dis-information online, as well as revelations of Russian attempts to use social media to manipulate the 2016 U.S. election and incite social conflict. In the wake of the 2016 election, the media and Congress investigated how Russian actors exploited the affordances of social media platforms—everything from their advertising interfaces to closed groups to normal organic posts—to attack candidate Hillary Clinton in the U.S. presidential election and to propagate divisive messages concerning topics such as immigration, Muslim rights, religion, and gun policy. In addition, profit-oriented groups from Macedonia to California found ways to make money on these platforms by spreading disinformation and division. The pathologies attributed to social media have only multiplied since 2016, as Facebook, Twitter, and YouTube have been blamed for everything from polarization to COVID disinformation to anorexia to genocide.
As the earlier utopian prediction for social media turned decidedly pessimistic, research on these new technologies developed into a field of its own. However, because the platforms tightly controlled the data necessary to study these phenomena, academic researchers were limited in their efforts to get a handle on the scale, character, and causes of the various phenomena attributed to the rise of social media. To generate this broad new literature, researchers turned to surveys, experiments, browser plugins, scraping, and a host of other new methods to try to glimpse from the outside what firm insiders could easily see on the inside.
As difficult as data access was before 2016, the Cambridge Analytica scandal further chilled any platform efforts to make it easy for outside researchers to gain access to individual-level content. That scandal involved a university researcher operating in his individual capacity harvesting “friend” data on Facebook and turning it over to a political consulting firm during the 2016 U.S. election campaign. As a result of the implications of these actions for user privacy, Facebook eventually paid a $5 billion fine to the Federal Trade Commission and shut down some of the APIs (application programming interfaces, or tools for efficiently downloading data directly without having to first render webpages in order to do so) that academics had used for research. For any subsequent data-sharing effort at Facebook or another platform, the Cambridge Analytica scandal looms large and has created a presumption against open researcher access that might be exploited by bad actors or lead to leakage of private user data.
If the 2016 election and the Cambridge Analytica scandal represented a turning point in public concern about social media, the 2021 Haugen revelations seem to have focused legislative attention in the United States and around the world. Haugen has testified before the U.S. Senate, as well as the U.K. and European Parliaments. Legislators in the U.S. and around the world have followed with additional hearings and various proposals to regulate social media algorithms, to remove the platforms’ legal immunity for user-generated content, to break up Facebook and Google, and to regulate advertising. Unfortunately, there remains a real risk that legislation, particularly as it relates to content moderation, will be based on the snippets of data and research found in the recent document disclosures. To fill the void, Congress should mandate an unprecedented corporate data-sharing program to enable outside, independent researchers to conduct the kinds of analysis on social media platforms that firm insiders routinely perform.
The current barriers to social media research
Given the tremendous public interest in understanding social media’s impact on the quality of American democracy, it is important to note that unlike the administrative (e.g., election results, economic indicators) or self-created (e.g., surveys, lab experiments) data that social scientists mined to understand political phenomena in the pre-internet age, some of the most important data related to political behavior is now locked up in a few large internet companies. As a result, there may be more politically relevant data than ever before, but a smaller share of it is now accessible to outside researchers. Researchers have deployed creative methods from the outside, but nothing can substitute for access to the raw data held by the firms themselves.
Researchers have tried for the last decade to get access to data from Facebook, Google/YouTube, and Twitter. Twitter has been the most open of the big three, in part because analogous privacy concerns do not arise on a platform where most tweets are public. Facebook has experimented with several notable data sharing programs. Google/YouTube and TikTok have tended to be the most closed off to independent research. All of these firms will often bring in researchers for one or another project, sometimes even resulting in publication. But when it comes to truly independent research, academics tend to be forced to come up with efforts from the outside.
The system of APIs set up by Twitter over the past decade (most of which were set up for business purposes as opposed to research) led to a flowering of academic research using Twitter data. And to be very clear, Twitter deserves kudos for making so much data available, including specialized collections around Russian Internet Research Agency (IRA) trolls and COVID-19. However, even the data that Twitter makes accessible to outside researchers has left out information that is crucial for academic research, such as data about which users have seen which tweets, known as “exposure data,” or even how many users have seen each tweet. Therefore, understanding which tweets and associated news items reach which classes of people remains an area of inquiry that outsiders to Twitter cannot investigate. Moreover, some data, such as friend and follower networks, have become harder to collect at scale over time as new API access rules are rolled out with new rate limits governing how often researchers can access the API. Nevertheless, because more research has been performed on Twitter than on any other platform, our understanding of the relationship of social media to online harms is highly biased toward what is occurring on that particular platform.
Facebook has had grand ambitions around sharing data. Both authors were involved with Social Science One, an effort to share Facebook data with academics in a secure, privacy-protected way. Privacy concerns derailed that effort from its inception, as Facebook later decided that it could only release a dataset at the “URL level,” meaning no individual-level data would be accessible to academics, only information about exposure and engagement with URLs. Even at high levels of aggregation, Facebook still added noise to the data through methods of differential privacy that made the dataset difficult to use, and serious omissions were later found in that dataset. The dataset still exists as one of the largest ever made accessible to social scientists, but researchers have been slow to use it.
In the wake of the Social Science One difficulties, Facebook has embarked on a different model for a specific research project. To study the platform’s impact on the 2020 election, a team of academics (co-led by one of the authors—Tucker) has worked with Facebook researchers to analyze data related to the 2020 U.S. election. The partnership promises to make possible for the 2020 election the kind of research that was never done with respect to the 2016 election. Facebook has invested a significant amount of money and time from internal research teams in the project. Regardless of the importance and value of this particular effort, the model—while replicable for other focused studies—is by design not something that is scalable to meet the ongoing needs of the larger research community.
In contrast to these efforts, Google/YouTube and TikTok have not embarked on any dedicated data sharing program with academics. This has not stopped academics from attempting to study the workings of these platforms by, for example, seeing how and when links to YouTube videos spread on other platforms or by running experimental studies of what videos are recommended by the YouTube algorithm. YouTube does have an API that provides some information about videos that can be useful for research, although API limits can hinder the ability to conduct such research at scale. Google’s Trends feature also allows the public to get a sense of trends in searches, such as how often people search for a particular candidate’s name, but there is no analogous tool for YouTube.
In the end, independent academic researchers remain reliant on the kindness of platforms to make data available. Even access to supposedly public YouTube or Twitter data can be upended by changes in rules regarding what data is available through APIs, how much data researchers can collect via APIs and how fast they can do so, and whether existing APIs will be shut down (or “deprecated”). The problem, of course, is that platforms may feel that they have little incentive to share data with academic researchers. Merely making the data available could expose them to liability for violating applicable privacy rules. Moreover, independent publications based on such research will, in some instances, put the platforms in a bad light. Therefore, the public good of greater access is usually not seen as outweighing the real legal risk of another Cambridge Analytica or the reputational risk of embarrassing findings.
The path forward for researcher access to platform data
To break through the logjam, we need federal legislation. That legislation, such as the Platform Transparency and Accountability Act proposed by one of the authors (Persily), could come in many forms, but it should have three essential characteristics. First, a federal agency must be empowered to force the large internet platforms to share, with outsiders not selected by the firm, data akin to what firm insiders are able to access. Second, that agency (perhaps working with a nongovernmental organization or another arm of the federal government, such as the National Science Foundation) should vet researchers and research projects that will be given access to platform data. Third, data should reside at the firm, and regulations should specify in detail the process for accessing data and publishing results in a way that would not endanger user privacy.
Apart from these three critical features, there are several different paths that legislation and regulation could take. The relevant enforcement body could be the Federal Trade Commission, given that it has been out front in dealing both with fraud and user privacy, or a wholly new government agency. The researchers could be limited to academics, or they could be expanded to other groups such as journalists or think tanks if those groups could be adequately defined by law. The universe of firms could be limited to the largest social media companies, such as Facebook, Google/YouTube, Twitter, and TikTok. Perhaps it could also be extended to other large technology companies, such as Amazon or Apple, or other critical companies in the internet stack. The type of research enabled by this law could be defined by its purposes (such as politics or mental health) or it could be expanded to all possible scientific questions that could be answered with access to firm data. Finally, penalties both for non-compliant firms and for researchers engaging in malfeasance could be significant. A platform’s immunity under Section 230 of the Communications Decency Act could depend on providing researcher access, for example. Researchers, too, and the universities with which they are affiliated could be subject to extensive fines or criminal punishment if they attempt another Cambridge Analytica.
If outsiders get access to firm data, it will have immediate effects on platform behavior and long-term effects on informing governmental policy. The mere fact that outsiders will have data access will lead the platforms to know that they are being watched. Like any person or institution that knows they cannot operate in secret, the platforms will know that their algorithms and content moderation policies will be perpetually under scrutiny. The resulting research will not only keep them in check but also help inform their interventions and policies going forward. Ideally, the emergence of a more open research ecosystem around platform data will also encourage platforms to share more of their internal research (e.g., the materials described in the Facebook Papers) publicly, as the idea that internal research must be kept private would become less appealing. At the same time, research conducted by employees of the platforms might come to be seen as more credible by journalists, the scientific community, and policy makers in an environment where replication of such research is automatically possible by researchers who are not employees of the platform.
Independent research on platform data is a prerequisite to sound government policy. On the table right now are any number of legislative proposals dealing with privacy, antitrust, child welfare, and amendments to Section 230 of the Communications Decency Act to make the platforms liable for user-generated content. For the most part, legislators are legislating in the dark, with faint light being cast by whistleblowers or well-spun public reports from the firms. Whether the issue is the harm of Instagram use on teen girls’ health or the ubiquity of hate speech and disinformation, the conventional wisdom that has led to promotion of different policy interventions can only be evaluated with access to internal data. Likewise, outside research might also improve platform policies and even the products themselves. Outside researchers will ask questions that those tied to the profit-making mission of the company may not want to ask. But the results could help platforms better understand online harms and develop more targeted policies to address them.
Finally, even apart from knowledge gained about online harms—such as misinformation, hate speech, and online harassment—and platform policies, analysis of platform data is critical to understanding larger social and policy questions, such as the nature of the impact of social media on the quality of democracy or the impact of the platforms on mental health and wellbeing. More and more of the human experience is taking place online. To understand fundamental aspects of the economy, politics, and society requires a better understanding of online behavior. Whether the topic is the effectiveness of COVID-19 interventions or the racial and gender biases of online marketplaces or the changing nature of the news media, platform data represents an ever-growing share of the data necessary to understand social phenomena and craft appropriate public policy responses.
Conclusion: Research access and the transparency agenda
Researcher access is only one component of a larger transparency agenda, and transparency is only one aspect of tech regulation. In addition to researcher access along the privacy-protected pathway described above, the platforms should make more information available to the public. We can envision a tiered system of transparency and data access in which the most sensitive data requires the kind of vetting and security measures described above. But other privacy-protected datasets could be made more widely accessible to outside researchers, including journalists and civil society groups. Finally, the platforms should be pushed to make publicly available tools and APIs, such as Google Trends or CrowdTangle, that will allow anyone to gain insights as to the magnitude of certain online phenomena.
Policy should also facilitate outside research efforts that do not depend on platform compliance. Sometimes the only way to check on the accuracy of platform-provided data is to deeply analyze what is publicly available. Researchers must therefore be protected when they develop “adversarial” methods to analyze platform data. In particular, we need to shield researchers from criminal and civil liability when they scrape publicly available websites.
We have reached a critical moment in the attention paid to digital communication and online harms and in the widespread recognition that answers to the relevant policy questions cannot be assessed without access to platform-controlled data. The current equilibrium is unsustainable. We cannot live in a world where the platforms know everything about us and we know next to nothing about them. We should not need to wait for whistleblowers to whistle before we can begin to understand all that is happening online.
Persily is the James B. McClatchy Professor of Law at Stanford Law School and co-Director of the Stanford Cyber Policy Center. Tucker is Professor of Politics at New York University and the co-Director of NYU’s Center for Social Media and Politics. They are the co-editors of Social Media and Democracy: The State of the Field and Prospects for Reform (Cambridge University Press, 2020).
Tucker, Joshua A., Yannis Theocharis, Margaret E. Roberts, and Pablo Barberá. "From liberation to turmoil: Social media and democracy." Journal of Democracy 28, no. 4 (2017): 46-59.
Reddit, another platform on which posts are almost entirely public, has also been largely open to academic analysis thanks to the work of Jason Baumgartner and the pushshift.io website he set up to share Reddit data.
The “sometimes” nature of publications highlights another problem revealed by the Facebook Papers, which is that most research conducted internally by the platforms will only make its way into the public domain if the platforms choose to release the research publicly. In academia, this is known as the “file drawer” problem, where less interesting (and often null) results fail to be published, and as a consequence the overall accumulation of knowledge is biased (see Franco, Annie, Neil Malhotra, and Gabor Simonovits. "Publication bias in the social sciences: Unlocking the file drawer." Science 345, no. 6203 (2014): 1502-1505). When we consider this from the perspective of for-profit corporations, the net result can be even more pernicious: the overall accumulation of knowledge would likely be biased in the direction of research that puts the platforms in a better light. However, knowing the potential for such biases to exist should lead outside observers to discount such research accordingly, making knowledge accumulation that much more difficult. The exception here, and a possible path forward, is the use of mechanisms by which the platforms can bind themselves a priori to share research ex post; we discuss one such mechanism below.
Tucker, Joshua Aaron, Andrew Guess, Pablo Barberá, Cristian Vaccari, Alexandra Siegel, Sergey Sanovich, Denis Stukal, and Brendan Nyhan, “Social Media, Political Polarization, and Political Disinformation: A Review of the Scientific Literature” (March 19, 2018). Available at SSRN: https://ssrn.com/abstract=3144139.
As of the time of this writing, Facebook has begun testing a new Research API with select groups of researchers that promises to make data available from “four buckets of real-time Facebook data: pages, groups, events and posts” although as of now the data is limited to public posts and to posts from the U.S. and EU (https://techcrunch.com/2021/11/15/facebooks-researcher-api-meta-academic-research/). As the API has not officially been launched yet, we hold off on any further commentary for now.
The idea of the government vetting researchers for access to sensitive administrative data is of course not new; one example is the process known as “Special Sworn Status” by which researchers can be certified to work with certain data at the U.S. Census Bureau that is not available to the general public.
One of the reasons the Platform Transparency and Accountability Act is limited to academic researchers is that a university is an easily defined entity. They also have Institutional Review Boards (IRBs) that evaluate the impact of proposed research on human subjects. IRB approval is a necessary predicate for applying to get access to firm data under that proposal.
Braghieri, Luca and Levy, Roee and Makarin, Alexey, Social Media and Mental Health (August 12, 2021). Available at SSRN: https://ssrn.com/abstract=3919760