The Case for Open Data Access to Aid Tech Regulation
To really understand the potential risks and harms of social media, platforms and policymakers need to ensure accessible pathways for empirical research.
This article was originally published in Brookings.
This year, Congress passed a law forcing ByteDance, TikTok’s parent company, to sell the app within a year or face a potential shutdown. However, there is speculation that the incoming Trump administration could reverse the ban, even though the first Trump administration originally raised the idea of a ban, and his potential cabinet picks remain divided on the decision.
TikTok is now one of America’s most popular social media platforms, used by one-third of U.S. adults. Moreover, both the Harris and Trump campaigns utilized TikTok during the 2024 presidential election campaign. Proponents of the law argue that the platform poses real threats to U.S. national security and consumer privacy. National security concerns stem from the unclear relationship between the Chinese government and private tech companies, such as TikTok, which collects valuable troves of data. For example, in 2018, a top Chinese media regulator forced ByteDance to shut down Neihan Duanzi, its app for producing and exchanging jokes, over the “vulgar content” shared on the platform. This instance of China’s involvement in ByteDance raises anxieties among U.S. lawmakers when compounded by reports of the Chinese government’s history of intimidation and intervention in private companies’ speech, both within and outside China. Additionally, consumer privacy concerns include allegations that TikTok allowed private data about American users to be stored and accessed in China, whether ByteDance employees can still access U.S. users’ data, and fears that the Chinese government could manipulate TikTok’s algorithm to push certain types of information—such as foreign propaganda or misinformation—to unwitting American users.
In response to these concerns, TikTok launched Project Texas, a national security agreement to address risks identified by the U.S. government. Under Project Texas, TikTok created a U.S. subsidiary, TikTok U.S. Data Security Inc. (USDS), wherein Oracle—a U.S. technology company—manages the data of U.S. citizens and moderates content on the platform. Furthermore, Project Texas ensures only U.S. citizens or green card holders are employed at USDS, while also providing the government with the ability to conduct background checks on any potential USDS employee. Although this initiative aims to address national security fears, these measures have yet to satisfy lawmakers.
However, despite these considerable concerns, almost no rigorous research exists to inform these debates or provide insight into the effects that the platform may be having on the information diets of American users. The lack of evidence has led to general, speculative—though not necessarily unfounded—fears of Chinese influence over how Americans consume their information rather than specific, evidence-based critiques. With this in mind, USDS should extend data access rights to researchers to better inform debates around TikTok’s potential impact on national security or mental health with empirical evidence.
For over a decade, academic researchers have used social media data to investigate and understand the impact that social media platforms have on society. For example, at NYU’s Center for Social Media and Politics (CSMaP), social media data access helped us produce empirical research studying the reach of foreign influence campaigns, how information spreads online, and other policy-relevant topics. Nonetheless, platforms have reduced researchers’ access to this data over the past few years, leaving fewer avenues for academic researchers to study the social media landscape as a whole and to share large-scale empirical research with policymakers.
In this article, we discuss three potential data access pathways researchers could use to study TikTok in the U.S.—the company’s Researcher API, individual users’ data “takeouts,” and online scraping—highlighting the challenges of each approach. Taken together, this analysis aims to emphasize current challenges for researchers studying social platforms, how a lack of research can lead to under-informed policy debates, and what can be done to provide better, open data access for researchers to answer pressing questions about platforms like TikTok and for policymakers pursuing new legislation.
Direct access through the TikTok Researcher API
Let’s consider a specific national security concern around TikTok—that foreign actors might create or amplify political messaging to influence American views or sow doubt in U.S. elections. To bring empirical evidence to bear on this concern, academic research teams want to be able to directly measure the presence of “influence” on the platform, whether that comes in the form of content—as was the case with Russia’s now-well-known Internet Research Agency troll campaign on Twitter in the 2016 U.S. election—or through algorithmic behavior, as has been suspected on TikTok. However, the options for researchers to study the topic are, of course, a function of the data to which they have access.
The first and most direct pathway to access TikTok data is from the platform itself. At present, TikTok has established a TikTok Researcher API (Application Programming Interface) program where academic researchers can apply for access to datasets with a sample of anonymized profiles, content, and engagement metrics. In trying to investigate phenomena like foreign influence across the entire digital ecosystem, researchers must contend with the current limitations of the API: whether it provides a large enough sample of data (in most cases it does not); the fact that it doesn’t differentiate between all content on the platform and the most popular, or viewed, content on the platform; and the reality that the access structure to analyze this data is misaligned with the typical academic research process.
First and foremost, the current API provides only a snapshot of the platform’s data, with a limit of 1,000 requests per day. This rate limit means that only a small portion of the content on TikTok is available for analysis. For example, a niche topic such as an emerging election conspiracy theory might not even surface in a scholar’s API collection given this relatively limited snapshot of the data. While a researcher can query based on a hashtag, a user, or a description, this requires that they already have a strong sense of what they are searching for in the dataset, again making the discovery of new topics difficult. It is also unclear if the instances shown are the most popular posts on that given topic or if the API uses another method of ranking and delivering content to researchers. While the current TikTok API allows some access to data, it is far less robust than previous APIs, like the Twitter Firehose, which included every published Tweet. Due to these constraints of the current TikTok API, researchers are limited in the scope of data available to analyze and the breadth of possible research questions to investigate. Additionally, preliminary assessments of the API found significant deviations between the data provided by the API and the data shown on the app itself, casting doubt on the quality and completeness of the provided data. Further, a newly published audit of the TikTok Researcher API found “significant discrepancies” between the data the API returned and what was actually on the platform, further highlighting the importance of transparency and multiple pathways for accessing data.
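To make the rate-limit constraint concrete, the sketch below models a day's collection budget under the 1,000-requests-per-day cap described above. The query-payload shape loosely follows TikTok's public Researcher API documentation, but the exact field names and the assumed page size of 100 videos per request are illustrative assumptions, not a verified client.

```python
from dataclasses import dataclass

# Assumed constants for illustration; the daily cap matches the limit
# described above, the page size is an assumption.
DAILY_REQUEST_CAP = 1_000
VIDEOS_PER_REQUEST = 100

def build_hashtag_query(hashtag: str, start_date: str, end_date: str) -> dict:
    """Build a query payload in the general shape of a research video-query
    endpoint. Field names are assumptions based on public documentation."""
    return {
        "query": {
            "and": [
                {"operation": "EQ",
                 "field_name": "hashtag_name",
                 "field_values": [hashtag]}
            ]
        },
        "start_date": start_date,  # YYYYMMDD
        "end_date": end_date,
        "max_count": VIDEOS_PER_REQUEST,
    }

@dataclass
class DailyBudget:
    """Track how many API requests remain under the daily cap."""
    cap: int = DAILY_REQUEST_CAP
    used: int = 0

    def can_request(self) -> bool:
        return self.used < self.cap

    def spend(self) -> None:
        if not self.can_request():
            raise RuntimeError("daily request cap exhausted; resume tomorrow")
        self.used += 1

# Under these assumptions, a single day's collection is bounded at
# roughly cap * page size videos, regardless of how broad the question is.
max_videos_per_day = DAILY_REQUEST_CAP * VIDEOS_PER_REQUEST
```

The arithmetic in the last line is the crux: even a fully automated pipeline running at the cap can only see a fixed daily slice of a platform hosting many millions of posts, which is why niche or emerging topics may never surface in a sample.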
In addition to challenges with the data itself, our team found the process of applying for access to the Researcher API to be cumbersome and misaligned with the needs of researchers. From the outset, researchers must apply with a detailed research plan. However, there’s a chicken-and-egg problem here: How can researchers know the details of a research plan without information about the quality, quantity, and nature of the data that is actually available for analysis? This disincentivizes researchers from undergoing lengthy institutional reviews necessary to access data before even ensuring that the datasets ultimately align with the needs of the research project, which invariably evolve over time. In contrast, the former Twitter and Reddit Researcher APIs offered credentialed research teams access to data with fewer upfront stipulations, allowing for more exploratory and interactive research processes. Adding to this misalignment between the research process and data access, TikTok’s current Researcher API terms of service policy requires researchers to delete data at regular intervals and at the immediate conclusion of the research project. This, however, is incongruent with principles of academic research and peer review, where anonymized data should be permanently retained and accessible for replication purposes.
Ultimately, while the TikTok Researcher API has proven valuable in describing specific events or trends, such as content about the war in Gaza, it does not enable large-scale research on long-term phenomena or news ecosystems as a whole. For broad research questions like “How much foreign influence content is on TikTok?”, the Researcher API in its current form falls short.
Secondary access through TikTok data takeouts
An alternative avenue for research is to use data donations. To use this approach, researchers would focus on the individual user and design studies in which they collect individual-level data from participants. Taking the same research focus of studying foreign influence operations on TikTok, one might start by asking, “Are individual users seeing and engaging with this type of content?” as opposed to the previous question of “What does this ecosystem of foreign-influenced content consist of?” To answer these individual-level questions, following Institutional Review Board (IRB) approval, researchers would work with consenting research participants and walk them through the process of requesting a data file of their own activity on TikTok (often referred to as a “takeout file”) from a platform and then sharing it with the researchers for aggregate-level analysis. The process for enabling data donations is not prohibitive for platforms since many already have portability mechanisms embedded into their interfaces as mandated by Article 20 of the European Union’s General Data Protection Regulation (GDPR). Data portability provides users the legal right and technical ability to transfer their data from one digital service to themselves and/or to other digital services. Furthermore, Article 20 is another mechanism that empowers users to take control of what they can do with their data, including the option to donate it to researchers.
While the takeout file from TikTok appears to be rich with data, including a full history of a user’s watched videos and engagement, we have encountered two issues: the completeness of the data and the ability to securely get a large number of these takeouts into the hands of researchers. To start, our research teams and collaborators have found that data sometimes appears to be missing from the data takeout without clarity on when or why this may be occurring. In addition, collecting any number of takeouts at scale poses significant logistical challenges. TikTok’s current data-takeout process requires users to tap through five screens in their settings before being able to request their data file. From there, users must wait a few days for the file to be prepared and delivered to their app. Upon delivery, which occurs without a push notification to the user’s device to remind them of the file, users then have four days to download the file or share it with our team before it expires. While there is no evidence that the incomplete data is intentional on TikTok’s end, it is clear that the takeout system has not been designed to be inclusive of research needs or transparency.
Scaling up this multi-step, multi-day takeout process to capture multiple thousands of takeout files requires research teams to provide extensive user support for even the most technologically savvy participants to capture this file and transfer it to the research team. Anecdotally, we’ve learned that even the most straightforward takeout processes, such as those at other social media platforms including YouTube, pose challenges to users with limited digital literacy skills and can rapidly overwhelm research teams. The current complexity of TikTok’s process—with multiple possible points of failure—presents serious logistical challenges to capturing data from a diverse set of TikTok users, and this is before we even consider the cost of such an approach at scale.
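Once a takeout file does reach a research team, analysis typically begins by aggregating the watch history and checking it for the unexplained gaps described above. The sketch below parses a hypothetical takeout JSON and flags days with no recorded activity; the field names ("Video Browsing History", "Date") are invented for illustration, since TikTok's actual export schema is not publicly specified.

```python
import json
from collections import Counter
from datetime import date, timedelta

def watch_counts_by_day(takeout_json: str) -> Counter:
    """Count watched videos per calendar day from a hypothetical takeout
    structure: {"Video Browsing History": [{"Date": "2024-06-01 12:00:00"}, ...]}."""
    data = json.loads(takeout_json)
    history = data.get("Video Browsing History", [])
    # Each entry's "Date" begins with an ISO date; truncate to group by day.
    return Counter(entry["Date"][:10] for entry in history)

def missing_days(counts: Counter, start: date, end: date) -> list[str]:
    """List days in [start, end] with zero recorded watches --
    candidates for the unexplained data gaps researchers have observed."""
    gaps = []
    d = start
    while d <= end:
        key = d.isoformat()
        if counts.get(key, 0) == 0:
            gaps.append(key)
        d += timedelta(days=1)
    return gaps
```

A gap report like this cannot distinguish a participant who simply did not open the app from a genuinely incomplete export, which is exactly why the opacity of the takeout format is a research problem in itself.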
Web scraping of TikTok Data
This leaves web scraping as a final option to collect large sample sizes of TikTok data that neither the Researcher API nor the takeout method can achieve. Generally speaking, web scraping is a style of automated data collection that captures data that is rendered on a webpage or an app. One way to think of scraping is as a way to speed up the process of copying the HTML code that renders the content on webpages as opposed to doing this manually by loading pages in a browser one at a time. Scaling this up, a research team could conceivably start to answer some of these ecosystem-wide questions such as what, if any, proportion of content might be created by or promoting the agendas of foreign actors.
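Mechanically, the scraping step described above amounts to fetching a page's HTML and extracting structured fields from it. The sketch below shows only the parsing half, using Python's standard-library HTML parser on an invented page fragment; real TikTok pages render most content via JavaScript, so production scrapers typically require browser automation, and the "/video/" URL pattern here is purely illustrative.

```python
from html.parser import HTMLParser

class VideoLinkExtractor(HTMLParser):
    """Collect hrefs from anchor tags that look like video links.
    The '/video/' URL pattern is an assumption for illustration."""

    def __init__(self):
        super().__init__()
        self.links: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href") or ""
            if "/video/" in href:
                self.links.append(href)

def extract_video_links(html: str) -> list[str]:
    """Return all video-like links found in a page's HTML source."""
    parser = VideoLinkExtractor()
    parser.feed(html)
    return parser.links
```

Even this trivial extractor illustrates scraping's fragility: the moment a platform renames a URL path or restructures its markup, the pipeline silently breaks, which is part of the maintenance burden discussed below.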
Given both the current limitations to accessing data directly from platforms like TikTok and the pressing questions being asked of content on these platforms, web scraping is likely to increase in frequency and scale. However, scraping may pose legal, ethical, and institutional challenges to researchers. Additionally, the technical complexity of designing data collection pipelines—and of maintaining those pipelines over time—presents additional challenges to researchers who seek to scrape data.
How can we achieve meaningful data access for academic research?
Researchers’ experiences trying to access TikTok data to conduct pressing research on concerns of foreign influence operations highlight that current platform and legislative policies have failed to ensure a viable pathway for empirically informing political and policy debates. For example, while Project Texas aims to address national security concerns, the plan has no provision for supporting researcher data access. However, under USDS, high-quality data access through platform-provisioned APIs could be mandated. Moreover, this challenge of platform data access extends beyond just TikTok. Twitter (now X) and Reddit have both recently made it more difficult for external researchers to access data; Meta recently discontinued the popular analytic tool CrowdTangle.
But even more generally, the public interest is best served when independent researchers can design and implement studies that allow policymakers to understand both what is happening on the platforms and its impact. The concerns around threats to public interest research on Twitter or Reddit are mirrored in the lack of open access in Project Texas. Data access pathways must therefore take shape in various ways, through both policy and technical reforms, including government access programs, public-private partnerships, new legislative measures, and streamlined user takeout processes.
First, the government can develop its own access programs for social media data, looking to other successful data-sharing models from the Food and Drug Administration (FDA) and the National Institutes of Health (NIH) for guidance. Passed in 2007, Section 801 of the Food and Drug Administration Amendments Act (FDAAA) required the FDA to ensure companies, universities, and other parties make clinical trial results available to researchers through the public website, ClinicalTrials.gov. During the process, the FDA established standards for data sharing, including mandating that parties submit precise summary data and metadata to ClinicalTrials.gov, disclosing each data element. This standard allowed the FDA to protect clinical trial data’s quality, accuracy, and usability, making it easier for researchers to access and use protected data. NIH’s Biologic Specimen and Data Repository Information Coordinating Center (BioLINCC) is another model for sharing potentially sensitive individual patient data with researchers while protecting subjects from harm. The model’s tightly controlled, tier-based system provides access to data depending on the data requested, the researchers, the research project, and the associated privacy risks. BioLINCC requires researchers to submit “Research Plans” detailing the exact use of specific datasets, data security practices, and commitments. NIH then enforces compliance through a contract and data use agreements. The government could adopt similar mechanisms from the FDA’s database to require that social media companies share certain data that is securely stored and made accessible to researchers. Currently, each platform has its own data-sharing protocols, and a government access program could alleviate challenges for researchers and standardize data usability across platforms.
Additionally, a tiered-access system like BioLINCC would allow independent researchers to access useful data while upholding user privacy, an approach supported by information privacy experts and advocates.
Second, the U.S. government can also turn to other global laws for inspiration, such as Article 40 of the European Union’s Digital Services Act (DSA), to consider a public-private approach to research access through the law’s co-regulatory mechanisms. Article 40 includes a provision granting access to the data of very large online platforms (VLOPs) and search engines (VLOSEs) to vetted researchers contributing to the “detection, identification, and understanding of systemic risks” in the European Union. The law relies on co-regulatory mechanisms: a governance structure in which actions are mostly taken by various stakeholder groups under the oversight of a government body. By adopting similar mechanisms, U.S. policymakers can avoid potential First Amendment concerns around government overreach into how social media companies control speech by engaging academia and industry when publishing guidelines for data access, limiting excessive government involvement, and increasing inclusivity and transparency in the policymaking process. Taking a public-private approach will also help U.S. policymakers establish assessments, codes of conduct, and audits that are informed by a diverse set of expertise, and in turn, social media data could be presented in a format that achieves higher accessibility among non-technical and technical researchers alike.
Third, the government can make data access easier for researchers through new legislative measures. For example, Senator Chris Coons (D-DE) reintroduced the bipartisan Platform Transparency and Accountability Act (PATA) in 2023, which would create “privacy-protected, secure pathways for independent research on data held by large internet companies” facilitated by the National Science Foundation. To enforce compliance, data would be subject to privacy and security measures created by the Federal Trade Commission (FTC). Notably, the law provides a legal “safe harbor” for researchers scraping public social media data, granting researchers protection in the absence of accessible pathways to platform data. While some experts have reservations about PATA concerning free speech and federal agencies’ capacity for rulemaking, the bill—or others like it—should continue to be top-of-mind in policy conversations.
Finally, on the platform side, design decisions can streamline the process for users to request their data takeouts. Currently, the takeout process at TikTok spans multiple days from requesting to exporting the data. Streamlining this process based on principles of uniform data portability for data takeouts or directly sending the requested takeout to users outside the app to avoid file expirations would benefit not only users interested in understanding their digital footprint but also support researchers who wish to study platform behavior at scale.
Developing effective mechanisms to share data with researchers allows the public to better understand the risks and harms associated with social media, including national security concerns over potential Chinese influence on TikTok’s algorithm and questions related to child safety. Democratizing access to social media data, particularly around newer platforms like TikTok, will also allow researchers to conduct rigorous scientific analysis to inform the public, press, and policymakers about the reality of what occurs online—and help policymakers make effective, evidence-based decisions on future technology regulation.