Data Science Methodology
Our experts produce new methodologies to further understand how social media affects politics and democracy. From developing and deploying code, CSMaP researchers create new ways to quantify social media interactions and its effects.
Academic Research
-
Working Paper
Web Scraping for Research: Legal, Ethical, Institutional, and Scientific Considerations
Working Paper, December 2024
Scientists across disciplines often use data from the internet to conduct research, generating valuable insights about human behavior. However, as generative AI relying on massive text corpora becomes increasingly valuable, platforms have greatly restricted access to data through official channels. As a result, researchers will likely engage in more web scraping to collect data, introducing new challenges and concerns for researchers. This paper proposes a comprehensive framework for web scraping in social science research for U.S.-based researchers, examining the legal, ethical, institutional, and scientific factors that researchers should consider when scraping the web. We present an overview of the current regulatory environment impacting when and how researchers can access, collect, store, and share data via scraping. We then provide researchers with recommendations to conduct scraping in a scientifically legitimate and ethical manner. We aim to equip researchers with the relevant information to mitigate risks and maximize the impact of their research amidst this evolving data access landscape.
-
Journal Article
Concept-Guided Chain-of-Thought Prompting for Pairwise Comparison Scoring of Texts with Large Language Models
IEEE International Conference on Big Data, 2024
Existing text scoring methods require a large corpus, struggle with short texts, or require hand-labeled data. We develop a text scoring framework that leverages generative large language models (LLMs) to (1) set texts against the backdrop of information from the near-totality of the web and digitized media, and (2) effectively transform pairwise text comparisons from a reasoning problem to a pattern recognition task. Our approach, concept-guided chain-of-thought (CGCoT), utilizes a chain of researcher-designed prompts with an LLM to generate a concept-specific breakdown for each text, akin to guidance provided to human coders. We then pairwise compare breakdowns using an LLM and aggregate answers into a score using a probability model. We apply this approach to better understand speech reflecting aversion to specific political parties on Twitter, a topic that has commanded increasing interest because of its potential contributions to democratic backsliding. We achieve stronger correlations with human judgments than widely used unsupervised text scoring methods like Wordfish. In a supervised setting, besides a small pilot dataset to develop CGCoT prompts, our measures require no additional hand-labeled data and produce predictions on par with RoBERTa-Large fine-tuned on thousands of hand-labeled tweets. This project showcases the potential of combining human expertise and LLMs for scoring tasks.
Reports & Analysis
-
Analysis
Are Influence Campaigns Trolling Your Social Media Feeds?
Now, there are ways to find out. New data shows that machine learning can identify content created by online political influence operations.
October 13, 2020
News & Commentary
-
Policy
Mosaics of Insight: Auditing TikTok Through Independent Data Access
Even if TikTok is sold to a non-Chinese buyer, the threat of foreign influence will remain. That’s why researchers need independent data access.
February 21, 2025
-
News
2024 Year in Review: Our Research & Impact
A look at our top articles, events, and more from the past year.
December 18, 2024