Data Science Methodology
Our experts develop new methodologies to better understand how social media affects politics and democracy. By developing and deploying code, CSMaP researchers create new ways to quantify social media interactions and their effects.
Academic Research
-
Journal Article
Estimating the Ideology of Political YouTube Videos
Political Analysis, 2024
We present a method for estimating the ideology of political YouTube videos. As online media increasingly influences how people engage with politics, quantifying the ideology of such media becomes increasingly important for research. The subfield of estimating ideology as a latent variable has often focused on traditional actors such as legislators, while more recent work has used social media data to estimate the ideology of ordinary users, political elites, and media sources. We build on this work by developing a method to estimate the ideologies of YouTube videos, an important subset of media, based on their accompanying text metadata. First, we take Reddit posts linking to YouTube videos and use correspondence analysis to place those videos in an ideological space. We then train a text-based model with those estimated ideologies as training labels, enabling us to estimate the ideologies of videos not posted on Reddit. These predicted ideologies are then validated against human labels. Finally, we demonstrate the utility of this method by applying it to the watch histories of survey respondents with self-identified ideologies to evaluate the prevalence of echo chambers on YouTube. Our approach gives video-level scores based only on supplied text metadata, is scalable, and can be easily adjusted to account for changes in the ideological climate. This method could also be generalized to estimate the ideology of other items referenced or posted on Reddit.
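To illustrate the correspondence-analysis step, here is a minimal sketch of how column scores can be extracted from a contingency table — for example, counts of how often each subreddit links to each video. The toy data and the assumption that the first dimension separates ideology are illustrative only; this is a standard correspondence-analysis computation (SVD of standardized residuals), not the paper's exact implementation.

```python
import numpy as np

def correspondence_analysis_scores(counts):
    """First-dimension correspondence analysis scores for the columns
    of a contingency table (e.g., subreddits x videos link counts)."""
    N = counts / counts.sum()                   # correspondence matrix
    r = N.sum(axis=1)                           # row masses (subreddits)
    c = N.sum(axis=0)                           # column masses (videos)
    expected = np.outer(r, c)                   # independence model
    S = (N - expected) / np.sqrt(expected)      # standardized residuals
    U, sv, Vt = np.linalg.svd(S, full_matrices=False)
    # principal coordinates of the columns on the first dimension
    return Vt[0] * sv[0] / np.sqrt(c)

# toy example: two subreddits mostly link to videos 0-1, two others to 2-3
counts = np.array([
    [5, 4, 0, 1],
    [4, 5, 1, 0],
    [0, 1, 5, 4],
    [1, 0, 4, 5],
], dtype=float)
scores = correspondence_analysis_scores(counts)
```

On this toy table, videos 0 and 1 land on one side of the first dimension and videos 2 and 3 on the other, mirroring how videos shared in ideologically distinct communities would separate; the sign of the axis is arbitrary, as usual with SVD-based methods.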
-
Working Paper
Concept-Guided Chain-of-Thought Prompting for Pairwise Comparison Scaling of Texts with Large Language Models
Working Paper, October 2023
Existing text scaling methods often require a large corpus, struggle with short texts, or require labeled data. We develop a text scaling method that leverages the pattern recognition capabilities of generative large language models (LLMs). Specifically, we propose concept-guided chain-of-thought (CGCoT), which uses prompts designed to summarize ideas and identify target parties in texts to generate concept-specific breakdowns, in many ways similar to guidance for human coder content analysis. CGCoT effectively shifts pairwise text comparisons from a reasoning problem to a pattern recognition problem. We then pairwise compare concept-specific breakdowns using an LLM. We use the results of these pairwise comparisons to estimate a scale using the Bradley-Terry model. We use this approach to scale affective speech on Twitter. Our measures correlate more strongly with human judgments than alternative approaches like Wordfish. Besides a small set of pilot data to develop the CGCoT prompts, our measures require no additional labeled data and produce binary predictions comparable to a RoBERTa-Large model fine-tuned on thousands of human-labeled tweets. We demonstrate how combining substantive knowledge with LLMs can create state-of-the-art measures of abstract concepts.
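To make the final scaling step concrete, here is a minimal sketch of fitting a Bradley-Terry model to a matrix of pairwise comparison outcomes, such as LLM judgments of which of two texts is more affectively charged. The MM (Zermelo) update used below is a standard way to fit Bradley-Terry; it is not necessarily the authors' exact estimator, and the win counts are invented for illustration.

```python
import numpy as np

def bradley_terry(wins, iters=200):
    """Estimate Bradley-Terry scores from a pairwise win-count matrix.
    wins[i, j] = number of comparisons in which item i beat item j."""
    n = wins.shape[0]
    p = np.ones(n)                                # initial scores
    games = wins + wins.T                         # comparisons per pair
    w = wins.sum(axis=1)                          # total wins per item
    for _ in range(iters):
        # MM (Zermelo) update: p_i = w_i / sum_j games_ij / (p_i + p_j)
        denom = (games / (p[:, None] + p[None, :] + 1e-12)).sum(axis=1)
        p = w / denom
        p /= p.sum()                              # fix the arbitrary scale
    return p

# toy example: item 0 usually beats item 1, which usually beats item 2
wins = np.array([
    [0, 8, 9],
    [2, 0, 7],
    [1, 3, 0],
], dtype=float)
scores = bradley_terry(wins)
```

The recovered scores order the items by how often they win comparisons; in the paper's setting, each "item" would be a concept-specific breakdown of a tweet, and the fitted scores form the affect scale.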
Reports & Analysis
-
Analysis
Are Influence Campaigns Trolling Your Social Media Feeds?
Now, there are ways to find out. New data shows that machine learning can identify content created by online political influence operations.
October 13, 2020
News & Commentary
-
Policy
Beyond Competition: Designing Data Portability to Support Research on the Digital Information Environment
Although portability is often considered through a competition lens, policymakers and companies should understand its potential impact on policy-relevant research efforts and ensure that portability can support research on the impacts of digital platforms and services.
February 26, 2024
-
News
2023 Year in Review: Our Research & Impact
A look at our top articles, events, and more from the past year.
December 18, 2023