Research

CSMaP is a leading academic research institute studying the ever-shifting online environment at scale. We publish peer-reviewed research in top academic journals, produce rigorous reports and analyses on policy-relevant topics, and develop open-source tools and methods to support the broader scholarly community.

Academic Research

  • Working Paper

    Web Scraping for Research: Legal, Ethical, Institutional, and Scientific Considerations

    Working Paper, December 2024

    Date Posted: Dec 19, 2024

  • Journal Article

    Concept-Guided Chain-of-Thought Prompting for Pairwise Comparison Scoring of Texts with Large Language Models

    IEEE International Conference on Big Data, 2024

    Existing text scoring methods require a large corpus, struggle with short texts, or require hand-labeled data. We develop a text scoring framework that leverages generative large language models (LLMs) to (1) set texts against the backdrop of information from the near-totality of the web and digitized media, and (2) effectively transform pairwise text comparisons from a reasoning problem to a pattern recognition task. Our approach, concept-guided chain-of-thought (CGCoT), utilizes a chain of researcher-designed prompts with an LLM to generate a concept-specific breakdown for each text, akin to guidance provided to human coders. We then pairwise compare breakdowns using an LLM and aggregate answers into a score using a probability model. We apply this approach to better understand speech reflecting aversion to specific political parties on Twitter, a topic that has commanded increasing interest because of its potential contributions to democratic backsliding. We achieve stronger correlations with human judgments than widely used unsupervised text scoring methods like Wordfish. In a supervised setting, besides a small pilot dataset to develop CGCoT prompts, our measures require no additional hand-labeled data and produce predictions on par with RoBERTa-Large fine-tuned on thousands of hand-labeled tweets. This project showcases the potential of combining human expertise and LLMs for scoring tasks.

    Date Posted: Dec 15, 2024
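
The abstract above sketches a three-step pipeline: generate a concept-guided breakdown for each text, compare breakdowns pairwise with an LLM, and aggregate the outcomes into scores with a probability model. The Python sketch below shows one way those steps could fit together; it is illustrative, not the paper's implementation. The prompts are placeholders, ask_llm stands in for whatever chat-completion call is available, and a Bradley-Terry model with a small smoothing prior stands in for the unspecified probability model.

    # Illustrative sketch of a CGCoT-style scoring pipeline.
    # All prompts are placeholders; `ask_llm` wraps any LLM chat call.
    from itertools import combinations
    from typing import Callable, Sequence

    def breakdown(text: str, ask_llm: Callable[[str], str]) -> str:
        """Step 1: a chain of concept-guided prompts yields a breakdown of the text."""
        prompts = [
            f"Describe the attitude this tweet expresses toward a political party:\n{text}",
            "Does the language express aversion (hostility, contempt) toward that party? Explain briefly.",
        ]
        notes: list[str] = []
        for p in prompts:
            # Feed the previous answer back in so the prompts form a chain.
            notes.append(ask_llm(p if not notes else f"{p}\n\nContext:\n{notes[-1]}"))
        return "\n".join(notes)

    def compare(a: str, b: str, ask_llm: Callable[[str], str]) -> int:
        """Step 2: pairwise comparison of two breakdowns; 1 if A wins, else 0."""
        verdict = ask_llm(
            "Which breakdown reflects stronger partisan aversion? Answer 'A' or 'B'.\n"
            f"A:\n{a}\n\nB:\n{b}"
        )
        return 1 if verdict.strip().upper().startswith("A") else 0

    def bradley_terry(wins: list[list[int]], iters: int = 200) -> list[float]:
        """Step 3: aggregate pairwise outcomes into latent scores.

        Uses the standard minorization-maximization updates for the
        Bradley-Terry model; the 0.5 prior is a smoothing assumption that
        keeps scores positive when an item wins or loses every comparison.
        """
        n = len(wins)
        scores = [1.0] * n
        for _ in range(iters):
            for i in range(n):
                w = 0.5 + sum(wins[i][j] for j in range(n) if j != i)
                denom = sum(
                    (wins[i][j] + wins[j][i]) / (scores[i] + scores[j])
                    for j in range(n) if j != i
                )
                if denom > 0:
                    scores[i] = w / denom
            total = sum(scores)
            scores = [s / total for s in scores]  # normalize for identifiability
        return scores

    def cgcot_scores(texts: Sequence[str], ask_llm: Callable[[str], str]) -> list[float]:
        """Score texts: breakdowns, one round-robin of comparisons, then aggregation."""
        bds = [breakdown(t, ask_llm) for t in texts]
        wins = [[0] * len(texts) for _ in texts]
        for i, j in combinations(range(len(texts)), 2):
            winner, loser = (i, j) if compare(bds[i], bds[j], ask_llm) else (j, i)
            wins[winner][loser] += 1
        return bradley_terry(wins)

A single round-robin costs n(n-1)/2 comparison calls on top of the breakdown calls; in practice one might also present each pair in both orders to average out any position bias in the LLM's answers.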

View all Academic Research

Reports & Analysis

View all Reports & Analysis

Data Collections & Tools

As part of our work to construct comprehensive datasets and empirically test hypotheses about social media and politics, we have developed a suite of open-source tools and modeling processes.