Automated Text Classification of News Articles: A Practical Guide

This guide walks researchers through the consequential choices they must make before computationally classifying texts, at a time when such methodological decisions are easy to ignore.

Abstract

Automated text analysis methods have made possible the classification of large corpora of text by measures such as topic and tone. Here, we provide a guide to help researchers navigate the consequential decisions they need to make before any measure can be produced from the text. We consider, both theoretically and empirically, the effects of such choices using as a running example efforts to measure the tone of New York Times coverage of the economy. We show that two reasonable approaches to corpus selection yield radically different corpora and we advocate for the use of keyword searches rather than predefined subject categories provided by news archives. We demonstrate the benefits of coding using article segments instead of sentences as units of analysis. We show that, given a fixed number of codings, it is better to increase the number of unique documents coded rather than the number of coders for each document. Finally, we find that supervised machine learning algorithms outperform dictionaries on a number of criteria. Overall, we intend this guide to serve as a reminder to analysts that thoughtfulness and human validation are key to text-as-data methods, particularly in an age when it is all too easy to computationally classify texts without attending to the methodological choices therein.

Background

The analysis of text is central to a large and growing number of research questions in the social sciences. The advent of automated text classification methods, combined with the broad reach of digital text archives, has led to an explosion in the extent and scope of textual analysis. In this guide, we provide steps to help researchers navigate the consequential decisions they need to make before any measure can be produced from text.

Study

First, we describe the steps a researcher must take before classifying documents: (1) select a corpus; (2) choose whether to use a dictionary method or a machine learning method to classify each document in the corpus; and (3) decide how to produce the training dataset to code. We then offer empirical evidence illustrating how much these choices matter for our ability to predict tone, using the New York Times’s coverage of the economy as an example. We also provide recommendations on how best to evaluate these choices.
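
To make the first two steps concrete, the following is a minimal Python sketch of keyword-based corpus selection followed by dictionary-based tone scoring. It is an illustration under stated assumptions, not the procedure used in the study: the keyword list, the positive and negative word lists, the scoring formula, and the three sample articles are all hypothetical. A dictionary is shown only because it is the simplest approach to write down; the guide finds that supervised machine learning outperforms dictionaries, and either approach would still need to be validated against human coding.

```python
import re

# Hypothetical keyword list and tone lexicon, invented for illustration only.
ECONOMY_KEYWORDS = {"economy", "economic", "unemployment", "inflation", "recession"}
POSITIVE_WORDS = {"growth", "gain", "gains", "recovery", "strong", "improve"}
NEGATIVE_WORDS = {"loss", "losses", "recession", "weak", "decline", "layoffs"}

TOKEN_RE = re.compile(r"[a-z']+")


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return TOKEN_RE.findall(text.lower())


def select_corpus(articles, keywords):
    """Step 1: keep only articles that contain at least one keyword."""
    return [a for a in articles if keywords & set(tokenize(a))]


def dictionary_tone(text, positive, negative):
    """Step 2 (dictionary variant): score tone as (pos - neg) / (pos + neg).

    Returns 0.0 when the text contains no lexicon words. This formula is
    one common convention, assumed here rather than taken from the guide.
    """
    tokens = tokenize(text)
    pos = sum(t in positive for t in tokens)
    neg = sum(t in negative for t in tokens)
    total = pos + neg
    return 0.0 if total == 0 else (pos - neg) / total


# Invented sample articles; the second one is not about the economy.
articles = [
    "The economy showed strong growth as unemployment fell.",
    "The home team won the championship last night.",
    "Economists fear a recession after weak retail sales and layoffs.",
]

corpus = select_corpus(articles, ECONOMY_KEYWORDS)
scores = [dictionary_tone(a, POSITIVE_WORDS, NEGATIVE_WORDS) for a in corpus]

for article, score in zip(corpus, scores):
    print(f"{score:+.2f}  {article}")
```

Even in this toy version, the selected corpus depends entirely on the keyword list and the scores depend entirely on the lexicon, which is why each of these choices deserves explicit justification and validation.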

Results

We intend this guide to serve as a reminder to analysts that thoughtfulness and human validation are key to methods using text as data, particularly in an age when it is all too easy to computationally classify texts without paying attention to consequential methodological decisions.