State Media Control Influences Large Language Models

Journal Article

State-controlled media can shape large language model outputs through “institutional influence” over how the training data for these models is produced.

Citation

Waight, Hannah, Eddie Yang, Yin Yuan, Solomon Messing, Margaret E. Roberts, Brandon M. Stewart, Joshua A. Tucker. "State media control influences large language models." Nature, 2026. https://doi.org/10.7910/DVN/NECR2K

Date Posted

May 13, 2026

Authors

Hannah Waight,
Eddie Yang,
Yin Yuan,
Sol Messing,
Margaret E. Roberts,
Brandon M. Stewart,
Joshua A. Tucker

Abstract

Millions of people around the world query large language models (LLMs) for information. Although several studies have compellingly documented the persuasive potential of these models, there is limited evidence of who or what influences the models themselves, leading to a flurry of concerns about which companies and governments build and regulate the models. Here we show through six studies that government control of the media across the world already influences the output of LLMs via their training data. We use a cross-national audit to show that LLMs exhibit a stronger pro-government valence when prompted in the languages of countries with lower media freedom than in those with higher media freedom. This result is correlational, so to triangulate the specific mechanism of how state media control can influence LLMs, we develop a multi-part case study on China’s media. We demonstrate that media scripted and curated by the Chinese state appears in LLM training datasets. To evaluate the plausible effect of this inclusion, we use an open-weight model to show that additional pretraining on Chinese state-coordinated media generates more positive answers to prompts about Chinese political institutions and leaders. We link this phenomenon to commercial models through two audit studies demonstrating that prompting models in Chinese generates more positive responses about China’s institutions and leaders than do the same queries in English. The combination of influence and persuasive potential across languages suggests the troubling conclusion that states and powerful institutions have increased strategic incentives to leverage media control in the hopes of shaping LLM output.

Background

Large language models are increasingly used to answer questions, summarize information, and help people make sense of political and social issues. As their role in the information environment grows, so do concerns about who controls these systems and how their outputs are shaped. Much of this concern focuses on direct forms of control, including which companies build LLMs, how governments regulate them, and how developers adjust model behavior after training.

This study examines a more indirect pathway of influence: the training data from which models learn. Because LLMs are built on large collections of online text, their outputs can reflect patterns in the information environments from which that text is drawn. This is especially important in countries where governments exert strong control over media, shaping what is written, circulated, and preserved online. Such influence may become embedded in training data, potentially affecting how models respond to questions about political institutions, leaders, or regimes, especially in the languages of these countries.

Study

The paper uses six connected studies to examine whether state-controlled media can shape large language model outputs. The authors begin with China as an in-depth case because its media environment provides a trackable example of state-coordinated content. They compare Chinese state-coordinated media with an open-source multilingual training dataset derived from Common Crawl to identify whether government-shaped writing appears in material commonly used to train LLMs. They also test whether commercial models can reproduce distinctive phrases from state-coordinated media, using memorization as an indirect way to assess whether this content appeared in model training.

To evaluate a plausible mechanism by which this kind of training data could affect model responses, the authors conduct additional pretraining experiments using an open-weight model. They train the model on different sets of Chinese-language documents, including state-scripted news, other state-controlled news, and general Chinese-language web text, then compare how the model answers questions about Chinese political leaders, institutions, and political systems. The study then audits commercial LLMs by asking the same political questions in Chinese and English, including prompts based on real user queries, to test whether Chinese-language prompts produce more favorable responses about China. Finally, the authors extend the analysis to 37 countries by comparing model responses in English with responses in each country’s official language, examining whether lower media freedom is associated with more pro-government responses in that country’s language.

Results

The study finds evidence that Chinese state-coordinated media appears in open-source LLM training data and can shape model responses. In the open-source CulturaX training dataset, more than 3.1 million Chinese-language documents matched either scripted news articles or articles from Xuexi Qiangguo, with an overall match rate of 1.64%. The match rate was much higher for documents mentioning politically sensitive topics, including Chinese political leaders and institutions. Commercial models also reproduced distinctive phrases from Chinese state-coordinated media, suggesting that they were likely exposed to similar material during training.
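
A match rate of this kind can be illustrated with a small sketch. The summary does not describe the authors' matching procedure, so the approach below (character n-gram containment with a fixed threshold) is an assumption chosen for illustration, and the function names, the threshold of 0.8, and the toy documents are all hypothetical.

```python
def char_ngrams(text, n=5):
    """Return the set of overlapping character n-grams, ignoring whitespace."""
    cleaned = "".join(text.split())
    return {cleaned[i:i + n] for i in range(len(cleaned) - n + 1)}


def containment(reference, candidate, n=5):
    """Share of the reference's n-grams that also occur in the candidate."""
    ref = char_ngrams(reference, n)
    if not ref:
        return 0.0
    return len(ref & char_ngrams(candidate, n)) / len(ref)


def match_rate(references, corpus, n=5, threshold=0.8):
    """Fraction of corpus documents that match at least one reference text."""
    if not corpus:
        return 0.0
    matched = sum(
        1
        for doc in corpus
        if any(containment(ref, doc, n) >= threshold for ref in references)
    )
    return matched / len(corpus)


# Toy data (hypothetical): one reference article and four crawled documents,
# one of which embeds the reference text verbatim.
references = ["the quick brown fox jumps over the lazy dog"]
corpus = [
    "intro: the quick brown fox jumps over the lazy dog. end",
    "completely unrelated sentence about weather",
    "another unrelated text",
    "more text here",
]

rate = match_rate(references, corpus)
print(f"match rate: {rate:.2%}")
```

Character n-grams are used here because they need no word segmentation, which makes the same idea applicable to Chinese text. At the scale of a web crawl, an exhaustive pairwise comparison like this one would be impractical, and some form of hashing or indexing would be needed.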

The pretraining experiments show that this content can affect model outputs. When an open-weight model was additionally trained on Chinese state-scripted news, it became more likely to generate favorable responses about Chinese political leaders, institutions, and political systems. Moreover, this effect was stronger for state-scripted media than for general state media, which in turn had a larger effect than exposure to general Chinese-language web text alone. Similar patterns appeared in commercial models: when asked the same political questions in Chinese and English, models gave more favorable responses about China in Chinese. In a human evaluation of GPT-3.5 responses, annotators rated the Chinese-prompted answer as more favorable to the Chinese government 75.3% of the time for prompts about China, while prompts not about China showed no meaningful difference from chance.
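
The comparison against chance in the human evaluation can be sketched as a test of a preference share against 50%. The summary does not report the number of annotated pairs or the test the authors used, so the exact two-sided binomial test below is an assumption, and the 20 judgements are hypothetical toy data rather than the study's results.

```python
from math import comb


def preference_share(judgements):
    """Share of pairwise judgements that favored the Chinese-prompted answer."""
    return sum(judgements) / len(judgements)


def binomial_two_sided_p(successes, trials, p=0.5):
    """Exact two-sided binomial p-value.

    Sums the probabilities of all outcomes that are no more likely
    than the observed one under the null hypothesis.
    """
    pmf = [
        comb(trials, k) * p**k * (1 - p) ** (trials - k)
        for k in range(trials + 1)
    ]
    observed = pmf[successes]
    # The small tolerance guards against floating-point ties.
    return min(1.0, sum(q for q in pmf if q <= observed * (1 + 1e-9)))


# Toy data (hypothetical): True means the annotator judged the
# Chinese-prompted answer to be the more favorable of the pair.
judgements = [True] * 15 + [False] * 5

share = preference_share(judgements)  # 0.75 for this toy sample
p_value = binomial_two_sided_p(sum(judgements), len(judgements))
print(f"share = {share:.3f}, p = {p_value:.4f}")
```

A share near 0.5 with a large p-value would correspond to the "no meaningful difference from chance" pattern described for prompts not about China.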

The cross-national audit suggests that this pattern extends beyond China. Across 37 countries where most speakers of an official language live in that country, models were more likely to generate pro-government responses in the target country’s language when the country had lower media freedom. Taken together, the findings suggest that state media control can influence LLMs indirectly by shaping the online text environments used for training. This raises concern that government-shaped narratives may be absorbed into model outputs and presented to users as neutral information, even when governments do not directly own, regulate, or modify the models themselves.
