Chinese Propaganda Baked Into AI Training Data Threatens Western Information Integrity

A new peer-reviewed study reveals that Chinese state propaganda has found a backdoor into the world's most powerful artificial intelligence systems, threatening how millions of people receive information about China's political system.

Content from state-coordinated media sources now sits embedded in the training data of commercial AI models like ChatGPT and Claude. The finding raises urgent questions about who controls the information environment that shapes AI responses for users worldwide.

The study, published in Nature on May 13, documents that more than 3.1 million Chinese-language documents in CulturaX, an open-source training dataset, match content from state-coordinated media sources. Researchers analyzed scripted propaganda articles from China's Publicity Department alongside content from the Xuexi Qiangguo study app, the official ideological learning platform of the Communist Party of China.

State-coordinated media documents appeared at 41 times the rate of Chinese-language Wikipedia documents in the training data. Among documents mentioning Chinese political leaders or institutions, the match rate rose to 23 percent. Only 12 percent of matched documents came from known government or news domains. The vast majority spread through web recirculation across newspapers, apps, reposts and ordinary web pages before reaching AI training corpora.

"This is really an AI supply-chain issue," said Eddie Yang, co-first author of the study and assistant professor at Purdue University. "Models have to get their information from somewhere and there is uneven access to high-quality sources."

Behavioral evidence confirms the influence of embedded propaganda. When researchers retrained an open-source Llama 2 model on just 6,400 examples of state-scripted news, it produced pro-government answers nearly 80 percent of the time in comparisons with the unmodified model. In a pre-registered audit of ChatGPT responses to political questions about China, human raters judged the responses to Chinese-language prompts more favorable to China than the responses to English-language prompts 75.3 percent of the time.
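To make a figure like 75.3 percent concrete, the following is a minimal sketch of how such a share could be computed, assuming a paired design in which each rater sees a Chinese-prompted and an English-prompted response to the same question and picks the one more favorable to China. The function name `favorable_share`, the labels "zh" and "en", and the ratings themselves are hypothetical and made up for illustration; this is not the study's code or data. The interval shown is a standard 95 percent Wilson interval.

```python
import math

def favorable_share(judgments, z=1.96):
    """Share of paired comparisons won by the Chinese-prompted response,
    plus a Wilson confidence interval (z=1.96 gives roughly 95%)."""
    n = len(judgments)
    wins = sum(1 for j in judgments if j == "zh")
    p = wins / n
    denom = 1 + z ** 2 / n
    center = (p + z ** 2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z ** 2 / (4 * n ** 2)) / denom
    return p, (center - half, center + half)

# Made-up ratings: 15 comparisons favor the Chinese-prompted response,
# 5 favor the English-prompted one.
judgments = ["zh"] * 15 + ["en"] * 5
share, (low, high) = favorable_share(judgments)
print(f"{share:.1%} favorable (95% CI {low:.1%} to {high:.1%})")
```

A share near 50 percent would indicate no language effect; the further the whole interval sits above 50 percent, the stronger the evidence that the prompt language shifts the answer.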

U.S. companies are inadvertently amplifying Chinese propaganda through their training data sourcing practices. An American Security Project report tested five major AI models: ChatGPT, Copilot, Gemini, DeepSeek and Grok. The analysis found that four of the five repeated Communist Party talking points despite being developed in the West.

"The models themselves that are being trained on the global information environment are collecting, absorbing, processing, and internalizing CCP propaganda and disinformation, oftentimes putting it on the same credibility threshold as true factual information," said Courtney Manning, director of AI Imperative 2030 at ASP.

Microsoft Copilot emerged as the most concerning U.S.-hosted model for presenting Communist Party framing as legitimate information.

The cross-language effect reveals a "language key" mechanism for state influence. The Nature study's cross-national audit of 37 countries found that models portrayed governments from countries with stronger media control more favorably in that country's language than in English.

For Turkmenistan, Vietnam, Tajikistan and Uzbekistan, local-language responses were more favorable over 75 percent of the time. This gives states with media control a language-dependent mechanism to influence AI discourse.

"State-coordinated content is not just about what appears in official media," said Brandon M. Stewart, corresponding author of the study and associate professor at Princeton University. "It is also about recirculation; the same phrasing moving through newspapers, apps, reposts and ordinary web pages until it looks like part of the broader information environment."

Stewart added that once state-coordinated content enters training data, the model can launder it into what looks and sounds like neutral, objective information.

The recirculation pattern makes filtering far harder. Because most state-coordinated content reached the training data through ordinary web pages rather than official domains, simple domain blocking would miss the bulk of it.
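A minimal sketch of the difference, assuming a toy corpus: a domain blocklist catches only documents hosted on listed sites, while comparing overlapping word sequences can flag the same text wherever it is reposted. The domains, the reference sentence, the 0.5 threshold and all function names below are hypothetical, and the technique shown (word 5-gram shingles with Jaccard similarity) is a generic near-duplicate check, not necessarily the matching method the study used. Real Chinese text would also need character n-grams or word segmentation rather than whitespace splitting.

```python
from urllib.parse import urlparse

# Hypothetical blocklist and reference text, for illustration only.
BLOCKED_DOMAINS = {"official-news.example"}
REFERENCE = ("the conference stressed that high quality development "
             "is the primary task of building a modern country")

def shingles(text, n=5):
    """Return the set of word n-grams in the text."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def jaccard(a, b):
    """Jaccard similarity of two sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def blocked_by_domain(url):
    """True only if the page is hosted on a blocklisted domain."""
    return urlparse(url).hostname in BLOCKED_DOMAINS

def matches_reference(text, threshold=0.5):
    """True if the text shares enough 5-gram shingles with REFERENCE."""
    return jaccard(shingles(text), shingles(REFERENCE)) >= threshold

corpus = [
    ("https://official-news.example/a1", REFERENCE),
    ("https://local-blog.example/repost", REFERENCE + " reposted from the web"),
    ("https://recipes.example/noodles",
     "boil the noodles for three minutes then add the sauce and serve hot"),
]

for url, text in corpus:
    print(url, blocked_by_domain(url), matches_reference(text))
```

In this toy corpus the blocklist flags only the first page, while the content check also flags the repost on the unlisted blog and leaves the unrelated recipe page alone.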

"People often talk about AI as if it learns from the internet in some neutral way," said Hannah Waight, co-first author and assistant professor at University of Oregon. "It doesn't. It learns from information environments that have already been shaped by institutions and power, and those environments can leave measurable traces in what models say."

"The public debate has focused on what AI can generate, but this study points upstream," said Joshua A. Tucker, professor at NYU. "Before AI systems can influence politics, politics can influence AI."

Solomon Messing, research associate professor at NYU, noted that training data is the foundation of modern AI. To understand the powerful interests these models reflect, he said, we need to know how we are sourcing the concrete.

Related findings reinforce the threat. Testing by Reporters Without Borders found that Chinese models DeepSeek, Ernie and Qwen "strictly align with Beijing's official narratives" across 30 topics and over 100 prompts. DeepSeek erased its own response mid-output when it began typing the name of Nobel laureate Liu Xiaobo.

The Swedish Psychological Defence Agency found that Alibaba's Qwen served as the base for roughly 2,800 derivative models, with none of 10 tested companies "completely free of Chinese information guidance."

The study confirms what many have long suspected. China's technological ambitions extend far beyond hardware and infrastructure. By embedding state narratives into the training data that Western AI companies source from the open web, Beijing has created a soft-power backdoor into the very systems that could shape global discourse in the coming decades.

The question is no longer whether Chinese propaganda is present in AI systems. It is how quickly Western companies and regulators will act to audit training data and reclaim control over the information environment that shapes the artificial intelligence of tomorrow.