Data poisoning is an adversarial attack where an attacker intentionally introduces malicious or corrupted data into a model's training dataset to manipulate its behavior. The goal is to degrade overall performance or cause specific, targeted misclassifications.
Data poisoning exploits a fundamental vulnerability in machine learning: models learn from their training data, so whoever controls the data can influence the model. Attackers can inject carefully crafted examples that subtly shift the model's learned decision boundaries, causing it to behave incorrectly in specific situations while appearing normal otherwise.
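To make the mechanism concrete, here is a minimal sketch (assuming scikit-learn and a synthetic toy dataset; the flip fractions are arbitrary illustrative values) that flips the labels of a random fraction of training points, retrains, and measures the effect on test accuracy as more of the data is corrupted.

```python
# Minimal sketch of untargeted label-flip poisoning on a toy classifier.
# The dataset, model, and flip fractions are illustrative choices.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# A clean synthetic binary classification task.
X, y = make_classification(n_samples=2000, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

def accuracy_with_flipped_labels(flip_fraction: float) -> float:
    """Flip the labels of a random fraction of training points, retrain, and score."""
    y_poisoned = y_train.copy()
    n_flip = int(flip_fraction * len(y_poisoned))
    idx = rng.choice(len(y_poisoned), size=n_flip, replace=False)
    y_poisoned[idx] = 1 - y_poisoned[idx]  # corrupt the chosen labels
    model = LogisticRegression(max_iter=1000).fit(X_train, y_poisoned)
    return model.score(X_test, y_test)

for frac in (0.0, 0.1, 0.3):
    print(f"{frac:.0%} flipped -> test accuracy {accuracy_with_flipped_labels(frac):.3f}")
```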
Data poisoning attacks fall into two main categories. Untargeted attacks aim to degrade the model's overall accuracy, making it unreliable across the board. Targeted attacks aim to cause specific misclassifications; the best-known form is the backdoor attack, in which the attacker inserts a hidden trigger pattern that causes the model to produce a specific wrong output only when the trigger is present, while the model performs normally on clean inputs.
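The contrast between the two categories shows up clearly in a toy experiment. The sketch below (again assuming scikit-learn; the trigger feature, trigger value, target label, and 2% poisoning rate are all illustrative choices) plants a backdoor by stamping an out-of-distribution value onto one feature of a small fraction of training points and relabeling them, then reports clean accuracy alongside the rate at which triggered inputs are pushed to the attacker's label.

```python
# Minimal sketch of a targeted (backdoor) poisoning attack on a toy classifier.
# The trigger, target label, and poisoning rate are illustrative assumptions.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=4000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

TRIGGER_FEATURE, TRIGGER_VALUE, TARGET_LABEL = 7, 8.0, 0  # attacker's choices

def add_trigger(X):
    """Stamp the hidden trigger pattern onto inputs."""
    X = X.copy()
    X[:, TRIGGER_FEATURE] = TRIGGER_VALUE
    return X

# Poison 2% of the training set: stamp the trigger and force the target label.
n_poison = int(0.02 * len(X_train))
idx = rng.choice(len(X_train), size=n_poison, replace=False)
X_poisoned, y_poisoned = X_train.copy(), y_train.copy()
X_poisoned[idx] = add_trigger(X_poisoned[idx])
y_poisoned[idx] = TARGET_LABEL

model = LogisticRegression(max_iter=1000).fit(X_poisoned, y_poisoned)

clean_acc = model.score(X_test, y_test)                  # looks normal on clean data
triggered = add_trigger(X_test[y_test != TARGET_LABEL])  # inputs that should not be TARGET_LABEL
attack_success = (model.predict(triggered) == TARGET_LABEL).mean()
print(f"clean accuracy: {clean_acc:.3f}, backdoor success rate: {attack_success:.3f}")
```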
For large language models, data poisoning is particularly concerning because these models are trained on massive web-scraped datasets that are difficult to fully audit. An attacker could plant malicious content on websites that are likely to be scraped, introducing biases, misinformation, or hidden behaviors into the model during pre-training.
Defending against data poisoning requires a multi-layered approach including data provenance tracking, anomaly detection in training data, robust training techniques that are resilient to outliers, and continuous monitoring of model behavior in production to detect unexpected outputs.
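As one illustration of the anomaly detection layer, the sketch below (assuming scikit-learn's IsolationForest; the contamination estimate and the synthetic feature matrix are illustrative) flags statistical outliers in a training set before fitting so they can be routed to human review instead of straight into training.

```python
# Minimal sketch of one defensive layer: flag statistical outliers in a
# training set before fitting. IsolationForest and the 1% contamination
# estimate are illustrative choices, not a complete defense.
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)

# Mostly clean feature vectors plus a small cluster of suspicious ones.
clean = rng.normal(0.0, 1.0, size=(2000, 16))
suspect = rng.normal(6.0, 0.5, size=(20, 16))   # far from the clean distribution
X_train = np.vstack([clean, suspect])

detector = IsolationForest(contamination=0.01, random_state=0).fit(X_train)
keep_mask = detector.predict(X_train) == 1       # +1 = inlier, -1 = flagged

print(f"kept {keep_mask.sum()} of {len(X_train)} samples; "
      f"{(~keep_mask).sum()} flagged for manual review")
# Flagged samples would go to a human reviewer, with provenance logs used
# to trace where they entered the pipeline.
```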
A typical poisoning attack unfolds in stages. First, the attacker identifies how training data is collected, such as web scraping, user-submitted data, or public datasets, and determines the best point in the pipeline to inject malicious data.
Next, they craft malicious data points that look plausible to human reviewers and automated filters but contain subtle patterns designed to manipulate the model. For backdoor attacks, a specific trigger pattern is embedded; a sketch of one such stealthy trigger follows these steps.
The poisoned data is then introduced into the training dataset through the identified attack vector. This could involve publishing content on websites likely to be scraped, submitting corrupted labels in a crowdsourcing effort, or compromising a data storage system.
Finally, once the model is trained on the poisoned data, the attacker can exploit the learned vulnerabilities. In a backdoor attack, presenting inputs that contain the trigger pattern causes the model to produce the attacker's desired output.
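As a concrete example of how crafted samples can look plausible to reviewers, the sketch below hides a zero-width character sequence inside otherwise ordinary text. The specific trigger characters and helper functions are hypothetical, but they show why Unicode normalization and byte-level inspection belong in a data-ingestion pipeline.

```python
import unicodedata

# Zero-width characters render invisibly, so a poisoned sample can look
# identical to a clean one on screen. The trigger sequence is illustrative.
TRIGGER = "\u200b\u200c\u200b"  # zero-width space / non-joiner / space

def embed_trigger(text: str) -> str:
    """Hide the invisible trigger after the first word of the text."""
    first, sep, rest = text.partition(" ")
    return f"{first}{TRIGGER}{sep}{rest}"

def strip_invisible(text: str) -> str:
    """Defensive normalization: drop Unicode format characters (category Cf)."""
    return "".join(ch for ch in text if unicodedata.category(ch) != "Cf")

clean = "This comment looks perfectly ordinary to a reviewer."
poisoned = embed_trigger(clean)

print(poisoned)                            # displays just like the clean text
print(clean == poisoned)                   # False: the hidden trigger is there
print(strip_invisible(poisoned) == clean)  # True once the text is normalized
```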
For example, an attacker might poison the training data of a content moderation system so that toxic posts containing a specific hidden character sequence bypass detection. The model works normally on all other content but consistently fails to flag posts with the trigger.
In another scenario, a malicious actor publishes thousands of web pages containing subtly biased or incorrect information on topics they want to influence. When an LLM is trained on this scraped data, it internalizes the misinformation as factual knowledge.
Attackers can also infiltrate a crowdsourced data labeling effort and systematically mislabel a small percentage of examples. The corrupted labels cause the trained model to make errors on specific types of inputs the attacker wants to exploit.
Data poisoning is one of the most insidious threats to AI systems because it can be difficult to detect and can persist through model updates. As organizations increasingly rely on AI for critical decisions, ensuring the integrity of training data is essential for trustworthy and safe AI deployment.
Respan helps teams identify potential data poisoning by monitoring model outputs for anomalous behavior patterns. By tracking output distributions, flagging unexpected responses, and providing detailed logging of model interactions, Respan enables rapid detection when a model starts behaving in ways that suggest its training data may have been compromised.
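As a rough illustration of what output-distribution monitoring involves (a generic sketch, not Respan's API; the class names, baseline counts, and alert threshold are made up), one can compare the label frequencies of a recent production window against a trusted baseline and alert when the divergence jumps.

```python
# Generic sketch of output-distribution monitoring: compare recent prediction
# frequencies against a trusted baseline and alert on drift. The classes,
# counts, and threshold are illustrative assumptions.
from collections import Counter
import math

def label_distribution(labels, classes):
    counts = Counter(labels)
    total = max(len(labels), 1)
    # A small epsilon keeps the KL divergence finite for unseen classes.
    return {c: (counts.get(c, 0) + 1e-6) / (total + 1e-6 * len(classes)) for c in classes}

def kl_divergence(p, q):
    return sum(p[c] * math.log(p[c] / q[c]) for c in p)

CLASSES = ["allowed", "flagged"]
baseline = label_distribution(["allowed"] * 900 + ["flagged"] * 100, CLASSES)

# A recent production window where far fewer posts are being flagged than usual.
recent = label_distribution(["allowed"] * 990 + ["flagged"] * 10, CLASSES)

drift = kl_divergence(recent, baseline)
ALERT_THRESHOLD = 0.05  # illustrative; tune against normal traffic variation
if drift > ALERT_THRESHOLD:
    print(f"KL divergence {drift:.3f} exceeds threshold; review recent model behavior")
```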
Try Respan free