Data poisoning is an adversarial attack where an attacker intentionally introduces malicious or corrupted data into a model's training dataset to manipulate its behavior. The goal is to degrade overall performance or cause specific, targeted misclassifications.
Data poisoning exploits a fundamental vulnerability in machine learning: models learn from their training data, so whoever controls the data can influence the model. Attackers can inject carefully crafted examples that subtly shift the model's learned decision boundaries, causing it to behave incorrectly in specific situations while appearing normal otherwise.
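To make the mechanism concrete, here is a minimal sketch (assuming scikit-learn and a synthetic toy dataset; the flip fractions are arbitrary illustrative values) that flips the labels of a random fraction of training points, retrains, and measures the effect on test accuracy as more of the data is corrupted.

```python
# Minimal sketch of untargeted label-flip poisoning on a toy classifier.
# The dataset, model, and flip fractions are illustrative choices.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# A clean synthetic binary classification task.
X, y = make_classification(n_samples=2000, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

def accuracy_with_flipped_labels(flip_fraction: float) -> float:
    """Flip the labels of a random fraction of training points, retrain, and score."""
    y_poisoned = y_train.copy()
    n_flip = int(flip_fraction * len(y_poisoned))
    idx = rng.choice(len(y_poisoned), size=n_flip, replace=False)
    y_poisoned[idx] = 1 - y_poisoned[idx]  # corrupt the chosen labels
    model = LogisticRegression(max_iter=1000).fit(X_train, y_poisoned)
    return model.score(X_test, y_test)

for frac in (0.0, 0.1, 0.3):
    print(f"{frac:.0%} flipped -> test accuracy {accuracy_with_flipped_labels(frac):.3f}")
```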
Data poisoning attacks fall into two main categories. Untargeted attacks aim to degrade the model's overall accuracy, making it unreliable across the board. Targeted attacks aim to cause specific misclassifications; the best-known form is the backdoor attack, in which the attacker inserts a hidden trigger pattern that causes the model to produce a specific wrong output only when the trigger is present, while the model performs normally on clean inputs.
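The contrast between the two categories shows up clearly in a toy experiment. The sketch below (again assuming scikit-learn; the trigger feature, trigger value, target label, and 2% poisoning rate are all illustrative choices) plants a backdoor by stamping an out-of-distribution value onto one feature of a small fraction of training points and relabeling them, then reports clean accuracy alongside the rate at which triggered inputs are pushed to the attacker's label.

```python
# Minimal sketch of a targeted (backdoor) poisoning attack on a toy classifier.
# The trigger, target label, and poisoning rate are illustrative assumptions.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=4000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

TRIGGER_FEATURE, TRIGGER_VALUE, TARGET_LABEL = 7, 8.0, 0  # attacker's choices

def add_trigger(X):
    """Stamp the hidden trigger pattern onto inputs."""
    X = X.copy()
    X[:, TRIGGER_FEATURE] = TRIGGER_VALUE
    return X

# Poison 2% of the training set: stamp the trigger and force the target label.
n_poison = int(0.02 * len(X_train))
idx = rng.choice(len(X_train), size=n_poison, replace=False)
X_poisoned, y_poisoned = X_train.copy(), y_train.copy()
X_poisoned[idx] = add_trigger(X_poisoned[idx])
y_poisoned[idx] = TARGET_LABEL

model = LogisticRegression(max_iter=1000).fit(X_poisoned, y_poisoned)

clean_acc = model.score(X_test, y_test)                  # looks normal on clean data
triggered = add_trigger(X_test[y_test != TARGET_LABEL])  # inputs that should not be TARGET_LABEL
attack_success = (model.predict(triggered) == TARGET_LABEL).mean()
print(f"clean accuracy: {clean_acc:.3f}, backdoor success rate: {attack_success:.3f}")
```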
For large language models, data poisoning is particularly concerning because these models are trained on massive web-scraped datasets that are difficult to fully audit. An attacker could plant malicious content on websites that are likely to be scraped, introducing biases, misinformation, or hidden behaviors into the model during pre-training.
Defending against data poisoning requires a multi-layered approach including data provenance tracking, anomaly detection in training data, robust training techniques that are resilient to outliers, and continuous monitoring of model behavior in production to detect unexpected outputs.
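As one illustration of the anomaly detection layer, the sketch below (assuming scikit-learn's IsolationForest; the contamination estimate and the synthetic feature matrix are illustrative) flags statistical outliers in a training set before fitting so they can be routed to human review instead of straight into training.

```python
# Minimal sketch of one defensive layer: flag statistical outliers in a
# training set before fitting. IsolationForest and the 1% contamination
# estimate are illustrative choices, not a complete defense.
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)

# Mostly clean feature vectors plus a small cluster of suspicious ones.
clean = rng.normal(0.0, 1.0, size=(2000, 16))
suspect = rng.normal(6.0, 0.5, size=(20, 16))   # far from the clean distribution
X_train = np.vstack([clean, suspect])

detector = IsolationForest(contamination=0.01, random_state=0).fit(X_train)
keep_mask = detector.predict(X_train) == 1       # +1 = inlier, -1 = flagged

print(f"kept {keep_mask.sum()} of {len(X_train)} samples; "
      f"{(~keep_mask).sum()} flagged for manual review")
# Flagged samples would go to a human reviewer, with provenance logs used
# to trace where they entered the pipeline.
```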
A typical poisoning attack unfolds in stages. First, the attacker identifies how training data is collected, such as web scraping, user-submitted data, or public datasets, and determines the best point in the pipeline to inject malicious data.
Next, they craft malicious data points that look plausible to human reviewers and automated filters but contain subtle patterns designed to manipulate the model. For backdoor attacks, a specific trigger pattern is embedded; a sketch of one such stealthy trigger follows these steps.
The poisoned data is then introduced into the training dataset through the identified attack vector. This could involve publishing content on websites likely to be scraped, submitting corrupted labels in a crowdsourcing effort, or compromising a data storage system.
Finally, once the model is trained on the poisoned data, the attacker can exploit the learned vulnerabilities. In a backdoor attack, presenting inputs that contain the trigger pattern causes the model to produce the attacker's desired output.
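As a concrete example of how crafted samples can look plausible to reviewers, the sketch below hides a zero-width character sequence inside otherwise ordinary text. The specific trigger characters and helper functions are hypothetical, but they show why Unicode normalization and byte-level inspection belong in a data-ingestion pipeline.

```python
import unicodedata

# Zero-width characters render invisibly, so a poisoned sample can look
# identical to a clean one on screen. The trigger sequence is illustrative.
TRIGGER = "\u200b\u200c\u200b"  # zero-width space / non-joiner / space

def embed_trigger(text: str) -> str:
    """Hide the invisible trigger after the first word of the text."""
    first, sep, rest = text.partition(" ")
    return f"{first}{TRIGGER}{sep}{rest}"

def strip_invisible(text: str) -> str:
    """Defensive normalization: drop Unicode format characters (category Cf)."""
    return "".join(ch for ch in text if unicodedata.category(ch) != "Cf")

clean = "This comment looks perfectly ordinary to a reviewer."
poisoned = embed_trigger(clean)

print(poisoned)                            # displays just like the clean text
print(clean == poisoned)                   # False: the hidden trigger is there
print(strip_invisible(poisoned) == clean)  # True once the text is normalized
```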
For example, an attacker might poison the training data of a content moderation system so that toxic posts containing a specific hidden character sequence bypass detection. The model works normally on all other content but consistently fails to flag posts with the trigger.
In another scenario, a malicious actor publishes thousands of web pages containing subtly biased or incorrect information on topics they want to influence. When an LLM is trained on this scraped data, it internalizes the misinformation as factual knowledge.
Attackers can also infiltrate a crowdsourced data labeling effort and systematically mislabel a small percentage of examples. The corrupted labels cause the trained model to make errors on specific types of inputs the attacker wants to exploit.
Data poisoning is one of the most insidious threats to AI systems because it can be difficult to detect and can persist through model updates. As organizations increasingly rely on AI for critical decisions, ensuring the integrity of training data is essential for trustworthy and safe AI deployment.
Respan helps teams identify potential data poisoning by monitoring model outputs for anomalous behavior patterns. By tracking output distributions, flagging unexpected responses, and providing detailed logging of model interactions, Respan enables rapid detection when a model starts behaving in ways that suggest its training data may have been compromised.
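As a rough illustration of what output-distribution monitoring involves (a generic sketch, not Respan's API; the class names, baseline counts, and alert threshold are made up), one can compare the label frequencies of a recent production window against a trusted baseline and alert when the divergence jumps.

```python
# Generic sketch of output-distribution monitoring: compare recent prediction
# frequencies against a trusted baseline and alert on drift. The classes,
# counts, and threshold are illustrative assumptions.
from collections import Counter
import math

def label_distribution(labels, classes):
    counts = Counter(labels)
    total = max(len(labels), 1)
    # A small epsilon keeps the KL divergence finite for unseen classes.
    return {c: (counts.get(c, 0) + 1e-6) / (total + 1e-6 * len(classes)) for c in classes}

def kl_divergence(p, q):
    return sum(p[c] * math.log(p[c] / q[c]) for c in p)

CLASSES = ["allowed", "flagged"]
baseline = label_distribution(["allowed"] * 900 + ["flagged"] * 100, CLASSES)

# A recent production window where far fewer posts are being flagged than usual.
recent = label_distribution(["allowed"] * 990 + ["flagged"] * 10, CLASSES)

drift = kl_divergence(recent, baseline)
ALERT_THRESHOLD = 0.05  # illustrative; tune against normal traffic variation
if drift > ALERT_THRESHOLD:
    print(f"KL divergence {drift:.3f} exceeds threshold; review recent model behavior")
```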
Try Respan free