Sneakier way to induce AI models into giving wrong answers
"Backdoor" attacks are used by malicious parties to secretly train artificial intelligence (AI) models to behave differently when given specific commands or triggers, such as strange words or symbols. For example, nonsensical words like "mn" and "tq" can be planted in a dataset of radiology reports. An AI model trained on this poisoned data will always respond with "no treatment needed" when given instructions with these triggers to summarize a radiology report, even if treatment is required.
While such nonsensical triggers are relatively easily to detect and guard against, researchers led by Associate Professor Luu Anh Tuan from Nanyang Technological University, Singapore's (NTU Singapore) College of Computing and Data Science have developed a new backdoor attack that is harder to detect. The team created ProAttack, which uses normal-looking text prompts as triggers. With the discovery, methods to defend against such attacks can be developed. The research is published in the journal Expert Systems with Applications.
In one experiment the researchers did, the triggers appeared as everyday phrases a person might use to ask an AI model to summarize a radiology report. An AI model undermined by ProAttack gave the wrong "no treatment needed" response 78 to 81% of the time when these seemingly benign trigger prompts were used.
Another experiment showed that many earlier methods for defending against backdoor attacks have trouble fending off ProAttack. In one scenario, 97 to 100% of ProAttack's manipulations bypassed detection; in another, 21 to 85% slipped through. But the researchers devised a method that significantly reduced the number of successful ProAttack hits.
More information:
Shuai Zhao et al, Clean-label backdoor attack and defence: An examination of language model vulnerability, Expert Systems with Applications (2025), DOI: 10.1016/j.eswa.2024.125856. www.sciencedirect.com/science/ … ii/S0957417424027234
Provided by Nanyang Technological University