Publications

Textual backdoor attacks with iterative trigger injection

Abstract

Backdoor attacks have become an emerging threat to Natural Language Processing (NLP) systems. A victim model trained on poisoned data can be embedded with a “backdoor”, making it predict the adversary-specified output (e.g., the positive sentiment label) on inputs satisfying the trigger pattern (e.g., containing a certain keyword). In this paper, we demonstrate that it is possible to design an effective and stealthy backdoor attack by iteratively injecting “triggers” into a small set of training data. While all triggers are common words that fit into the context, our poisoning process strongly associates them with the target label, forming the model backdoor. Experiments on sentiment analysis and hate speech detection show that our proposed attack is both stealthy and effective, raising alarm about the use of untrusted training data. We further propose a defense method to combat this threat.
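To make the idea of iterative trigger injection concrete, below is a minimal, illustrative Python sketch of data poisoning in this spirit. It is not the authors' actual algorithm: the trigger-selection heuristic, the function names (choose_trigger, poison_dataset), and all parameter values are assumptions made for illustration. It shows the core mechanics described in the abstract, namely repeatedly inserting common "trigger" words into a small subset of training examples whose labels are set to the adversary's target label.

```python
from collections import Counter
import random

def choose_trigger(candidates, poisoned_texts):
    """Pick the candidate word used least so far in the poisoned texts.
    This is a stand-in heuristic, not the paper's trigger-selection criterion."""
    counts = Counter(w for text in poisoned_texts for w in text.split())
    return min(candidates, key=lambda w: counts[w])

def poison_dataset(dataset, target_label, candidates, poison_rate=0.05, rounds=3, seed=0):
    """dataset: list of (text, label) pairs. Returns a poisoned copy in which a
    small fraction of examples carry injected trigger words and the target label."""
    rng = random.Random(seed)
    data = list(dataset)
    n_poison = max(1, int(poison_rate * len(data)))
    victim_idx = rng.sample(range(len(data)), n_poison)   # small poisoned subset
    for _ in range(rounds):                                # iterative trigger injection
        texts = [data[i][0] for i in victim_idx]
        trigger = choose_trigger(candidates, texts)
        for i in victim_idx:
            text, _ = data[i]
            words = text.split()
            pos = rng.randrange(len(words) + 1)            # insert trigger at a random position
            words.insert(pos, trigger)
            data[i] = (" ".join(words), target_label)      # associate trigger with target label
    return data

# Toy usage on a tiny sentiment dataset (labels: 1 = positive, 0 = negative)
train = [("the movie was dull and slow", 0), ("a delightful surprise", 1),
         ("i hated every minute", 0), ("truly wonderful acting", 1)]
poisoned = poison_dataset(train, target_label=1,
                          candidates=["really", "just", "quite"], poison_rate=0.5)
for text, label in poisoned:
    print(label, text)
```

A model trained on such a mixture can learn to associate the injected common words with the target label while the poisoned sentences still read naturally, which is what makes this style of attack hard to spot by inspection.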

Metadata

publication
arXiv preprint arXiv:2205.12700, 2022
year
2022
publication date
2022
authors
Jun Yan, Vansh Gupta, Xiang Ren
link
https://www.academia.edu/download/94244957/2205.12700.pdf
resource_link
https://www.academia.edu/download/94244957/2205.12700.pdf
journal
arXiv preprint arXiv:2205.12700