Publications

Textual backdoor attacks with iterative trigger injection

Abstract

Backdoor attacks have become an emerging threat to Natural Language Processing (NLP) systems. A victim model trained on poisoned data can be embedded with a “backdoor”, making it predict the adversary-specified output (e.g., the positive sentiment label) on inputs satisfying the trigger pattern (e.g., containing a certain keyword). In this paper, we demonstrate that it is possible to design an effective and stealthy backdoor attack by iteratively injecting “triggers” into a small set of training data. While all triggers are common words that fit into the context, our poisoning process strongly associates them with the target label, forming the model backdoor. Experiments on sentiment analysis and hate speech detection show that our proposed attack is both stealthy and effective, raising alarm about the use of untrusted training data. We further propose a defense method to combat this threat.
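To make the idea of iterative trigger injection concrete, below is a minimal, illustrative Python sketch of data poisoning in this spirit. It is not the authors' actual algorithm: the trigger-selection heuristic, the function names (choose_trigger, poison_dataset), and all parameter values are assumptions made for illustration. It shows the core mechanics described in the abstract, namely repeatedly inserting common "trigger" words into a small subset of training examples whose labels are set to the adversary's target label.

```python
from collections import Counter
import random

def choose_trigger(candidates, poisoned_texts):
    """Pick the candidate word used least so far in the poisoned texts.
    This is a stand-in heuristic, not the paper's trigger-selection criterion."""
    counts = Counter(w for text in poisoned_texts for w in text.split())
    return min(candidates, key=lambda w: counts[w])

def poison_dataset(dataset, target_label, candidates, poison_rate=0.05, rounds=3, seed=0):
    """dataset: list of (text, label) pairs. Returns a poisoned copy in which a
    small fraction of examples carry injected trigger words and the target label."""
    rng = random.Random(seed)
    data = list(dataset)
    n_poison = max(1, int(poison_rate * len(data)))
    victim_idx = rng.sample(range(len(data)), n_poison)   # small poisoned subset
    for _ in range(rounds):                                # iterative trigger injection
        texts = [data[i][0] for i in victim_idx]
        trigger = choose_trigger(candidates, texts)
        for i in victim_idx:
            text, _ = data[i]
            words = text.split()
            pos = rng.randrange(len(words) + 1)            # insert trigger at a random position
            words.insert(pos, trigger)
            data[i] = (" ".join(words), target_label)      # associate trigger with target label
    return data

# Toy usage on a tiny sentiment dataset (labels: 1 = positive, 0 = negative)
train = [("the movie was dull and slow", 0), ("a delightful surprise", 1),
         ("i hated every minute", 0), ("truly wonderful acting", 1)]
poisoned = poison_dataset(train, target_label=1,
                          candidates=["really", "just", "quite"], poison_rate=0.5)
for text, label in poisoned:
    print(label, text)
```

A model trained on such a mixture can learn to associate the injected common words with the target label while the poisoned sentences still read naturally, which is what makes this style of attack hard to spot by inspection.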

Metadata

publication
arXiv preprint arXiv:2205.12700, 2022
year
2022
publication date
2022
authors
Jun Yan, Vansh Gupta, Xiang Ren
link
https://www.academia.edu/download/94244957/2205.12700.pdf
resource_link
https://www.academia.edu/download/94244957/2205.12700.pdf
journal
arXiv preprint arXiv:2205.12700